Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On 17-04-29 06:04 PM, Alexei Starovoitov wrote:
> On 4/28/17 3:58 AM, Jesper Dangaard Brouer wrote:
>> On Thu, 27 Apr 2017 16:31:14 -0700 Alexei Starovoitov wrote:
>>> On 4/27/17 1:41 AM, Jesper Dangaard Brouer wrote:
>>>> When registering/attaching an XDP/bpf program, we would just send the
>>>> file-descriptor for this port-map along (like we do with the bpf_prog
>>>> FD), plus its own ingress-port number in the port-map.
>>>>
>>>> It is not clear to me in which data-structure on the kernel side we
>>>> store this reference to the port-map and ingress-port, as today we only
>>>> have the "raw" struct bpf_prog pointer. I see several options:
>>>>
>>>> 1. Create a new xdp_prog struct that contains the existing bpf_prog,
>>>>    a port-map pointer and the ingress-port. (IMHO the easiest solution)
>>>>
>>>> 2. Just create a new pointer to the port-map and store it in the driver
>>>>    rx-ring struct (like the existing bpf_prog), but this creates a
>>>>    race-challenge when replacing (cmpxchg) the program (or perhaps it's
>>>>    not a problem, as it runs under RCU and the RTNL-lock).
>>>>
>>>> 3. Extend bpf_prog to store this port-map and ingress-port, and have a
>>>>    fast way to access it. I assume it would be accessible via
>>>>    bpf_prog->bpf_prog_aux->used_maps[X], but that will be too slow for
>>>>    XDP.
>>>
>>> I'm not sure I completely follow the 3 proposals.
>>> Are you suggesting to have only one netdev_array per program?
>>
>> Yes, but I can see you have a more clever idea below.
>>
>>> Why not allow any number, like we do for tailcall+prog_array, etc.?
>>> We can teach the verifier to allow a new helper
>>>   bpf_tx_port(netdev_array, port_num);
>>> to only be used with the netdev_array map type.
>>> It will fetch the netdevice pointer from netdev_array[port_num]
>>> and will tx the packet into it.
>>
>> I love it.
>>
>> I just don't like the "netdev" part of the name "netdev_array", as one of
>> the basic ideas of a port table is that a port can be anything that can
>> consume an XDP_buff packet. This generalization allows us to move code
>> out of the drivers. We might be on the same page, as I do imagine that
>> netdev_array or port_array is just a struct bpf_map pointer, and the
>> bpf_map->map_type will tell us that this bpf_map contains net_device
>> pointers. Thus, when later introducing a new type of redirect (like to
>> a socket or remote-CPU), we just add a new bpf_map_type for it, without
>> needing to change anything in the drivers, right?
>
> In theory, yes, but in practice I doubt it will be so easy.
> We probably shouldn't allow very different types of netdev
> into the same netdev_array or port_array (whatever the name).
> They need to be similar enough, otherwise we'd have to do run-time
> checks. If they're all the same, these checks can be done at
> insertion time instead.

I think we can just have different map types for each redirect type.
So a netdev_map_array, socket_map_array, etc.

>> Do you imagine that the bpf-side bpf_tx_port() returns XDP_REDIRECT?
>> Or does it return whether the call was successful (e.g. validating that
>> port_num exists in the map)?
>
> don't know :)
> we need to brainstorm pros and cons.
>
>> On the kernel side, we need to receive this info "port_array" and
>> "port_num"; given you don't pass the call an xdp_buff/ctx, I assume you
>> want the per-CPU temp-store solution. Then during the XDP_REDIRECT
>> action we call a core redirect function that, based on the bpf_map_type,
>> does a lookup and finds the net_device ptr.
>
> hmm. didn't think that far either :)
> indeed makes sense to pass 'ctx' into such helper as well,
> so it's easier to deal with original netdev.
>>> We can make it similar to bpf_tail_call(), so that the program will
>>> finish on a successful bpf_tx_port(), or
>>> make it into a 'delayed' tx which will be executed when the program
>>> finishes. Not sure which approach is better.
>>
>> I know you are talking about something slightly different, about
>> delaying TX.
>>
>> But I want to mention (as I've done before) that it is important (for
>> me) that we get bulking working/integrated. I imagine the driver will
>> call a function that delays the TX/redirect action, and at the end
>> of the NAPI cycle a function that flushes the packets, bulked per
>> destination port.
>>
>> I was wondering where to store these delayed TX packets, but now that
>> we have an associated bpf_map data-structure (netdev_array), I'm thinking
>> about storing packets (ordered by port) inside that. And then have a
>> bpf_tx_flush(netdev_array) call in the driver (for every port-table-map
>> seen, which will likely be small).
>
> makes sense to me as well.
> Ideally we should try to make the api such that batching or no-batching
> can be the kernel's choice. The program will just do
>   xdp_tx_port(...something here...)
> and the kernel does what is best for performance. If it needs to delay
> the result to do batching, the api should allow that transparently.
> Like right now an xdp program does 'return XDP_TX;' and on the kernel
> side we can either do an immediate xmit (like we do today) or change it
> to do batching, and programs don't need to change.
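To make the proposed helper concrete, here is a minimal sketch of the BPF
program side. Everything in it is hypothetical: BPF_MAP_TYPE_NETDEV_ARRAY,
bpf_tx_port(), its helper id and its signature are proposals from this
thread, not existing kernel API; the ctx argument reflects the note above
about passing 'ctx' into the helper.

  /* Hypothetical sketch only: BPF_MAP_TYPE_NETDEV_ARRAY and bpf_tx_port()
   * are proposals from this thread, not existing kernel API. */
  #include <linux/bpf.h>
  #include "bpf_helpers.h"    /* samples/bpf style: SEC(), struct bpf_map_def */

  /* hypothetical helper declaration, in the style of bpf_helpers.h */
  static int (*bpf_tx_port)(void *ctx, void *map, __u32 port) =
          (void *) BPF_FUNC_tx_port;      /* BPF_FUNC_tx_port does not exist */

  struct bpf_map_def SEC("maps") tx_ports = {
          .type        = BPF_MAP_TYPE_NETDEV_ARRAY,  /* proposed map type */
          .key_size    = sizeof(__u32),   /* virtual port number */
          .value_size  = sizeof(__u32),   /* ifindex, filled by user space */
          .max_entries = 64,
  };

  SEC("xdp")
  int xdp_fwd(struct xdp_md *ctx)
  {
          __u32 egress_port = 1;  /* chosen by program policy */

          /* Proposed helper: the verifier would only accept a netdev-array
           * map here; the kernel resolves tx_ports[egress_port] to a
           * net_device and transmits, possibly bulked at NAPI-poll end. */
          return bpf_tx_port(ctx, &tx_ports, egress_port);
  }

Whether the helper itself returns XDP_REDIRECT, or a success/failure code
with the program returning the action, is exactly the open question above.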
Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On 4/28/17 12:43 PM, Hannes Frederic Sowa wrote:
> On 28.04.2017 07:30, Alexei Starovoitov wrote:
>> On 4/27/17 10:06 PM, John Fastabend wrote:
>>> That is more or less what I was thinking as well. The other question
>>> I have though is should we have a bpf_redirect() call for the simple
>>> case where I use the ifindex directly. This will be helpful for taking
>>> existing programs from tc_cls into xdp. I think it makes sense to have
>>> bpf_tx_allports(), bpf_tx_port(), and bpf_redirect().
>>
>> I think so too.
>> Once a netdevice is stored into a netdev_array map the netdevice is
>> pinned, and we need to figure out what to do if somebody tries to delete
>> it. Should we add a new netlink notifier that this netdev's refcnt is
>> almost zero and it's only in netdev_array(s)?
>
> We basically do that automatically in netdev_wait_allrefs:
>
>   pr_emerg("unregister_netdevice: waiting for %s to become free. Usage count = %d\n",
>            dev->name, refcnt);
>
> It is a very unpleasant warning, and users probably suspect a bug in the
> kernel at first. I don't think we should wait for user space to clean
> that up; we have to do it automatically from the kernel.
>
> Maybe we can introduce a special value that basically NOPs the
> transmission. The hash table itself would install a netdevice notifier
> and would clean all tables. Could definitely cause some storm in the
> kernel if a lot of keys are mapped to the same interface.
>
>> or should it be deleted from the array(s) automatically and
>> then user space will be notified post-deletion?
>> Both approaches have their pros and cons.
>
> I am leaning more towards deleting it automatically. But walking all
> tables, and in there all keys, might cause some unwanted load spikes.

yeah, when netdev is being unregistered we have to drop the ref
to avoid the warning.

Speaking of delete notifiers for maps. Until recently all deletions were
explicit, and the program could do an extra perf_event_submit() if it
needed to notify user space of a deletion. But right now we have the LRU
map that deletes automatically, and this netdev_array will be deleting
automatically too, so we probably need some sort of generic map notifier
for all map types. It can be as simple as user space sleeping on
read(map_fd, buf); or waiting for epoll on map_fd, and the kernel wakes
it up when a map element is deleted and pushes the 'key/value' pair into
the buf. That should be generic for all map types, but it needs some
mechanism to avoid blocking, since the number of deletions can be huge.
Something similar to perf's 'lost N records' event record should do the
trick.
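For the "delete it automatically" option, the cleanup could hang off the
existing netdevice notifier chain. A rough kernel-side sketch; the
netdev_map_* names and the global list are made up here, only the notifier
registration mechanics are existing kernel API:

  /* Rough sketch of automatic cleanup on NETDEV_UNREGISTER. The
   * netdev_map_* names and the global list are hypothetical; only the
   * notifier mechanics are existing kernel API. */
  static LIST_HEAD(netdev_map_list);      /* all netdev-array maps */

  static int netdev_map_notify(struct notifier_block *nb,
                               unsigned long event, void *ptr)
  {
          struct net_device *dev = netdev_notifier_info_to_dev(ptr);
          struct netdev_map *nmap;        /* hypothetical wrapper around bpf_map */

          if (event != NETDEV_UNREGISTER)
                  return NOTIFY_DONE;

          /* Drop this device from every netdev-array map so that
           * unregistration never stalls in netdev_wait_allrefs(); a
           * cleared slot simply NOPs the transmission, as suggested above. */
          list_for_each_entry(nmap, &netdev_map_list, list)
                  netdev_map_purge_dev(nmap, dev);        /* hypothetical */

          return NOTIFY_OK;
  }

  static struct notifier_block netdev_map_notifier = {
          .notifier_call = netdev_map_notify,
  };
  /* registered once at init: register_netdevice_notifier(&netdev_map_notifier) */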
Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On 4/28/17 3:58 AM, Jesper Dangaard Brouer wrote:
> On Thu, 27 Apr 2017 16:31:14 -0700 Alexei Starovoitov wrote:
>> On 4/27/17 1:41 AM, Jesper Dangaard Brouer wrote:
>>> When registering/attaching an XDP/bpf program, we would just send the
>>> file-descriptor for this port-map along (like we do with the bpf_prog
>>> FD), plus its own ingress-port number in the port-map.
>>>
>>> It is not clear to me in which data-structure on the kernel side we
>>> store this reference to the port-map and ingress-port, as today we only
>>> have the "raw" struct bpf_prog pointer. I see several options:
>>>
>>> 1. Create a new xdp_prog struct that contains the existing bpf_prog,
>>>    a port-map pointer and the ingress-port. (IMHO the easiest solution)
>>>
>>> 2. Just create a new pointer to the port-map and store it in the driver
>>>    rx-ring struct (like the existing bpf_prog), but this creates a
>>>    race-challenge when replacing (cmpxchg) the program (or perhaps it's
>>>    not a problem, as it runs under RCU and the RTNL-lock).
>>>
>>> 3. Extend bpf_prog to store this port-map and ingress-port, and have a
>>>    fast way to access it. I assume it would be accessible via
>>>    bpf_prog->bpf_prog_aux->used_maps[X], but that will be too slow for
>>>    XDP.
>>
>> I'm not sure I completely follow the 3 proposals.
>> Are you suggesting to have only one netdev_array per program?
>
> Yes, but I can see you have a more clever idea below.
>
>> Why not allow any number, like we do for tailcall+prog_array, etc.?
>> We can teach the verifier to allow a new helper
>>   bpf_tx_port(netdev_array, port_num);
>> to only be used with the netdev_array map type.
>> It will fetch the netdevice pointer from netdev_array[port_num]
>> and will tx the packet into it.
>
> I love it.
>
> I just don't like the "netdev" part of the name "netdev_array", as one of
> the basic ideas of a port table is that a port can be anything that can
> consume an XDP_buff packet. This generalization allows us to move code
> out of the drivers. We might be on the same page, as I do imagine that
> netdev_array or port_array is just a struct bpf_map pointer, and the
> bpf_map->map_type will tell us that this bpf_map contains net_device
> pointers. Thus, when later introducing a new type of redirect (like to
> a socket or remote-CPU), we just add a new bpf_map_type for it, without
> needing to change anything in the drivers, right?

In theory, yes, but in practice I doubt it will be so easy.
We probably shouldn't allow very different types of netdev
into the same netdev_array or port_array (whatever the name).
They need to be similar enough, otherwise we'd have to do run-time
checks. If they're all the same, these checks can be done at
insertion time instead.

> Do you imagine that the bpf-side bpf_tx_port() returns XDP_REDIRECT?
> Or does it return whether the call was successful (e.g. validating that
> port_num exists in the map)?

don't know :)
we need to brainstorm pros and cons.

> On the kernel side, we need to receive this info "port_array" and
> "port_num"; given you don't pass the call an xdp_buff/ctx, I assume you
> want the per-CPU temp-store solution. Then during the XDP_REDIRECT
> action we call a core redirect function that, based on the bpf_map_type,
> does a lookup and finds the net_device ptr.

hmm. didn't think that far either :)
indeed makes sense to pass 'ctx' into such helper as well,
so it's easier to deal with original netdev.

>> We can make it similar to bpf_tail_call(), so that the program will
>> finish on a successful bpf_tx_port(), or
>> make it into a 'delayed' tx which will be executed when the program
>> finishes. Not sure which approach is better.
>
> I know you are talking about something slightly different, about
> delaying TX.
> But I want to mention (as I've done before) that it is important (for
> me) that we get bulking working/integrated. I imagine the driver will
> call a function that delays the TX/redirect action, and at the end
> of the NAPI cycle a function that flushes the packets, bulked per
> destination port.
>
> I was wondering where to store these delayed TX packets, but now that
> we have an associated bpf_map data-structure (netdev_array), I'm thinking
> about storing packets (ordered by port) inside that. And then have a
> bpf_tx_flush(netdev_array) call in the driver (for every port-table-map
> seen, which will likely be small).

makes sense to me as well.
Ideally we should try to make the api such that batching or no-batching
can be the kernel's choice. The program will just do
  xdp_tx_port(...something here...)
and the kernel does what is best for performance. If it needs to delay
the result to do batching, the api should allow that transparently.
Like right now an xdp program does 'return XDP_TX;' and on the kernel
side we can either do an immediate xmit (like we do today) or change it
to do batching, and programs don't need to change.

>> We can also extend this netdev_array into broadcast/multicast. Like
>>   bpf_tx_allports(netdev_array);
>> call from the program will xmit the packet to all netdevices
>> in that 'netdev_array' map type.
>
> When broadcasting you often don't want to broadcast the packet out of
> the incoming interface. How can you support this?
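From the driver side, the delayed-TX plus end-of-NAPI flush described
above might look roughly like the sketch below. The drv_* helpers, the
per-ring port_map pointer and bpf_tx_flush() are placeholders taken from
the discussion, not existing API; only bpf_prog_run_xdp() is real, and the
XDP_REDIRECT return code is itself still under discussion in this thread.

  /* Driver-side sketch of the proposed delayed-TX + bulk-flush model. */
  struct drv_rx_ring {                    /* hypothetical driver ring */
          struct napi_struct napi;
          struct bpf_prog *xdp_prog;
          struct bpf_map *port_map;       /* netdev_array seen by this ring */
  };

  static int drv_napi_poll(struct napi_struct *napi, int budget)
  {
          struct drv_rx_ring *ring = container_of(napi, struct drv_rx_ring, napi);
          int done = 0;

          while (done < budget && drv_rx_pending(ring)) {
                  struct xdp_buff xdp;

                  drv_fill_xdp_buff(ring, &xdp);  /* hypothetical */
                  switch (bpf_prog_run_xdp(ring->xdp_prog, &xdp)) {
                  case XDP_REDIRECT:
                          /* packet was queued on the netdev_array map by
                           * bpf_tx_port(), bulked per destination port */
                          break;
                  case XDP_TX:
                          drv_xmit_back(ring, &xdp);      /* hypothetical */
                          break;
                  default:
                          drv_drop(ring, &xdp);           /* hypothetical */
                          break;
                  }
                  done++;
          }

          /* flush everything delayed during this NAPI cycle: one bulk
           * xmit per destination port */
          bpf_tx_flush(ring->port_map);
          return done;
  }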
Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On 28.04.2017 07:30, Alexei Starovoitov wrote:
> On 4/27/17 10:06 PM, John Fastabend wrote:
>> That is more or less what I was thinking as well. The other question
>> I have though is should we have a bpf_redirect() call for the simple
>> case where I use the ifindex directly. This will be helpful for taking
>> existing programs from tc_cls into xdp. I think it makes sense to have
>> bpf_tx_allports(), bpf_tx_port(), and bpf_redirect().
>
> I think so too.
> Once a netdevice is stored into a netdev_array map the netdevice is
> pinned, and we need to figure out what to do if somebody tries to delete
> it. Should we add a new netlink notifier that this netdev's refcnt is
> almost zero and it's only in netdev_array(s)?

We basically do that automatically in netdev_wait_allrefs:

  pr_emerg("unregister_netdevice: waiting for %s to become free. Usage count = %d\n",
           dev->name, refcnt);

It is a very unpleasant warning, and users probably suspect a bug in the
kernel at first. I don't think we should wait for user space to clean
that up; we have to do it automatically from the kernel.

Maybe we can introduce a special value that basically NOPs the
transmission. The hash table itself would install a netdevice notifier
and would clean all tables. Could definitely cause some storm in the
kernel if a lot of keys are mapped to the same interface.

> or should it be deleted from the array(s) automatically and
> then user space will be notified post-deletion?
> Both approaches have their pros and cons.

I am leaning more towards deleting it automatically. But walking all
tables, and in there all keys, might cause some unwanted load spikes.

> Whereas the raw ifindex approach (via bpf_redirect) doesn't have these
> caveats. It's clear to both the bpf prog and user space that the ifindex
> can be stale, and user space needs to monitor netdevs and update
> programs/maps.

A separate type for ifindex as key or value might be nice to expose this
information directly via the kernel (fdinfo etc.), but at the same time,
debugging infrastructure from user space can also easily deal with that.

Another approach would be: ifindexes are allocated cyclically and are
also signed int, not u32, during allocation. Maybe we can negate the
ifindex in the table during transmission and thus mark it as stale (or
set it to -1)? This update would be done by the bpf_tx_*ports() helpers,
which take a reference to a table and a key, submit the packets on the
appropriate ports, and could flag the relevant ifindexes as stale.

Just wanted to draft this idea; I am not particularly happy with it.
Maybe someone comes up with a better one.

Thanks,
Hannes
Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On Thu, 27 Apr 2017 16:31:14 -0700 Alexei Starovoitov wrote:
> On 4/27/17 1:41 AM, Jesper Dangaard Brouer wrote:
>> When registering/attaching an XDP/bpf program, we would just send the
>> file-descriptor for this port-map along (like we do with the bpf_prog
>> FD), plus its own ingress-port number in the port-map.
>>
>> It is not clear to me in which data-structure on the kernel side we
>> store this reference to the port-map and ingress-port, as today we only
>> have the "raw" struct bpf_prog pointer. I see several options:
>>
>> 1. Create a new xdp_prog struct that contains the existing bpf_prog,
>>    a port-map pointer and the ingress-port. (IMHO the easiest solution)
>>
>> 2. Just create a new pointer to the port-map and store it in the driver
>>    rx-ring struct (like the existing bpf_prog), but this creates a
>>    race-challenge when replacing (cmpxchg) the program (or perhaps it's
>>    not a problem, as it runs under RCU and the RTNL-lock).
>>
>> 3. Extend bpf_prog to store this port-map and ingress-port, and have a
>>    fast way to access it. I assume it would be accessible via
>>    bpf_prog->bpf_prog_aux->used_maps[X], but that will be too slow for
>>    XDP.
>
> I'm not sure I completely follow the 3 proposals.
> Are you suggesting to have only one netdev_array per program?

Yes, but I can see you have a more clever idea below.

> Why not allow any number, like we do for tailcall+prog_array, etc.?
> We can teach the verifier to allow a new helper
>   bpf_tx_port(netdev_array, port_num);
> to only be used with the netdev_array map type.
> It will fetch the netdevice pointer from netdev_array[port_num]
> and will tx the packet into it.

I love it.

I just don't like the "netdev" part of the name "netdev_array", as one of
the basic ideas of a port table is that a port can be anything that can
consume an XDP_buff packet. This generalization allows us to move code
out of the drivers. We might be on the same page, as I do imagine that
netdev_array or port_array is just a struct bpf_map pointer, and the
bpf_map->map_type will tell us that this bpf_map contains net_device
pointers. Thus, when later introducing a new type of redirect (like to
a socket or remote-CPU), we just add a new bpf_map_type for it, without
needing to change anything in the drivers, right?

Do you imagine that the bpf-side bpf_tx_port() returns XDP_REDIRECT?
Or does it return whether the call was successful (e.g. validating that
port_num exists in the map)?

On the kernel side, we need to receive this info "port_array" and
"port_num"; given you don't pass the call an xdp_buff/ctx, I assume you
want the per-CPU temp-store solution. Then during the XDP_REDIRECT
action we call a core redirect function that, based on the bpf_map_type,
does a lookup and finds the net_device ptr.

> We can make it similar to bpf_tail_call(), so that the program will
> finish on a successful bpf_tx_port(), or
> make it into a 'delayed' tx which will be executed when the program
> finishes. Not sure which approach is better.

I know you are talking about something slightly different, about
delaying TX.

But I want to mention (as I've done before) that it is important (for
me) that we get bulking working/integrated. I imagine the driver will
call a function that delays the TX/redirect action, and at the end of
the NAPI cycle a function that flushes the packets, bulked per
destination port.

I was wondering where to store these delayed TX packets, but now that
we have an associated bpf_map data-structure (netdev_array), I'm thinking
about storing packets (ordered by port) inside that.
And then have a bpf_tx_flush(netdev_array) call in the driver (for every
port-table-map seen, which will likely be small).

> We can also extend this netdev_array into broadcast/multicast. Like
>   bpf_tx_allports(netdev_array);
> call from the program will xmit the packet to all netdevices
> in that 'netdev_array' map type.

When broadcasting you often don't want to broadcast the packet out of
the incoming interface. How can you support this? Normally you would
know your ingress port, and then exclude that port in the broadcast.
But with many netdev_arrays, how does the program know its own ingress
port?

> The map-in-map support can be trivially extended to allow netdev_array;
> then the program can create N multicast groups of netdevices.
> Each multicast group == one netdev_array map.
> The user space will populate a hashmap with these netdev_arrays, and
> the bpf kernel side can select dynamically which multicast group to use
> to send the packets to. The bpf kernel side may look like:
>
>   struct bpf_netdev_array *netdev_array = bpf_map_lookup_elem(..., key);
>   if (!netdev_array)
>       ...
>   if (my_condition)
>       bpf_tx_allports(netdev_array);        /* broadcast to all netdevices */
>   else
>       bpf_tx_port(netdev_array, port_num);  /* tx into one netdevice */
>
> that's an artificial example. Just trying to point out
> that we shouldn't restrict the feature too soon.

I like how you solve the multicast problem. (But I do
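The exclude-the-ingress-port idea could be done by hand on the BPF side,
as in the sketch below. It assumes the delayed-tx variant of bpf_tx_port()
(the helper must not terminate the program) and a hypothetical
ingress_port field in struct xdp_md answering the question above; tx_ports
is the hypothetical netdev-array map declared in the earlier sketch.

  /* Sketch: exclude-ingress broadcast done manually from the BPF side.
   * ctx->ingress_port is a hypothetical field; bpf_tx_port() and the
   * tx_ports map are the hypothetical pieces from the earlier sketch. */
  #define MAX_PORTS 8

  SEC("xdp")
  int xdp_bcast(struct xdp_md *ctx)
  {
          __u32 port;

  #pragma unroll
          for (port = 0; port < MAX_PORTS; port++) {
                  if (port == ctx->ingress_port)  /* don't echo to sender */
                          continue;
                  bpf_tx_port(ctx, &tx_ports, port);
          }
          return XDP_REDIRECT;
  }

A bpf_tx_allports() variant taking an explicit exclude-port argument would
avoid the unrolled loop entirely.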
Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On 4/27/17 10:06 PM, John Fastabend wrote:
> That is more or less what I was thinking as well. The other question
> I have though is should we have a bpf_redirect() call for the simple
> case where I use the ifindex directly. This will be helpful for taking
> existing programs from tc_cls into xdp. I think it makes sense to have
> bpf_tx_allports(), bpf_tx_port(), and bpf_redirect().

I think so too.

Once a netdevice is stored into a netdev_array map the netdevice is
pinned, and we need to figure out what to do if somebody tries to delete
it. Should we add a new netlink notifier that this netdev's refcnt is
almost zero and it's only in netdev_array(s)?

Or should it be deleted from the array(s) automatically, and then user
space will be notified post-deletion? Both approaches have their pros
and cons.

Whereas the raw ifindex approach (via bpf_redirect) doesn't have these
caveats. It's clear to both the bpf prog and user space that the ifindex
can be stale, and user space needs to monitor netdevs and update
programs/maps.
Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On 17-04-27 04:31 PM, Alexei Starovoitov wrote:
> On 4/27/17 1:41 AM, Jesper Dangaard Brouer wrote:
>> When registering/attaching an XDP/bpf program, we would just send the
>> file-descriptor for this port-map along (like we do with the bpf_prog
>> FD), plus its own ingress-port number in the port-map.
>>
>> It is not clear to me in which data-structure on the kernel side we
>> store this reference to the port-map and ingress-port, as today we only
>> have the "raw" struct bpf_prog pointer. I see several options:
>>
>> 1. Create a new xdp_prog struct that contains the existing bpf_prog,
>>    a port-map pointer and the ingress-port. (IMHO the easiest solution)
>>
>> 2. Just create a new pointer to the port-map and store it in the driver
>>    rx-ring struct (like the existing bpf_prog), but this creates a
>>    race-challenge when replacing (cmpxchg) the program (or perhaps it's
>>    not a problem, as it runs under RCU and the RTNL-lock).
>>
>> 3. Extend bpf_prog to store this port-map and ingress-port, and have a
>>    fast way to access it. I assume it would be accessible via
>>    bpf_prog->bpf_prog_aux->used_maps[X], but that will be too slow for
>>    XDP.
>
> I'm not sure I completely follow the 3 proposals.
> Are you suggesting to have only one netdev_array per program?
> Why not allow any number, like we do for tailcall+prog_array, etc.?
> We can teach the verifier to allow a new helper
>   bpf_tx_port(netdev_array, port_num);
> to only be used with the netdev_array map type.
> It will fetch the netdevice pointer from netdev_array[port_num]
> and will tx the packet into it.
> We can make it similar to bpf_tail_call(), so that the program will
> finish on a successful bpf_tx_port(), or
> make it into a 'delayed' tx which will be executed when the program
> finishes. Not sure which approach is better.

My reaction would be to make it finish on success, but I would like to
write a few programs first and see. I can't think of any use for _not_
terminating, but maybe there is something I'm missing.

> We can also extend this netdev_array into broadcast/multicast. Like
>   bpf_tx_allports(netdev_array);
> call from the program will xmit the packet to all netdevices
> in that 'netdev_array' map type.

Yep, nice solution to the multicast problem.

> The map-in-map support can be trivially extended to allow netdev_array;
> then the program can create N multicast groups of netdevices.
> Each multicast group == one netdev_array map.
> The user space will populate a hashmap with these netdev_arrays, and
> the bpf kernel side can select dynamically which multicast group to use
> to send the packets to. The bpf kernel side may look like:
>
>   struct bpf_netdev_array *netdev_array = bpf_map_lookup_elem(..., key);
>   if (!netdev_array)
>       ...
>   if (my_condition)
>       bpf_tx_allports(netdev_array);        /* broadcast to all netdevices */
>   else
>       bpf_tx_port(netdev_array, port_num);  /* tx into one netdevice */
>
> that's an artificial example. Just trying to point out
> that we shouldn't restrict the feature too soon.

That is more or less what I was thinking as well. The other question
I have though is should we have a bpf_redirect() call for the simple
case where I use the ifindex directly. This will be helpful for taking
existing programs from tc_cls into xdp. I think it makes sense to have
bpf_tx_allports(), bpf_tx_port(), and bpf_redirect().

.John
Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On 4/27/17 1:41 AM, Jesper Dangaard Brouer wrote:
> When registering/attaching an XDP/bpf program, we would just send the
> file-descriptor for this port-map along (like we do with the bpf_prog
> FD), plus its own ingress-port number in the port-map.
>
> It is not clear to me in which data-structure on the kernel side we
> store this reference to the port-map and ingress-port, as today we only
> have the "raw" struct bpf_prog pointer. I see several options:
>
> 1. Create a new xdp_prog struct that contains the existing bpf_prog,
>    a port-map pointer and the ingress-port. (IMHO the easiest solution)
>
> 2. Just create a new pointer to the port-map and store it in the driver
>    rx-ring struct (like the existing bpf_prog), but this creates a
>    race-challenge when replacing (cmpxchg) the program (or perhaps it's
>    not a problem, as it runs under RCU and the RTNL-lock).
>
> 3. Extend bpf_prog to store this port-map and ingress-port, and have a
>    fast way to access it. I assume it would be accessible via
>    bpf_prog->bpf_prog_aux->used_maps[X], but that will be too slow for
>    XDP.

I'm not sure I completely follow the 3 proposals.
Are you suggesting to have only one netdev_array per program?
Why not allow any number, like we do for tailcall+prog_array, etc.?

We can teach the verifier to allow a new helper

  bpf_tx_port(netdev_array, port_num);

to only be used with the netdev_array map type. It will fetch the
netdevice pointer from netdev_array[port_num] and will tx the packet
into it.

We can make it similar to bpf_tail_call(), so that the program will
finish on a successful bpf_tx_port(), or make it into a 'delayed' tx
which will be executed when the program finishes. Not sure which
approach is better.

We can also extend this netdev_array into broadcast/multicast. Like

  bpf_tx_allports(netdev_array);

call from the program will xmit the packet to all netdevices in that
'netdev_array' map type.

The map-in-map support can be trivially extended to allow netdev_array;
then the program can create N multicast groups of netdevices.
Each multicast group == one netdev_array map.
The user space will populate a hashmap with these netdev_arrays, and
the bpf kernel side can select dynamically which multicast group to use
to send the packets to. The bpf kernel side may look like:

  struct bpf_netdev_array *netdev_array = bpf_map_lookup_elem(..., key);
  if (!netdev_array)
      ...
  if (my_condition)
      bpf_tx_allports(netdev_array);        /* broadcast to all netdevices */
  else
      bpf_tx_port(netdev_array, port_num);  /* tx into one netdevice */

that's an artificial example. Just trying to point out
that we shouldn't restrict the feature too soon.
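Filled out a little, that artificial example might read as below. The
outer map layout, select_group(), my_condition() and both bpf_tx_*
helpers remain hypothetical; only the map-in-map type itself exists.

  /* Filled-out version of the artificial example above. The outer map
   * layout, select_group(), my_condition() and both bpf_tx_* helpers
   * are hypothetical; only BPF_MAP_TYPE_HASH_OF_MAPS itself exists. */
  struct bpf_map_def SEC("maps") mcast_groups = {
          .type        = BPF_MAP_TYPE_HASH_OF_MAPS,
          .key_size    = sizeof(__u32),   /* multicast group id */
          .value_size  = sizeof(__u32),   /* fd of inner netdev_array map */
          .max_entries = 16,
  };

  SEC("xdp")
  int xdp_mcast(struct xdp_md *ctx)
  {
          __u32 group = select_group(ctx);        /* program policy, not shown */
          __u32 port_num = 0;
          struct bpf_netdev_array *netdev_array;

          netdev_array = bpf_map_lookup_elem(&mcast_groups, &group);
          if (!netdev_array)
                  return XDP_DROP;

          if (my_condition(ctx))
                  bpf_tx_allports(netdev_array);          /* broadcast group */
          else
                  bpf_tx_port(netdev_array, port_num);    /* one netdevice */

          return XDP_REDIRECT;
  }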
Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On Wed, 26 Apr 2017 16:55:44 -0400 Andy Gospodarek wrote:
> On Wed, Apr 26, 2017 at 10:58:45AM -0700, Alexei Starovoitov wrote:
>> On 4/26/17 9:35 AM, John Fastabend wrote:
>>>> As Alexei also mentioned before, ifindex vs port makes no real
>>>> difference seen from the bpf program side. It is userspace's
>>>> responsibility to add ifindex/port's to the bpf-maps, according to how
>>>> the bpf program "policy" wants to "connect" these ports. The
>>>> port-table system adds one extra step, of also adding this port to the
>>>> port-table (which lives inside the kernel).
>>>
>>> I'm not sure I understand the "lives inside the kernel" bit. I assumed
>>> the 'map' should be a bpf map and behave like any other bpf map.
>>>
>>> I wanted a new map to be defined, something like this from the bpf
>>> programmer side:
>>>
>>>   struct bpf_map_def SEC("maps") port_table = {
>>>       .type = BPF_MAP_TYPE_PORT_CONNECTION,
>>>       .key_size = sizeof(u32),
>>>       .value_size = BPF_PORT_CONNECTION_SIZE,
>>>       .max_entries = 256,
>>>   };
>>
>> I like the idea.
>> We have prog_array, perf_event_array, cgroup_array map specializations.
>> This one can be a new netdev_array with some new bpf_redirect-like
>> helper accessing it.
>>
>>>> When loading the XDP program, we also need to pass along a port table
>>>> "id" this XDP program is associated with (and if it doesn't exist you
>>>> create it). And your userspace "control-plane" application also needs
>>>> to know this port table "id" when adding a new port.
>>>
>>> So the user space application that is loading the program also needs
>>> to handle this map. This seems correct to me. But I don't see the
>>> value in making some new port table when we already have a well
>>> understood framework for maps.
>>
>> +1
>>
>>>> The concept of having multiple port tables is key. As this implies we
>>>> can have several simultaneous "data-planes" that are *isolated* from
>>>> each other. Think about how network-namespaces/containers want
>>>> isolation. A subtle thing I'm afraid to mention is that, opposed to the
>>>> ifindex model, a port table with a mapping to a net_device pointer
>>>> would allow (faster) delivery into the container's inner net_device,
>>>> which sort of violates the isolation, but I would argue it is not a
>>>> problem, as this net_device pointer could only be added from a process
>>>> within the namespace. I like this feature, but it could easily be
>>>> disallowed via port insertion-time validation.
>>>
>>> I think the above optimization should be allowed. And agree multiple
>>> port tables (maps?) are needed. Again all this points to using standard
>>> maps logic in my mind. For permissions and different domains, which I
>>> think you were starting to touch on, it looks like we could extend the
>>> pinning API. At the moment it does an inode_permission(inode, MAY_WRITE)
>>> check, but I presume this could be extended. None of this would be
>>> needed in v1 and could be added subsequently. read-only maps seem
>>> doable.
>>
>> this is a great idea. Once BPF_MAP_TYPE_NETDEV_ARRAY is populated
>> the user space can make it readonly to prevent further changes.
>>
>> From user space it can be done similar to perf_events/cgroups as well.
>>   bpf_map_update_elem(netdev_array, port_num, ...)
>> should work.
>> For bpf_map_lookup_elem() from such a netdev_array we can return the
>> ifindex back.
>> The bpf_map_show_fdinfo() can be customized as well to pretty print
>> the ifindexes of the netdevs stored in there.
>
> I agree with both of you on all of these points. Having the port
> redirection in a new type of map and/or array seems like the way to go.
>
> I understood Jesper's perspective when thinking about a way to pass a
> port-table id down, but I think the idea that the userspace loader code
> defining the maps is going to be the one making this link is the right
> idea, and handling things like ifindex changes (rather than identifiers
> that perform lookups in other tables) is going to have to be yet another
> exercise left up to the... user. :-)

I love this idea. Integrating the port table closer with the bpf-maps
infrastructure makes sense. This gives me a place to hook the code into,
instead of (re)inventing a new infrastructure for this port table, and
the interface will be more natural from a bpf-API point of view.

When registering/attaching an XDP/bpf program, we would just send the
file-descriptor for this port-map along (like we do with the bpf_prog
FD), plus its own ingress-port number in the port-map.

It is not clear to me in which data-structure on the kernel side we
store this reference to the port-map and ingress-port, as today we only
have the "raw" struct bpf_prog pointer. I see several options:

1. Create a new xdp_prog struct that contains the existing bpf_prog,
   a port-map pointer and the ingress-port. (IMHO the easiest solution)

2. Just create a new pointer to the port-map and store it in the driver
   rx-ring struct (like the existing bpf_prog), but this creates a
   race-challenge when replacing (cmpxchg) the program (or perhaps it's
   not a problem, as it runs under RCU and the RTNL-lock).

3. Extend bpf_prog to store this port-map and ingress-port, and have a
   fast way to access it. I assume it would be accessible via
   bpf_prog->bpf_prog_aux->used_maps[X], but that will be too slow for
   XDP.
Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On Wed, Apr 26, 2017 at 10:58:45AM -0700, Alexei Starovoitov wrote:
> On 4/26/17 9:35 AM, John Fastabend wrote:
>>> As Alexei also mentioned before, ifindex vs port makes no real
>>> difference seen from the bpf program side. It is userspace's
>>> responsibility to add ifindex/port's to the bpf-maps, according to how
>>> the bpf program "policy" wants to "connect" these ports. The
>>> port-table system adds one extra step, of also adding this port to the
>>> port-table (which lives inside the kernel).
>>
>> I'm not sure I understand the "lives inside the kernel" bit. I assumed
>> the 'map' should be a bpf map and behave like any other bpf map.
>>
>> I wanted a new map to be defined, something like this from the bpf
>> programmer side:
>>
>>   struct bpf_map_def SEC("maps") port_table = {
>>       .type = BPF_MAP_TYPE_PORT_CONNECTION,
>>       .key_size = sizeof(u32),
>>       .value_size = BPF_PORT_CONNECTION_SIZE,
>>       .max_entries = 256,
>>   };
>
> I like the idea.
> We have prog_array, perf_event_array, cgroup_array map specializations.
> This one can be a new netdev_array with some new bpf_redirect-like
> helper accessing it.
>
>>> When loading the XDP program, we also need to pass along a port table
>>> "id" this XDP program is associated with (and if it doesn't exist you
>>> create it). And your userspace "control-plane" application also needs
>>> to know this port table "id" when adding a new port.
>>
>> So the user space application that is loading the program also needs
>> to handle this map. This seems correct to me. But I don't see the
>> value in making some new port table when we already have a well
>> understood framework for maps.
>
> +1
>
>>> The concept of having multiple port tables is key. As this implies we
>>> can have several simultaneous "data-planes" that are *isolated* from
>>> each other. Think about how network-namespaces/containers want
>>> isolation. A subtle thing I'm afraid to mention is that, opposed to the
>>> ifindex model, a port table with a mapping to a net_device pointer
>>> would allow (faster) delivery into the container's inner net_device,
>>> which sort of violates the isolation, but I would argue it is not a
>>> problem, as this net_device pointer could only be added from a process
>>> within the namespace. I like this feature, but it could easily be
>>> disallowed via port insertion-time validation.
>>
>> I think the above optimization should be allowed. And agree multiple
>> port tables (maps?) are needed. Again all this points to using standard
>> maps logic in my mind. For permissions and different domains, which I
>> think you were starting to touch on, it looks like we could extend the
>> pinning API. At the moment it does an inode_permission(inode, MAY_WRITE)
>> check, but I presume this could be extended. None of this would be
>> needed in v1 and could be added subsequently. read-only maps seem
>> doable.
>
> this is a great idea. Once BPF_MAP_TYPE_NETDEV_ARRAY is populated
> the user space can make it readonly to prevent further changes.
>
> From user space it can be done similar to perf_events/cgroups as well.
>   bpf_map_update_elem(netdev_array, port_num, ...)
> should work.
> For bpf_map_lookup_elem() from such a netdev_array we can return the
> ifindex back.
> The bpf_map_show_fdinfo() can be customized as well to pretty print
> the ifindexes of the netdevs stored in there.

I agree with both of you on all of these points.
Having the port redirection in a new type of map and/or array seems like
the way to go.

I understood Jesper's perspective when thinking about a way to pass a
port-table id down, but I think the idea that the userspace loader code
defining the maps is going to be the one making this link is the right
idea, and handling things like ifindex changes (rather than identifiers
that perform lookups in other tables) is going to have to be yet another
exercise left up to the... user. :-)
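On the user-space side, populating such a map could look like the sketch
below, using the bpf(2) syscall wrappers from tools/lib/bpf. The
netdev-array map type itself, and any way of flipping the populated map
read-only afterwards, are still the proposed extensions discussed above.

  /* User-space sketch: fill a (hypothetical) netdev-array map with
   * virtual-port -> ifindex entries. bpf_map_update_elem() is the bpf(2)
   * syscall wrapper; making the map read-only afterwards is the proposed
   * extension, not existing API. */
  #include <linux/bpf.h>
  #include <net/if.h>
  #include <bpf/bpf.h>      /* tools/lib/bpf syscall wrappers */

  static int setup_ports(int map_fd)
  {
          struct { const char *name; __u32 port; } ports[] = {
                  { "eth0",  0 },
                  { "veth1", 1 },
          };

          for (unsigned int i = 0; i < sizeof(ports) / sizeof(ports[0]); i++) {
                  __u32 ifindex = if_nametoindex(ports[i].name);

                  if (!ifindex)
                          return -1;      /* netdev does not exist (yet) */
                  if (bpf_map_update_elem(map_fd, &ports[i].port,
                                          &ifindex, BPF_ANY))
                          return -1;
          }
          return 0;       /* map could now be made read-only (proposed) */
  }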
Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On 4/26/17 9:35 AM, John Fastabend wrote:
>> As Alexei also mentioned before, ifindex vs port makes no real
>> difference seen from the bpf program side. It is userspace's
>> responsibility to add ifindex/port's to the bpf-maps, according to how
>> the bpf program "policy" wants to "connect" these ports. The
>> port-table system adds one extra step, of also adding this port to the
>> port-table (which lives inside the kernel).
>
> I'm not sure I understand the "lives inside the kernel" bit. I assumed
> the 'map' should be a bpf map and behave like any other bpf map.
>
> I wanted a new map to be defined, something like this from the bpf
> programmer side:
>
>   struct bpf_map_def SEC("maps") port_table = {
>       .type = BPF_MAP_TYPE_PORT_CONNECTION,
>       .key_size = sizeof(u32),
>       .value_size = BPF_PORT_CONNECTION_SIZE,
>       .max_entries = 256,
>   };

I like the idea.
We have prog_array, perf_event_array, cgroup_array map specializations.
This one can be a new netdev_array with some new bpf_redirect-like
helper accessing it.

>> When loading the XDP program, we also need to pass along a port table
>> "id" this XDP program is associated with (and if it doesn't exist you
>> create it). And your userspace "control-plane" application also needs
>> to know this port table "id" when adding a new port.
>
> So the user space application that is loading the program also needs
> to handle this map. This seems correct to me. But I don't see the
> value in making some new port table when we already have a well
> understood framework for maps.

+1

>> The concept of having multiple port tables is key. As this implies we
>> can have several simultaneous "data-planes" that are *isolated* from
>> each other. Think about how network-namespaces/containers want
>> isolation. A subtle thing I'm afraid to mention is that, opposed to the
>> ifindex model, a port table with a mapping to a net_device pointer would
>> allow (faster) delivery into the container's inner net_device, which
>> sort of violates the isolation, but I would argue it is not a problem,
>> as this net_device pointer could only be added from a process within the
>> namespace. I like this feature, but it could easily be disallowed via
>> port insertion-time validation.
>
> I think the above optimization should be allowed. And agree multiple
> port tables (maps?) are needed. Again all this points to using standard
> maps logic in my mind. For permissions and different domains, which I
> think you were starting to touch on, it looks like we could extend the
> pinning API. At the moment it does an inode_permission(inode, MAY_WRITE)
> check, but I presume this could be extended. None of this would be
> needed in v1 and could be added subsequently. read-only maps seem
> doable.

this is a great idea. Once BPF_MAP_TYPE_NETDEV_ARRAY is populated
the user space can make it readonly to prevent further changes.

From user space it can be done similar to perf_events/cgroups as well.
  bpf_map_update_elem(netdev_array, port_num, ...)
should work.
For bpf_map_lookup_elem() from such a netdev_array we can return the
ifindex back.
The bpf_map_show_fdinfo() can be customized as well to pretty print
the ifindexes of the netdevs stored in there.
Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
[...]

>> Jesper, I was working up the code for the redirect piece for ixgbe and
>> virtio, please use this as a base for your virtual port number table.
>> I'll push an update onto github tomorrow. I think the table should drop
>> in fairly nicely.
>
> Cool, I will do that. Then I'll also have a redirect method to shape
> this around, and I would have to benchmark/test your ixgbe redirect.
>
> (John, please let me know what github tree we are talking about, and
> what branch)
>
>> One piece that isn't clear to me is how do you plan to instantiate and
>> program this table. Is it a new static bpf map that is created any
>> time we see the redirect command? I think this would be preferred.
>
> (This is difficult to explain without us misunderstanding each other)

Yep, and I'm not sure I follow :)

> As Alexei also mentioned before, ifindex vs port makes no real
> difference seen from the bpf program side. It is userspace's
> responsibility to add ifindex/port's to the bpf-maps, according to how
> the bpf program "policy" wants to "connect" these ports. The
> port-table system adds one extra step, of also adding this port to the
> port-table (which lives inside the kernel).

I'm not sure I understand the "lives inside the kernel" bit. I assumed
the 'map' should be a bpf map and behave like any other bpf map.

I wanted a new map to be defined, something like this from the bpf
programmer side:

  struct bpf_map_def SEC("maps") port_table = {
      .type = BPF_MAP_TYPE_PORT_CONNECTION,
      .key_size = sizeof(u32),
      .value_size = BPF_PORT_CONNECTION_SIZE,
      .max_entries = 256,
  };

> When loading the XDP program, we also need to pass along a port table
> "id" this XDP program is associated with (and if it doesn't exist you
> create it). And your userspace "control-plane" application also needs
> to know this port table "id" when adding a new port.

So the user space application that is loading the program also needs
to handle this map. This seems correct to me. But I don't see the
value in making some new port table when we already have a well
understood framework for maps.

> The concept of having multiple port tables is key. As this implies we
> can have several simultaneous "data-planes" that are *isolated* from
> each other. Think about how network-namespaces/containers want
> isolation. A subtle thing I'm afraid to mention is that, opposed to the
> ifindex model, a port table with a mapping to a net_device pointer would
> allow (faster) delivery into the container's inner net_device, which
> sort of violates the isolation, but I would argue it is not a problem,
> as this net_device pointer could only be added from a process within the
> namespace. I like this feature, but it could easily be disallowed via
> port insertion-time validation.

I think the above optimization should be allowed. And agree multiple
port tables (maps?) are needed. Again all this points to using standard
maps logic in my mind. For permissions and different domains, which I
think you were starting to touch on, it looks like we could extend the
pinning API. At the moment it does an inode_permission(inode, MAY_WRITE)
check, but I presume this could be extended. None of this would be
needed in v1 and could be added subsequently. read-only maps seem
doable.

> I'm not worried about the DROP case, I agree that is fine (as you also
> say). The problem is unintentionally sending a packet to a wrong
> ifindex. This is clearly an eBPF program error, BUT with XDP this
> becomes a very hard to debug program error.
> With TC-redirect/cls_bpf we can tcpdump the packets; with XDP there is
> no visibility into this happening (the NSA is going to love this
> "feature"). Maybe we could add yet-another tracepoint to allow
> debugging this. My proposal is that we simply remove the possibility
> for such program errors by, as you say, moving the validation from
> run-time into static insertion-time, via a port table.
>
>>> I think lack of tcpdump-like debugging in xdp is a separate issue.
>>> As I was saying in the other thread we have a trivial 'xdpdump'
>>> kern+user app that emits a pcap file, but it's too specific to how we
>>> use tail_calls+prog_array in our xdp setup. I'm working on the
>>> program chaining that will be generic and allow us to transparently
>>> add multiple xdp or tc progs to the same attachment point, and will
>>> allow us to do 'xdpdump' at any point of this pipeline, so
>>> debugging of what happened to the packet will be easier and done in
>>> the same way for both tc and xdp.
>>> btw in our experience working with both tc and xdp, the tc+bpf was
>>> actually harder to use and more bug prone.
>>
>> Nice, the tcpdump-like debugging looks interesting.
>
> Yes, this xdpdump sounds like a very useful tool.
Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On Tue, 25 Apr 2017 20:07:34 -0700 John Fastabend wrote:
> On 17-04-25 05:26 PM, Alexei Starovoitov wrote:
>> On Tue, Apr 25, 2017 at 11:34:53AM +0200, Jesper Dangaard Brouer wrote:
>>>> Note the very first bpf patchset years ago contained the port table
>>>> abstraction. ovs has the concept of vports as well. These two very
>>>> different projects needed a port table to provide a layer of
>>>> indirection between ifindex==netdev and a virtual port number.
>>>> This is still the case and I'd like to see this port table
>>>> implemented for both cls_bpf and xdp. In that sense xdp is not
>>>> special.
>>>
>>> Glad to hear you want to see this implemented, I will start coding on
>>> this then. Good point with cls_bpf, I was planning to make this port
>>> table strongly connected to XDP; guess I should also think of cls_bpf.
>>
>> perfect.
>> I think we should try to make all additions to the bpf networking world
>> usable for both tc and xdp, since both are actively used and
>> it wouldn't be great to have a cool feature for one but not the other.
>> I think the port table is an excellent candidate that applies to both.
>
> +1
>
> Jesper, I was working up the code for the redirect piece for ixgbe and
> virtio, please use this as a base for your virtual port number table.
> I'll push an update onto github tomorrow. I think the table should drop
> in fairly nicely.

Cool, I will do that. Then I'll also have a redirect method to shape
this around, and I would have to benchmark/test your ixgbe redirect.

(John, please let me know what github tree we are talking about, and
what branch)

> One piece that isn't clear to me is how do you plan to instantiate and
> program this table. Is it a new static bpf map that is created any
> time we see the redirect command? I think this would be preferred.

(This is difficult to explain without us misunderstanding each other)

As Alexei also mentioned before, ifindex vs port makes no real
difference seen from the bpf program side. It is userspace's
responsibility to add ifindex/port's to the bpf-maps, according to how
the bpf program "policy" wants to "connect" these ports. The
port-table system adds one extra step, of also adding this port to the
port-table (which lives inside the kernel).

When loading the XDP program, we also need to pass along a port table
"id" this XDP program is associated with (and if it doesn't exist you
create it). And your userspace "control-plane" application also needs
to know this port table "id" when adding a new port.

The concept of having multiple port tables is key. As this implies we
can have several simultaneous "data-planes" that are *isolated* from
each other. Think about how network-namespaces/containers want
isolation. A subtle thing I'm afraid to mention is that, opposed to the
ifindex model, a port table with a mapping to a net_device pointer would
allow (faster) delivery into the container's inner net_device, which
sort of violates the isolation, but I would argue it is not a problem,
as this net_device pointer could only be added from a process within the
namespace. I like this feature, but it could easily be disallowed via
port insertion-time validation.

>>> I'm not worried about the DROP case, I agree that is fine (as you
>>> also say). The problem is unintentionally sending a packet to a
>>> wrong ifindex. This is clearly an eBPF program error, BUT with
>>> XDP this becomes a very hard to debug program error.
>>> With TC-redirect/cls_bpf we can tcpdump the packets; with XDP there
>>> is no visibility into this happening (the NSA is going to love this
>>> "feature"). Maybe we could add yet-another tracepoint to allow
>>> debugging this. My proposal is that we simply remove the possibility
>>> for such program errors by, as you say, moving the validation from
>>> run-time into static insertion-time, via a port table.
>>
>> I think lack of tcpdump-like debugging in xdp is a separate issue.
>> As I was saying in the other thread we have a trivial 'xdpdump'
>> kern+user app that emits a pcap file, but it's too specific to how we
>> use tail_calls+prog_array in our xdp setup. I'm working on the
>> program chaining that will be generic and allow us to transparently
>> add multiple xdp or tc progs to the same attachment point, and will
>> allow us to do 'xdpdump' at any point of this pipeline, so
>> debugging of what happened to the packet will be easier and done in
>> the same way for both tc and xdp.
>> btw in our experience working with both tc and xdp, the tc+bpf was
>> actually harder to use and more bug prone.
>
> Nice, the tcpdump-like debugging looks interesting.

Yes, this xdpdump sounds like a very useful tool.

-- 
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On 17-04-25 05:26 PM, Alexei Starovoitov wrote:
> On Tue, Apr 25, 2017 at 11:34:53AM +0200, Jesper Dangaard Brouer wrote:
>>> Note the very first bpf patchset years ago contained the port table
>>> abstraction. ovs has the concept of vports as well. These two very
>>> different projects needed a port table to provide a layer of
>>> indirection between ifindex==netdev and a virtual port number.
>>> This is still the case and I'd like to see this port table
>>> implemented for both cls_bpf and xdp. In that sense xdp is not
>>> special.
>>
>> Glad to hear you want to see this implemented, I will start coding on
>> this then. Good point with cls_bpf, I was planning to make this port
>> table strongly connected to XDP; guess I should also think of cls_bpf.
>
> perfect.
> I think we should try to make all additions to the bpf networking world
> usable for both tc and xdp, since both are actively used and
> it wouldn't be great to have a cool feature for one but not the other.
> I think the port table is an excellent candidate that applies to both.

+1

Jesper, I was working up the code for the redirect piece for ixgbe and
virtio, please use this as a base for your virtual port number table.
I'll push an update onto github tomorrow. I think the table should drop
in fairly nicely.

One piece that isn't clear to me is how do you plan to instantiate and
program this table. Is it a new static bpf map that is created any
time we see the redirect command? I think this would be preferred.

>> I'm not worried about the DROP case, I agree that is fine (as you also
>> say). The problem is unintentionally sending a packet to a wrong
>> ifindex. This is clearly an eBPF program error, BUT with XDP this
>> becomes a very hard to debug program error. With TC-redirect/cls_bpf
>> we can tcpdump the packets; with XDP there is no visibility into this
>> happening (the NSA is going to love this "feature"). Maybe we could add
>> yet-another tracepoint to allow debugging this. My proposal is that we
>> simply remove the possibility for such program errors by, as you say,
>> moving the validation from run-time into static insertion-time, via a
>> port table.
>
> I think lack of tcpdump-like debugging in xdp is a separate issue.
> As I was saying in the other thread we have a trivial 'xdpdump'
> kern+user app that emits a pcap file, but it's too specific to how we
> use tail_calls+prog_array in our xdp setup. I'm working on the program
> chaining that will be generic and allow us to transparently add
> multiple xdp or tc progs to the same attachment point, and will allow
> us to do 'xdpdump' at any point of this pipeline, so debugging of what
> happened to the packet will be easier and done in the same way
> for both tc and xdp.
> btw in our experience working with both tc and xdp, the tc+bpf was
> actually harder to use and more bug prone.

Nice, the tcpdump-like debugging looks interesting.
Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On Tue, Apr 25, 2017 at 11:34:53AM +0200, Jesper Dangaard Brouer wrote:
>> Note the very first bpf patchset years ago contained the port table
>> abstraction. ovs has the concept of vports as well. These two very
>> different projects needed a port table to provide a layer of
>> indirection between ifindex==netdev and a virtual port number.
>> This is still the case and I'd like to see this port table
>> implemented for both cls_bpf and xdp. In that sense xdp is not
>> special.
>
> Glad to hear you want to see this implemented, I will start coding on
> this then. Good point with cls_bpf, I was planning to make this port
> table strongly connected to XDP; guess I should also think of cls_bpf.

perfect.
I think we should try to make all additions to the bpf networking world
usable for both tc and xdp, since both are actively used and
it wouldn't be great to have a cool feature for one but not the other.
I think the port table is an excellent candidate that applies to both.

> I'm not worried about the DROP case, I agree that is fine (as you also
> say). The problem is unintentionally sending a packet to a wrong
> ifindex. This is clearly an eBPF program error, BUT with XDP this
> becomes a very hard to debug program error. With TC-redirect/cls_bpf
> we can tcpdump the packets; with XDP there is no visibility into this
> happening (the NSA is going to love this "feature"). Maybe we could add
> yet-another tracepoint to allow debugging this. My proposal is that we
> simply remove the possibility for such program errors by, as you say,
> moving the validation from run-time into static insertion-time, via a
> port table.

I think lack of tcpdump-like debugging in xdp is a separate issue.
As I was saying in the other thread we have a trivial 'xdpdump'
kern+user app that emits a pcap file, but it's too specific to how we
use tail_calls+prog_array in our xdp setup. I'm working on the program
chaining that will be generic and allow us to transparently add
multiple xdp or tc progs to the same attachment point, and will allow
us to do 'xdpdump' at any point of this pipeline, so debugging of what
happened to the packet will be easier and done in the same way
for both tc and xdp.
btw in our experience working with both tc and xdp, the tc+bpf was
actually harder to use and more bug prone.
Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On Thu, 20 Apr 2017 10:10:08 -0700 Alexei Starovoitov wrote:
> On Thu, Apr 20, 2017 at 08:10:51AM +0200, Jesper Dangaard Brouer wrote:
>> On Wed, 19 Apr 2017 19:56:13 -0700 Alexei Starovoitov wrote:
>>> On Thu, Apr 20, 2017 at 12:51:31AM +0200, Daniel Borkmann wrote:
>>>>
>>>> Is there a concrete reason that all the proposed future cases like
>>>> sockets have to be handled within the very same XDP_REDIRECT return
>>>> code? F.e. why not XDP_TX_NIC that only assumes ifindex as proposed
>>>> in the patch, and future ones would get a different return code f.e.
>>>> XDP_TX_SK only handling sockets when we get there
>>>> implementation-wise?
>>>
>>> yeah. let's keep redirect to sockets, tunnels, crypto and exotic
>>> things out of this discussion.
>>> XDP_REDIRECT should assume an L2 raw packet is being redirected to
>>> another L2 netdev. If we make it too generic it will lose performance.
>>>
>>> For cls_bpf the ifindex concept is symmetric. The program can access
>>> it as skb->ifindex on receive and can redirect to another ifindex via
>>> the bpf_redirect() helper.
>>> Since the netdev is not locked, it's actually a big plus, since the
>>> container management control plane can simply delete netns+veth and
>>> it goes away. The program can have a dangling ifindex (if the control
>>> plane is buggy and didn't update the bpf side), but it's harmless.
>>> Packets that redirect to a non-existing ifindex are dropped.
>>> This approach is already understood and works well, so for XDP I
>>> suggest to use the same approach initially, before starting to
>>> reinvent the wheel. struct xdp_md needs an ifindex field and an
>>> xdp_redirect() helper that redirects to an L2 netdev only. That's it.
>>> Simple and easy.
>>> I think the main use case in John's and Jesper's minds is something
>>> like xdpswitch, where packets are coming from VMs and from physical
>>> eths and being redirected to either a physical eth or to a VM via the
>>> upcoming vhost+xdp support.
>>> This xdp_md->ifindex + xdp_redirect(to_ifindex) will solve it just
>>> fine.
>>>
>>> Once we have vhost+xdp and all other bits implemented, we must come
>>> back to this discussion about having a port mapping table. As I
>>> mentioned during netconf I think it's very useful, but I don't think
>>> we should gate the vhost+xdp and xdp_redirect work on this discussion.
>>> As far as this port mapping table goes, we would need a 'port' field
>>> in xdp_md as well, and xdp_redirect_port() done via a helper or via an
>>> extra 'out_port' field in xdp_md plus another XDP_REDIRECT_PORT action
>>> code.
>>
>> Guess it would be easier to talk about if we name it "ingress_port" and
>> "egress_port".
>>
>>> The actual port table (array) should be populated by user space with
>>> netdevs, and these netdevs will have their refcnt incremented. Then
>>> we'll have a discussion what to do with netdev_unregister notifiers,
>>> whether entries should be auto-removed from the port table or bpf
>>> should have a chance to be notified and act on it.
>>> Such a port mapping will allow us to optimize the inevitable call
>>>   dev_get_by_index_rcu(dev_net(skb->dev), ri->ifindex);
>>> away, since netdevs will be stored in the port table and a direct
>>> deref port_map_array[xdp_md->out_port] will give us the target netdev
>>> quickly.
>>
>> I agree with the above paragraph, and am happy that you can see that
>> this will actually be faster than using ifindexes.
> > > It's a nice optimization and there are other more powerful optimizations we can do with such a port table (since we will know in advance which netdevs the program will be redirecting to), but I still think we should do the ifindex-based xdp_redirect first and only add this port table later.
> >
> > No, we cannot first do an ifindex-based xdp_redirect. The point of the port table is to sandbox which ports XDP can use.
>
> hmm. the port table cannot sandbox the ports. The only thing it does from a 'safety' point of view is moving the checks from run-time into static insertion time. So the checks that we would do on the netdev after looking it up based on ifindex are the same checks we will do at insertion time into the port table. User space will insert/delete them live from that port table, so from the program's point of view it's exactly the same as an ifindex. The ports can disappear and can be added while the program is running.

I agree that, from the eBPF program's point of view, using an ifindex or a port number is the same. And I do like this model, that this is just a number seen from bpf. It provides a clean separation between the kernel and eBPF program world.

> Note the very first bpf patchset years ago contained the port table abstraction. ovs has the concept of vports as well. These two very different projects needed a port table to provide a layer of indirection between ifindex==netdev and virtual port number.
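To make the "just a number seen from bpf" model concrete, here is a minimal program-side sketch. The map definition, the xdp_redirect_port() helper and its helper id are assumptions for illustration; none of this is existing uapi:

#include <linux/bpf.h>
#include "bpf_helpers.h"	/* SEC(), bpf_map_def, as in samples/bpf */

/* hypothetical helper: not in the kernel, declared samples/bpf-style */
static int (*xdp_redirect_port)(void *port_map, __u32 port) =
	(void *) 99 /* made-up helper id */;

/* Port table: user space binds small port numbers to netdevs; the
 * program only ever sees the numbers. A plain array stands in for
 * the netdev-holding map type under discussion. */
struct bpf_map_def SEC("maps") port_table = {
	.type        = BPF_MAP_TYPE_ARRAY,
	.key_size    = sizeof(__u32),
	.value_size  = sizeof(__u32),
	.max_entries = 64,
};

SEC("xdp")
int xdp_port_fwd(struct xdp_md *ctx)
{
	__u32 egress_port = 1;	/* would come from real forwarding logic */

	/* the kernel would deref port_table[egress_port] to find the
	 * target netdev; no ifindex lookup on the fast path */
	return xdp_redirect_port(&port_table, egress_port);
}

char _license[] SEC("license") = "GPL";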
Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On Thu, Apr 20, 2017 at 08:10:51AM +0200, Jesper Dangaard Brouer wrote:
> On Wed, 19 Apr 2017 19:56:13 -0700 Alexei Starovoitov wrote:
> > On Thu, Apr 20, 2017 at 12:51:31AM +0200, Daniel Borkmann wrote:
> > >
> > > Is there a concrete reason that all the proposed future cases like sockets have to be handled within the very same XDP_REDIRECT return code? F.e. why not XDP_TX_NIC that only assumes ifindex as proposed in the patch, and future ones would get a different return code f.e. XDP_TX_SK only handling sockets when we get there implementation-wise?
> >
> > yeah. let's keep redirect to sockets, tunnels, crypto and exotic things out of this discussion. XDP_REDIRECT should assume an L2 raw packet is being redirected to another L2 netdev. If we make it too generic it will lose performance.
> >
> > For cls_bpf the ifindex concept is symmetric. The program can access it as skb->ifindex on receive and can redirect to another ifindex via the bpf_redirect() helper. Since the netdev is not locked, it's actually a big plus, since the container management control plane can simply delete netns+veth and it goes away. The program can have a dangling ifindex (if the control plane is buggy and didn't update the bpf side), but it's harmless. Packets that redirect to a non-existing ifindex are dropped. This approach is already understood and works well, so for XDP I suggest to use the same approach initially before starting to reinvent the wheel. struct xdp_md needs an ifindex field and an xdp_redirect() helper that redirects to an L2 netdev only. That's it. Simple and easy. I think the main use case in John's and Jesper's minds is something like xdpswitch, where packets are coming from VMs and from physical eths and being redirected to either a physical eth or to a VM via the upcoming vhost+xdp support. This xdp_md->ifindex + xdp_redirect(to_ifindex) will solve it just fine.
> >
> > Once we have vhost+xdp and all other bits implemented, we must come back to this discussion about having a port mapping table. As I mentioned during netconf I think it's very useful, but I don't think we should gate the vhost+xdp and xdp_redirect work on this discussion. As far as this port mapping table goes, we would need a 'port' field in xdp_md as well, and xdp_redirect_port() done via a helper or via an extra 'out_port' field in xdp_md plus another XDP_REDIRECT_PORT action code.
>
> Guess it would be easier to talk about if we name it "ingress_port" and "egress_port".
>
> > The actual port table (array) should be populated by user space with netdevs, and these netdevs will have their refcnt incremented. Then we'll have a discussion about what to do with netdev_unregister notifiers, whether they should be auto-removed from the port table or bpf should have a chance to be notified and act on it. Such a port mapping will allow us to optimize the inevitable call dev_get_by_index_rcu(dev_net(skb->dev), ri->ifindex); away, since netdevs will be stored in the port table and a direct deref of port_map_array[xdp_md->out_port] will give us the target netdev quickly.
>
> I agree with the above paragraph, and am happy that you can see that this will actually be faster than using ifindexes.
> > It's a nice optimization and there are other more powerful optimizations we can do with such a port table (since we will know in advance which netdevs the program will be redirecting to), but I still think we should do the ifindex-based xdp_redirect first and only add this port table later.
>
> No, we cannot first do an ifindex-based xdp_redirect. The point of the port table is to sandbox which ports XDP can use.

hmm. the port table cannot sandbox the ports. The only thing it does from a 'safety' point of view is moving the checks from run-time into static insertion time. So the checks that we would do on the netdev after looking it up based on ifindex are the same checks we will do at insertion time into the port table. User space will insert/delete them live from that port table, so from the program's point of view it's exactly the same as an ifindex. The ports can disappear and can be added while the program is running.

Note the very first bpf patchset years ago contained the port table abstraction. ovs has the concept of vports as well. These two very different projects needed a port table to provide a layer of indirection between ifindex==netdev and virtual port number. This is still the case, and I'd like to see this port table implemented for both cls_bpf and xdp. In that sense xdp is not special.

> XDP is different from TC/cls_bpf, as it does "bypass"; there is no other layer that can stop or inspect these packets. The TC hooks redirect into the network stack, which has all the usual facilities available for filtering, inspection and debugging what is going on (e.g. tcpdump works for TC redirect).
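The insertion-time checks can be illustrated with a rough user-space sketch. The port_table_add() function and the choice of a plain ifindex-valued array map are assumptions; the real map type holding netdev references is exactly what is being debated above:

#include <linux/types.h>
#include <net/if.h>	/* if_indextoname(), IF_NAMESIZE */
#include <bpf/bpf.h>	/* bpf_map_update_elem() */

/* Validate the target once, when the entry is inserted, instead of on
 * every packet. The kernel side would additionally take a refcnt on
 * the netdev and deal with netdev_unregister, as discussed above. */
int port_table_add(int map_fd, __u32 port, __u32 ifindex)
{
	char name[IF_NAMESIZE];

	if (!if_indextoname(ifindex, name))
		return -1;	/* reject a dangling ifindex up front */

	return bpf_map_update_elem(map_fd, &port, &ifindex, BPF_ANY);
}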
Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On Wed, 19 Apr 2017 19:56:13 -0700 Alexei Starovoitov wrote:
> On Thu, Apr 20, 2017 at 12:51:31AM +0200, Daniel Borkmann wrote:
> >
> > Is there a concrete reason that all the proposed future cases like sockets have to be handled within the very same XDP_REDIRECT return code? F.e. why not XDP_TX_NIC that only assumes ifindex as proposed in the patch, and future ones would get a different return code f.e. XDP_TX_SK only handling sockets when we get there implementation-wise?
>
> yeah. let's keep redirect to sockets, tunnels, crypto and exotic things out of this discussion. XDP_REDIRECT should assume an L2 raw packet is being redirected to another L2 netdev. If we make it too generic it will lose performance.
>
> For cls_bpf the ifindex concept is symmetric. The program can access it as skb->ifindex on receive and can redirect to another ifindex via the bpf_redirect() helper. Since the netdev is not locked, it's actually a big plus, since the container management control plane can simply delete netns+veth and it goes away. The program can have a dangling ifindex (if the control plane is buggy and didn't update the bpf side), but it's harmless. Packets that redirect to a non-existing ifindex are dropped. This approach is already understood and works well, so for XDP I suggest to use the same approach initially before starting to reinvent the wheel. struct xdp_md needs an ifindex field and an xdp_redirect() helper that redirects to an L2 netdev only. That's it. Simple and easy. I think the main use case in John's and Jesper's minds is something like xdpswitch, where packets are coming from VMs and from physical eths and being redirected to either a physical eth or to a VM via the upcoming vhost+xdp support. This xdp_md->ifindex + xdp_redirect(to_ifindex) will solve it just fine.
>
> Once we have vhost+xdp and all other bits implemented, we must come back to this discussion about having a port mapping table. As I mentioned during netconf I think it's very useful, but I don't think we should gate the vhost+xdp and xdp_redirect work on this discussion. As far as this port mapping table goes, we would need a 'port' field in xdp_md as well, and xdp_redirect_port() done via a helper or via an extra 'out_port' field in xdp_md plus another XDP_REDIRECT_PORT action code.

Guess it would be easier to talk about if we name it "ingress_port" and "egress_port".

> The actual port table (array) should be populated by user space with netdevs, and these netdevs will have their refcnt incremented. Then we'll have a discussion about what to do with netdev_unregister notifiers, whether they should be auto-removed from the port table or bpf should have a chance to be notified and act on it. Such a port mapping will allow us to optimize the inevitable call dev_get_by_index_rcu(dev_net(skb->dev), ri->ifindex); away, since netdevs will be stored in the port table and a direct deref of port_map_array[xdp_md->out_port] will give us the target netdev quickly.

I agree with the above paragraph, and am happy that you can see that this will actually be faster than using ifindexes.

> It's a nice optimization and there are other more powerful optimizations we can do with such a port table (since we will know in advance which netdevs the program will be redirecting to), but I still think we should do the ifindex-based xdp_redirect first and only add this port table later.

No, we cannot first do an ifindex-based xdp_redirect. The point of the port table is to sandbox which ports XDP can use.
XDP is different from TC/cls_bpf, as it does "bypass"; there is no other layer that can stop or inspect these packets. The TC hooks redirect into the network stack, which has all the usual facilities available for filtering, inspection and debugging what is going on (e.g. tcpdump works for TC redirect).

-- 
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On 17-04-19 09:58 PM, Alexei Starovoitov wrote:
> On Wed, Apr 19, 2017 at 09:38:44PM -0700, John Fastabend wrote:
>> On 17-04-19 07:56 PM, Alexei Starovoitov wrote:
>>> On Thu, Apr 20, 2017 at 12:51:31AM +0200, Daniel Borkmann wrote:
>>>>
>>>> Is there a concrete reason that all the proposed future cases like sockets have to be handled within the very same XDP_REDIRECT return code? F.e. why not XDP_TX_NIC that only assumes ifindex as proposed in the patch, and future ones would get a different return code f.e. XDP_TX_SK only handling sockets when we get there implementation-wise?
>>>
>>> yeah. let's keep redirect to sockets, tunnels, crypto and exotic things out of this discussion. XDP_REDIRECT should assume an L2 raw packet is being redirected to another L2 netdev. If we make it too generic it will lose performance.
>>>
>>> For cls_bpf the ifindex concept is symmetric. The program can access it as skb->ifindex on receive and can redirect to another ifindex via the bpf_redirect() helper. Since the netdev is not locked, it's actually a big plus, since the container management control plane can simply delete netns+veth and it goes away. The program can have a dangling ifindex (if the control plane is buggy and didn't update the bpf side), but it's harmless. Packets that redirect to a non-existing ifindex are dropped. This approach is already understood and works well, so for XDP I suggest to use the same approach initially before starting to reinvent the wheel. struct xdp_md needs an ifindex field and an xdp_redirect() helper that redirects to an L2 netdev only. That's it. Simple and easy. I think the main use case in John's and Jesper's minds is something like xdpswitch, where packets are coming from VMs and from physical eths and being redirected to either a physical eth or to a VM via the upcoming vhost+xdp support. This xdp_md->ifindex + xdp_redirect(to_ifindex) will solve it just fine.
>>
>> hmm, I must be missing something; the bpf_redirect() helper should be used as a return statement, e.g.
>>
>> return bpf_redirect(ifindex, flags);
>>
>> It's a bit awkward to use in any other way. You would have to ensure the program returns TC_ACT_REDIRECT in all cases, I presume. It seems incredibly fragile and gains nothing as far as I can tell. bpf_redirect() uses per-cpu data and expects the core to call skb_do_redirect() to push the packet into the correct netdev. bpf_redirect() does not modify the skb->ifindex value, looking at the code now.
>>
>> Are you suggesting using xdp_md to store the ifindex value instead of a per-cpu variable in the redirect helper?
>
> no. i'm suggesting to store it in a per-cpu scratch area, just like cls_bpf does.
>
>> Do you really mean the xdp_md struct in the uapi headers?
>
> yes. since 'ifindex' needs to be part of the xdp_md struct in a read-only way, just like cls_bpf does it. Otherwise, if we attach the same program to multiple taps, it won't know which tap the traffic is arriving on and won't be able to redirect properly.
>
>> I don't see why it needs to be in the UAPI at all. If we don't like per-cpu variables it could be pushed as part of xdp_buff I guess.
>
> It's not about liking or disliking the per-cpu scratch area. My main point: it works just fine for cls_bpf, and i'm suggesting to do the same for xdp_redirect, since no one ever complained about that bit of cls_bpf.
>
>> My suggestion is we could add an ifindex to the xdp_md structure and have the receiving NIC driver populate the value, assuming it is useful to programs. But, if we use cls_bpf as a model, then the xdp_md ifindex is more or less independent of the redirect helper.
>
> exactly. the arriving ifindex is independent of the xdp_redirect helper.
>
>> In my opinion we should avoid diverging cls bpf and xdp bpf in subtle ways like the handling of ifindex and redirect.
>
> exactly. I'm saying the same thing. I'm not sure which part of my proposal was so confusing. Sorry about that.

Aha, what you are suggesting is exactly what I prototyped on virtio_net, so I'm happy :) I just didn't manage to parse it for whatever reason; must be getting tired.

Thanks,
John
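To show what the model John and Alexei converge on here would mean for a program attached to several taps, a small sketch follows. The ifindex field in xdp_md and the xdp_redirect() helper are still proposals at this point, and the helper id and the two ifindex constants are made up for the example:

#include <linux/bpf.h>
#include "bpf_helpers.h"	/* SEC(), as in samples/bpf */

/* hypothetical helper, declared samples/bpf-style */
static int (*xdp_redirect)(__u32 ifindex, __u64 flags) =
	(void *) 98 /* made-up helper id */;

#define TAP_A_IFINDEX 4		/* made-up ifindexes */
#define TAP_B_IFINDEX 5

SEC("xdp")
int xdp_tap_bounce(struct xdp_md *ctx)
{
	/* proposed read-only field, populated by the receiving driver */
	__u32 in_if = ctx->ifindex;

	/* toy policy: forward traffic between a pair of taps */
	__u32 out_if = (in_if == TAP_A_IFINDEX) ? TAP_B_IFINDEX
						: TAP_A_IFINDEX;

	return xdp_redirect(out_if, 0);
}

char _license[] SEC("license") = "GPL";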
Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On Wed, Apr 19, 2017 at 09:38:44PM -0700, John Fastabend wrote:
> On 17-04-19 07:56 PM, Alexei Starovoitov wrote:
> > On Thu, Apr 20, 2017 at 12:51:31AM +0200, Daniel Borkmann wrote:
> > >
> > > Is there a concrete reason that all the proposed future cases like sockets have to be handled within the very same XDP_REDIRECT return code? F.e. why not XDP_TX_NIC that only assumes ifindex as proposed in the patch, and future ones would get a different return code f.e. XDP_TX_SK only handling sockets when we get there implementation-wise?
> >
> > yeah. let's keep redirect to sockets, tunnels, crypto and exotic things out of this discussion. XDP_REDIRECT should assume an L2 raw packet is being redirected to another L2 netdev. If we make it too generic it will lose performance.
> >
> > For cls_bpf the ifindex concept is symmetric. The program can access it as skb->ifindex on receive and can redirect to another ifindex via the bpf_redirect() helper. Since the netdev is not locked, it's actually a big plus, since the container management control plane can simply delete netns+veth and it goes away. The program can have a dangling ifindex (if the control plane is buggy and didn't update the bpf side), but it's harmless. Packets that redirect to a non-existing ifindex are dropped. This approach is already understood and works well, so for XDP I suggest to use the same approach initially before starting to reinvent the wheel. struct xdp_md needs an ifindex field and an xdp_redirect() helper that redirects to an L2 netdev only. That's it. Simple and easy. I think the main use case in John's and Jesper's minds is something like xdpswitch, where packets are coming from VMs and from physical eths and being redirected to either a physical eth or to a VM via the upcoming vhost+xdp support. This xdp_md->ifindex + xdp_redirect(to_ifindex) will solve it just fine.
>
> hmm, I must be missing something; the bpf_redirect() helper should be used as a return statement, e.g.
>
> return bpf_redirect(ifindex, flags);
>
> It's a bit awkward to use in any other way. You would have to ensure the program returns TC_ACT_REDIRECT in all cases, I presume. It seems incredibly fragile and gains nothing as far as I can tell. bpf_redirect() uses per-cpu data and expects the core to call skb_do_redirect() to push the packet into the correct netdev. bpf_redirect() does not modify the skb->ifindex value, looking at the code now.
>
> Are you suggesting using xdp_md to store the ifindex value instead of a per-cpu variable in the redirect helper?

no. i'm suggesting to store it in a per-cpu scratch area, just like cls_bpf does.

> Do you really mean the xdp_md struct in the uapi headers?

yes. since 'ifindex' needs to be part of the xdp_md struct in a read-only way, just like cls_bpf does it. Otherwise, if we attach the same program to multiple taps, it won't know which tap the traffic is arriving on and won't be able to redirect properly.

> I don't see why it needs to be in the UAPI at all. If we don't like per-cpu variables it could be pushed as part of xdp_buff I guess.

It's not about liking or disliking the per-cpu scratch area. My main point: it works just fine for cls_bpf, and i'm suggesting to do the same for xdp_redirect, since no one ever complained about that bit of cls_bpf.

> My suggestion is we could add an ifindex to the xdp_md structure and have the receiving NIC driver populate the value, assuming it is useful to programs.
> But, if we use cls_bpf as a model, then the xdp_md ifindex is more or less independent of the redirect helper.

exactly. the arriving ifindex is independent of the xdp_redirect helper.

> In my opinion we should avoid diverging cls bpf and xdp bpf in subtle ways like the handling of ifindex and redirect.

exactly. I'm saying the same thing. I'm not sure which part of my proposal was so confusing. Sorry about that.
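For reference, the per-cpu scratch mechanism both are pointing at looks roughly like the following, condensed and simplified from net/core/filter.c (flag validation and the egress/ingress split are elided, so treat it as a sketch rather than the exact code):

/* the helper only records the target in per-cpu scratch space... */
struct redirect_info {
	u32 ifindex;
	u32 flags;
};
static DEFINE_PER_CPU(struct redirect_info, redirect_info);

BPF_CALL_2(bpf_redirect, u32, ifindex, u64, flags)
{
	struct redirect_info *ri = this_cpu_ptr(&redirect_info);

	ri->ifindex = ifindex;
	ri->flags   = flags;
	return TC_ACT_REDIRECT;	/* ...and the action code tells the core to act */
}

int skb_do_redirect(struct sk_buff *skb)
{
	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
	struct net_device *dev;

	/* the run-time lookup a port table would optimize away */
	dev = dev_get_by_index_rcu(dev_net(skb->dev), ri->ifindex);
	ri->ifindex = 0;
	if (unlikely(!dev)) {
		kfree_skb(skb);	/* dangling ifindex: drop, harmless */
		return -EINVAL;
	}

	return __bpf_redirect(skb, dev, ri->flags);
}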
Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On 17-04-19 07:56 PM, Alexei Starovoitov wrote:
> On Thu, Apr 20, 2017 at 12:51:31AM +0200, Daniel Borkmann wrote:
>>
>> Is there a concrete reason that all the proposed future cases like sockets have to be handled within the very same XDP_REDIRECT return code? F.e. why not XDP_TX_NIC that only assumes ifindex as proposed in the patch, and future ones would get a different return code f.e. XDP_TX_SK only handling sockets when we get there implementation-wise?
>
> yeah. let's keep redirect to sockets, tunnels, crypto and exotic things out of this discussion. XDP_REDIRECT should assume an L2 raw packet is being redirected to another L2 netdev. If we make it too generic it will lose performance.
>
> For cls_bpf the ifindex concept is symmetric. The program can access it as skb->ifindex on receive and can redirect to another ifindex via the bpf_redirect() helper. Since the netdev is not locked, it's actually a big plus, since the container management control plane can simply delete netns+veth and it goes away. The program can have a dangling ifindex (if the control plane is buggy and didn't update the bpf side), but it's harmless. Packets that redirect to a non-existing ifindex are dropped. This approach is already understood and works well, so for XDP I suggest to use the same approach initially before starting to reinvent the wheel. struct xdp_md needs an ifindex field and an xdp_redirect() helper that redirects to an L2 netdev only. That's it. Simple and easy. I think the main use case in John's and Jesper's minds is something like xdpswitch, where packets are coming from VMs and from physical eths and being redirected to either a physical eth or to a VM via the upcoming vhost+xdp support. This xdp_md->ifindex + xdp_redirect(to_ifindex) will solve it just fine.

hmm, I must be missing something; the bpf_redirect() helper should be used as a return statement, e.g.

return bpf_redirect(ifindex, flags);

It's a bit awkward to use in any other way. You would have to ensure the program returns TC_ACT_REDIRECT in all cases, I presume. It seems incredibly fragile and gains nothing as far as I can tell. bpf_redirect() uses per-cpu data and expects the core to call skb_do_redirect() to push the packet into the correct netdev. bpf_redirect() does not modify the skb->ifindex value, looking at the code now.

Are you suggesting using xdp_md to store the ifindex value instead of a per-cpu variable in the redirect helper? Do you really mean the xdp_md struct in the uapi headers? I don't see why it needs to be in the UAPI at all. If we don't like per-cpu variables it could be pushed as part of xdp_buff I guess.

My suggestion is we could add an ifindex to the xdp_md structure and have the receiving NIC driver populate the value, assuming it is useful to programs. But, if we use cls_bpf as a model, then the xdp_md ifindex is more or less independent of the redirect helper. In my opinion we should avoid diverging cls bpf and xdp bpf in subtle ways like the handling of ifindex and redirect.

> Once we have vhost+xdp and all other bits implemented, we must come back to this discussion about having a port mapping table. As I mentioned during netconf I think it's very useful, but I don't think we should gate the vhost+xdp and xdp_redirect work on this discussion. As far as this port mapping table goes, we would need a 'port' field in xdp_md as well, and xdp_redirect_port() done via a helper or via an extra 'out_port' field in xdp_md plus another XDP_REDIRECT_PORT action code.
> The actual port table (array) should be populated by user space with netdevs, and these netdevs will have their refcnt incremented. Then we'll have a discussion about what to do with netdev_unregister notifiers, whether they should be auto-removed from the port table or bpf should have a chance to be notified and act on it. Such a port mapping will allow us to optimize the inevitable call dev_get_by_index_rcu(dev_net(skb->dev), ri->ifindex); away, since netdevs will be stored in the port table and a direct deref of port_map_array[xdp_md->out_port] will give us the target netdev quickly.
>
> It's a nice optimization and there are other more powerful optimizations we can do with such a port table (since we will know in advance which netdevs the program will be redirecting to), but I still think we should do the ifindex-based xdp_redirect first and only add this port table later.

Agreed on all this; further optimization would be welcome, of course, after the basic implementation is in place.
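Program-side, the bpf_redirect()-as-return-statement pattern John describes above looks like this in a minimal cls_bpf sketch (the ifindex constant is made up, and error handling is omitted):

#include <linux/bpf.h>
#include "bpf_helpers.h"	/* SEC(), bpf_redirect(), as in samples/bpf */

#define TARGET_IFINDEX 4	/* made-up egress ifindex */

SEC("classifier")
int tc_redir(struct __sk_buff *skb)
{
	/* records the target in the per-cpu scratch area and returns
	 * TC_ACT_REDIRECT, which makes the core call skb_do_redirect() */
	return bpf_redirect(TARGET_IFINDEX, 0);
}

char _license[] SEC("license") = "GPL";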
xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On Thu, Apr 20, 2017 at 12:51:31AM +0200, Daniel Borkmann wrote:
>
> Is there a concrete reason that all the proposed future cases like sockets have to be handled within the very same XDP_REDIRECT return code? F.e. why not XDP_TX_NIC that only assumes ifindex as proposed in the patch, and future ones would get a different return code f.e. XDP_TX_SK only handling sockets when we get there implementation-wise?

yeah. let's keep redirect to sockets, tunnels, crypto and exotic things out of this discussion. XDP_REDIRECT should assume an L2 raw packet is being redirected to another L2 netdev. If we make it too generic it will lose performance.

For cls_bpf the ifindex concept is symmetric. The program can access it as skb->ifindex on receive and can redirect to another ifindex via the bpf_redirect() helper. Since the netdev is not locked, it's actually a big plus, since the container management control plane can simply delete netns+veth and it goes away. The program can have a dangling ifindex (if the control plane is buggy and didn't update the bpf side), but it's harmless. Packets that redirect to a non-existing ifindex are dropped. This approach is already understood and works well, so for XDP I suggest to use the same approach initially before starting to reinvent the wheel. struct xdp_md needs an ifindex field and an xdp_redirect() helper that redirects to an L2 netdev only. That's it. Simple and easy. I think the main use case in John's and Jesper's minds is something like xdpswitch, where packets are coming from VMs and from physical eths and being redirected to either a physical eth or to a VM via the upcoming vhost+xdp support. This xdp_md->ifindex + xdp_redirect(to_ifindex) will solve it just fine.

Once we have vhost+xdp and all other bits implemented, we must come back to this discussion about having a port mapping table. As I mentioned during netconf I think it's very useful, but I don't think we should gate the vhost+xdp and xdp_redirect work on this discussion. As far as this port mapping table goes, we would need a 'port' field in xdp_md as well, and xdp_redirect_port() done via a helper or via an extra 'out_port' field in xdp_md plus another XDP_REDIRECT_PORT action code.

The actual port table (array) should be populated by user space with netdevs, and these netdevs will have their refcnt incremented. Then we'll have a discussion about what to do with netdev_unregister notifiers, whether they should be auto-removed from the port table or bpf should have a chance to be notified and act on it. Such a port mapping will allow us to optimize the inevitable call dev_get_by_index_rcu(dev_net(skb->dev), ri->ifindex); away, since netdevs will be stored in the port table and a direct deref of port_map_array[xdp_md->out_port] will give us the target netdev quickly.

It's a nice optimization and there are other more powerful optimizations we can do with such a port table (since we will know in advance which netdevs the program will be redirecting to), but I still think we should do the ifindex-based xdp_redirect first and only add this port table later.
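A sketch of what the "extra out_port field" variant of this proposal could look like from the program side; the out_port field, the XDP_REDIRECT_PORT action code and the egress port value are all hypothetical here:

#include <linux/bpf.h>
#include "bpf_helpers.h"	/* SEC(), as in samples/bpf */

#define XDP_REDIRECT_PORT 5	/* proposed new action code, value made up */

SEC("xdp")
int xdp_switch(struct xdp_md *ctx)
{
	__u32 egress_port = 2;		/* would come from real switching logic */

	ctx->out_port = egress_port;	/* proposed extra field in xdp_md */
	return XDP_REDIRECT_PORT;
}

char _license[] SEC("license") = "GPL";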