Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On 17-04-29 06:04 PM, Alexei Starovoitov wrote:
> On 4/28/17 3:58 AM, Jesper Dangaard Brouer wrote:
>> On Thu, 27 Apr 2017 16:31:14 -0700 Alexei Starovoitov wrote:
>>> On 4/27/17 1:41 AM, Jesper Dangaard Brouer wrote:
>>>> When registering/attaching an XDP/bpf program, we would just send the
>>>> file-descriptor for this port-map along (like we do with the bpf_prog
>>>> FD), plus its own ingress-port number in the port-map.
>>>>
>>>> It is not clear to me in which data-structure on the kernel side we
>>>> store this reference to the port-map and ingress-port, as today we only
>>>> have the "raw" struct bpf_prog pointer. I see several options:
>>>>
>>>> 1. Create a new xdp_prog struct that contains the existing bpf_prog,
>>>>    a port-map pointer and the ingress-port. (IMHO the easiest solution)
>>>>
>>>> 2. Just create a new pointer to the port-map and store it in the driver
>>>>    rx-ring struct (like the existing bpf_prog), but this creates a
>>>>    race-challenge when replacing (cmpxchg) the program (or perhaps it's
>>>>    not a problem, as it runs under RCU and the RTNL-lock).
>>>>
>>>> 3. Extend bpf_prog to store this port-map and ingress-port, and have a
>>>>    fast way to access it. I assume it would be accessible via
>>>>    bpf_prog->bpf_prog_aux->used_maps[X], but that will be too slow for
>>>>    XDP.
>>>
>>> I'm not sure I completely follow the 3 proposals.
>>> Are you suggesting to have only one netdev_array per program?
>>
>> Yes, but I can see you have a more clever idea below.
>>
>>> Why not allow any number, like we do for tailcall+prog_array, etc.?
>>> We can teach the verifier to allow a new helper
>>>   bpf_tx_port(netdev_array, port_num);
>>> to only be used with the netdev_array map type.
>>> It will fetch the netdevice pointer from netdev_array[port_num]
>>> and will tx the packet into it.
>>
>> I love it.
>>
>> I just don't like the "netdev" part of the name "netdev_array", as one of
>> the basic ideas of a port table is that a port can be anything that can
>> consume an XDP_buff packet. This generalization allows us to move code
>> out of the drivers. We might be on the same page, as I do imagine that
>> netdev_array or port_array is just a struct bpf_map pointer, and the
>> bpf_map->map_type will tell us that this bpf_map contains net_device
>> pointers. Thus, when later introducing a new type of redirect (like to
>> a socket or remote-CPU), we just add a new bpf_map_type for it, without
>> needing to change anything in the drivers, right?
>
> In theory, yes, but in practice I doubt it will be so easy.
> We probably shouldn't allow very different types of netdev
> into the same netdev_array or port_array (whatever the name).
> They need to be similar enough, otherwise we'd have to do run-time
> checks. If they're all the same, these checks can be done at
> insertion time instead.

I think we can just have different map types for each redirect type.
So a netdev_map_array, socket_map_array, etc.

>> Do you imagine that the bpf-side bpf_tx_port() returns XDP_REDIRECT?
>> Or does it return whether the call was successful (e.g. validating that
>> port_num exists in the map)?
>
> don't know :)
> we need to brainstorm pros and cons.
>
>> On the kernel side, we need to receive this info "port_array" and
>> "port_num"; given you don't pass the call an xdp_buff/ctx, I assume you
>> want the per-CPU temp-store solution. Then during the XDP_REDIRECT
>> action we call a core redirect function that, based on the bpf_map_type,
>> does a lookup and finds the net_device ptr.
>
> hmm. didn't think that far either :)
> indeed makes sense to pass 'ctx' into such helper as well,
> so it's easier to deal with original netdev.
>>> We can make it similar to bpf_tail_call(), so that the program will
>>> finish on a successful bpf_tx_port(), or
>>> make it into a 'delayed' tx which will be executed when the program
>>> finishes. Not sure which approach is better.
>>
>> I know you are talking about something slightly different, about
>> delaying TX.
>>
>> But I want to mention (as I've done before) that it is important (for
>> me) that we get bulking working/integrated. I imagine the driver will
>> call a function that delays the TX/redirect action, and at the end
>> of the NAPI cycle a function that flushes the packets, bulked per
>> destination port.
>>
>> I was wondering where to store these delayed TX packets, but now that
>> we have an associated bpf_map data-structure (netdev_array), I'm thinking
>> about storing packets (ordered by port) inside that. And then have a
>> bpf_tx_flush(netdev_array) call in the driver (for every port-table-map
>> seen, which will likely be small).
>
> makes sense to me as well.
> Ideally we should try to make the api such that batching or no-batching
> can be the kernel's choice. The program will just do
>   xdp_tx_port(...something here...)
> and the kernel does what is best for performance. If it needs to delay
> the result to do batching, the api should allow that transparently.
> Like right now an xdp program does 'return XDP_TX;' and on the kernel
> side we can either do an immediate xmit (like we do today) or change it
> to do batching, and programs don't need to change.
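To make the proposed helper concrete, here is a minimal sketch of the BPF
program side. Everything in it is hypothetical: BPF_MAP_TYPE_NETDEV_ARRAY,
bpf_tx_port(), its helper id and its signature are proposals from this
thread, not existing kernel API; the ctx argument reflects the note above
about passing 'ctx' into the helper.

  /* Hypothetical sketch only: BPF_MAP_TYPE_NETDEV_ARRAY and bpf_tx_port()
   * are proposals from this thread, not existing kernel API. */
  #include <linux/bpf.h>
  #include "bpf_helpers.h"    /* samples/bpf style: SEC(), struct bpf_map_def */

  /* hypothetical helper declaration, in the style of bpf_helpers.h */
  static int (*bpf_tx_port)(void *ctx, void *map, __u32 port) =
          (void *) BPF_FUNC_tx_port;      /* BPF_FUNC_tx_port does not exist */

  struct bpf_map_def SEC("maps") tx_ports = {
          .type        = BPF_MAP_TYPE_NETDEV_ARRAY,  /* proposed map type */
          .key_size    = sizeof(__u32),   /* virtual port number */
          .value_size  = sizeof(__u32),   /* ifindex, filled by user space */
          .max_entries = 64,
  };

  SEC("xdp")
  int xdp_fwd(struct xdp_md *ctx)
  {
          __u32 egress_port = 1;  /* chosen by program policy */

          /* Proposed helper: the verifier would only accept a netdev-array
           * map here; the kernel resolves tx_ports[egress_port] to a
           * net_device and transmits, possibly bulked at NAPI-poll end. */
          return bpf_tx_port(ctx, &tx_ports, egress_port);
  }

Whether the helper itself returns XDP_REDIRECT, or a success/failure code
with the program returning the action, is exactly the open question above.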
Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On 4/28/17 12:43 PM, Hannes Frederic Sowa wrote:
> On 28.04.2017 07:30, Alexei Starovoitov wrote:
>> On 4/27/17 10:06 PM, John Fastabend wrote:
>>> That is more or less what I was thinking as well. The other question
>>> I have though is should we have a bpf_redirect() call for the simple
>>> case where I use the ifindex directly. This will be helpful for taking
>>> existing programs from tc_cls into xdp. I think it makes sense to have
>>> bpf_tx_allports(), bpf_tx_port(), and bpf_redirect().
>>
>> I think so too.
>> Once a netdevice is stored into a netdev_array map the netdevice is
>> pinned, and we need to figure out what to do if somebody tries to delete
>> it. Should we add a new netlink notifier that this netdev's refcnt is
>> almost zero and it's only in netdev_array(s)?
>
> We basically do that automatically in netdev_wait_allrefs:
>
>   pr_emerg("unregister_netdevice: waiting for %s to become free. Usage count = %d\n",
>            dev->name, refcnt);
>
> It is a very unpleasant warning, and users probably suspect a bug in the
> kernel at first. I don't think we should wait for user space to clean
> that up; we have to do it automatically from the kernel.
>
> Maybe we can introduce a special value that basically NOPs the
> transmission. The hash table itself would install a netdevice notifier
> and would clean all tables. Could definitely cause some storm in the
> kernel if a lot of keys are mapped to the same interface.
>
>> or should it be deleted from the array(s) automatically and
>> then user space will be notified post-deletion?
>> Both approaches have their pros and cons.
>
> I am leaning more towards deleting it automatically. But walking all
> tables, and in there all keys, might cause some unwanted load spikes.

yeah, when netdev is being unregistered we have to drop the ref
to avoid the warning.

Speaking of delete notifiers for maps. Until recently all deletions were
explicit, and the program could do an extra perf_event_submit() if it
needed to notify user space of a deletion. But right now we have the LRU
map that deletes automatically, and this netdev_array will be deleting
automatically too, so we probably need some sort of generic map notifier
for all map types. It can be as simple as user space sleeping on
read(map_fd, buf); or waiting for epoll on map_fd, and the kernel wakes
it up when a map element is deleted and pushes the 'key/value' pair into
the buf. That should be generic for all map types, but it needs some
mechanism to avoid blocking, since the number of deletions can be huge.
Something similar to perf's 'lost N records' event record should do the
trick.
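For the "delete it automatically" option, the cleanup could hang off the
existing netdevice notifier chain. A rough kernel-side sketch; the
netdev_map_* names and the global list are made up here, only the notifier
registration mechanics are existing kernel API:

  /* Rough sketch of automatic cleanup on NETDEV_UNREGISTER. The
   * netdev_map_* names and the global list are hypothetical; only the
   * notifier mechanics are existing kernel API. */
  static LIST_HEAD(netdev_map_list);      /* all netdev-array maps */

  static int netdev_map_notify(struct notifier_block *nb,
                               unsigned long event, void *ptr)
  {
          struct net_device *dev = netdev_notifier_info_to_dev(ptr);
          struct netdev_map *nmap;        /* hypothetical wrapper around bpf_map */

          if (event != NETDEV_UNREGISTER)
                  return NOTIFY_DONE;

          /* Drop this device from every netdev-array map so that
           * unregistration never stalls in netdev_wait_allrefs(); a
           * cleared slot simply NOPs the transmission, as suggested above. */
          list_for_each_entry(nmap, &netdev_map_list, list)
                  netdev_map_purge_dev(nmap, dev);        /* hypothetical */

          return NOTIFY_OK;
  }

  static struct notifier_block netdev_map_notifier = {
          .notifier_call = netdev_map_notify,
  };
  /* registered once at init: register_netdevice_notifier(&netdev_map_notifier) */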
Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On 4/28/17 3:58 AM, Jesper Dangaard Brouer wrote:
> On Thu, 27 Apr 2017 16:31:14 -0700 Alexei Starovoitov wrote:
>> On 4/27/17 1:41 AM, Jesper Dangaard Brouer wrote:
>>> When registering/attaching an XDP/bpf program, we would just send the
>>> file-descriptor for this port-map along (like we do with the bpf_prog
>>> FD), plus its own ingress-port number in the port-map.
>>>
>>> It is not clear to me in which data-structure on the kernel side we
>>> store this reference to the port-map and ingress-port, as today we only
>>> have the "raw" struct bpf_prog pointer. I see several options:
>>>
>>> 1. Create a new xdp_prog struct that contains the existing bpf_prog,
>>>    a port-map pointer and the ingress-port. (IMHO the easiest solution)
>>>
>>> 2. Just create a new pointer to the port-map and store it in the driver
>>>    rx-ring struct (like the existing bpf_prog), but this creates a
>>>    race-challenge when replacing (cmpxchg) the program (or perhaps it's
>>>    not a problem, as it runs under RCU and the RTNL-lock).
>>>
>>> 3. Extend bpf_prog to store this port-map and ingress-port, and have a
>>>    fast way to access it. I assume it would be accessible via
>>>    bpf_prog->bpf_prog_aux->used_maps[X], but that will be too slow for
>>>    XDP.
>>
>> I'm not sure I completely follow the 3 proposals.
>> Are you suggesting to have only one netdev_array per program?
>
> Yes, but I can see you have a more clever idea below.
>
>> Why not allow any number, like we do for tailcall+prog_array, etc.?
>> We can teach the verifier to allow a new helper
>>   bpf_tx_port(netdev_array, port_num);
>> to only be used with the netdev_array map type.
>> It will fetch the netdevice pointer from netdev_array[port_num]
>> and will tx the packet into it.
>
> I love it.
>
> I just don't like the "netdev" part of the name "netdev_array", as one of
> the basic ideas of a port table is that a port can be anything that can
> consume an XDP_buff packet. This generalization allows us to move code
> out of the drivers. We might be on the same page, as I do imagine that
> netdev_array or port_array is just a struct bpf_map pointer, and the
> bpf_map->map_type will tell us that this bpf_map contains net_device
> pointers. Thus, when later introducing a new type of redirect (like to
> a socket or remote-CPU), we just add a new bpf_map_type for it, without
> needing to change anything in the drivers, right?

In theory, yes, but in practice I doubt it will be so easy.
We probably shouldn't allow very different types of netdev
into the same netdev_array or port_array (whatever the name).
They need to be similar enough, otherwise we'd have to do run-time
checks. If they're all the same, these checks can be done at
insertion time instead.

> Do you imagine that the bpf-side bpf_tx_port() returns XDP_REDIRECT?
> Or does it return whether the call was successful (e.g. validating that
> port_num exists in the map)?

don't know :)
we need to brainstorm pros and cons.

> On the kernel side, we need to receive this info "port_array" and
> "port_num"; given you don't pass the call an xdp_buff/ctx, I assume you
> want the per-CPU temp-store solution. Then during the XDP_REDIRECT
> action we call a core redirect function that, based on the bpf_map_type,
> does a lookup and finds the net_device ptr.

hmm. didn't think that far either :)
indeed makes sense to pass 'ctx' into such helper as well,
so it's easier to deal with original netdev.

>> We can make it similar to bpf_tail_call(), so that the program will
>> finish on a successful bpf_tx_port(), or
>> make it into a 'delayed' tx which will be executed when the program
>> finishes. Not sure which approach is better.
>
> I know you are talking about something slightly different, about
> delaying TX.
> But I want to mention (as I've done before) that it is important (for
> me) that we get bulking working/integrated. I imagine the driver will
> call a function that delays the TX/redirect action, and at the end
> of the NAPI cycle a function that flushes the packets, bulked per
> destination port.
>
> I was wondering where to store these delayed TX packets, but now that
> we have an associated bpf_map data-structure (netdev_array), I'm thinking
> about storing packets (ordered by port) inside that. And then have a
> bpf_tx_flush(netdev_array) call in the driver (for every port-table-map
> seen, which will likely be small).

makes sense to me as well.
Ideally we should try to make the api such that batching or no-batching
can be the kernel's choice. The program will just do
  xdp_tx_port(...something here...)
and the kernel does what is best for performance. If it needs to delay
the result to do batching, the api should allow that transparently.
Like right now an xdp program does 'return XDP_TX;' and on the kernel
side we can either do an immediate xmit (like we do today) or change it
to do batching, and programs don't need to change.

>> We can also extend this netdev_array into broadcast/multicast. Like
>>   bpf_tx_allports(netdev_array);
>> call from the program will xmit the packet to all netdevices
>> in that 'netdev_array' map type.
>
> When broadcasting you often don't want to broadcast the packet out of
> the incoming interface. How can you support this?
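From the driver side, the delayed-TX plus end-of-NAPI flush described
above might look roughly like the sketch below. The drv_* helpers, the
per-ring port_map pointer and bpf_tx_flush() are placeholders taken from
the discussion, not existing API; only bpf_prog_run_xdp() is real, and the
XDP_REDIRECT return code is itself still under discussion in this thread.

  /* Driver-side sketch of the proposed delayed-TX + bulk-flush model. */
  struct drv_rx_ring {                    /* hypothetical driver ring */
          struct napi_struct napi;
          struct bpf_prog *xdp_prog;
          struct bpf_map *port_map;       /* netdev_array seen by this ring */
  };

  static int drv_napi_poll(struct napi_struct *napi, int budget)
  {
          struct drv_rx_ring *ring = container_of(napi, struct drv_rx_ring, napi);
          int done = 0;

          while (done < budget && drv_rx_pending(ring)) {
                  struct xdp_buff xdp;

                  drv_fill_xdp_buff(ring, &xdp);  /* hypothetical */
                  switch (bpf_prog_run_xdp(ring->xdp_prog, &xdp)) {
                  case XDP_REDIRECT:
                          /* packet was queued on the netdev_array map by
                           * bpf_tx_port(), bulked per destination port */
                          break;
                  case XDP_TX:
                          drv_xmit_back(ring, &xdp);      /* hypothetical */
                          break;
                  default:
                          drv_drop(ring, &xdp);           /* hypothetical */
                          break;
                  }
                  done++;
          }

          /* flush everything delayed during this NAPI cycle: one bulk
           * xmit per destination port */
          bpf_tx_flush(ring->port_map);
          return done;
  }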
Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On 28.04.2017 07:30, Alexei Starovoitov wrote:
> On 4/27/17 10:06 PM, John Fastabend wrote:
>> That is more or less what I was thinking as well. The other question
>> I have though is should we have a bpf_redirect() call for the simple
>> case where I use the ifindex directly. This will be helpful for taking
>> existing programs from tc_cls into xdp. I think it makes sense to have
>> bpf_tx_allports(), bpf_tx_port(), and bpf_redirect().
>
> I think so too.
> Once a netdevice is stored into a netdev_array map the netdevice is
> pinned, and we need to figure out what to do if somebody tries to delete
> it. Should we add a new netlink notifier that this netdev's refcnt is
> almost zero and it's only in netdev_array(s)?

We basically do that automatically in netdev_wait_allrefs:

  pr_emerg("unregister_netdevice: waiting for %s to become free. Usage count = %d\n",
           dev->name, refcnt);

It is a very unpleasant warning, and users probably suspect a bug in the
kernel at first. I don't think we should wait for user space to clean
that up; we have to do it automatically from the kernel.

Maybe we can introduce a special value that basically NOPs the
transmission. The hash table itself would install a netdevice notifier
and would clean all tables. Could definitely cause some storm in the
kernel if a lot of keys are mapped to the same interface.

> or should it be deleted from the array(s) automatically and
> then user space will be notified post-deletion?
> Both approaches have their pros and cons.

I am leaning more towards deleting it automatically. But walking all
tables, and in there all keys, might cause some unwanted load spikes.

> Whereas the raw ifindex approach (via bpf_redirect) doesn't have these
> caveats. It's clear to both the bpf prog and user space that the ifindex
> can be stale, and user space needs to monitor netdevs and update
> programs/maps.

A separate type for ifindex as key or value might be nice to expose this
information directly via the kernel (fdinfo etc.), but at the same time,
debugging infrastructure from user space can also easily deal with that.

Another approach would be: ifindexes are allocated cyclically and are
also signed int, not u32, during allocation. Maybe we can negate the
ifindex in the table during transmission and thus mark it as stale (or
set it to -1)? This update would be done by the bpf_tx_*ports() helpers,
which take a reference to a table and a key, submit the packets on the
appropriate ports, and could flag the relevant ifindexes as stale.

Just wanted to draft this idea; I am not particularly happy with it.
Maybe someone comes up with a better one.

Thanks,
Hannes
Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On Thu, 27 Apr 2017 16:31:14 -0700 Alexei Starovoitov wrote:
> On 4/27/17 1:41 AM, Jesper Dangaard Brouer wrote:
>> When registering/attaching an XDP/bpf program, we would just send the
>> file-descriptor for this port-map along (like we do with the bpf_prog
>> FD), plus its own ingress-port number in the port-map.
>>
>> It is not clear to me in which data-structure on the kernel side we
>> store this reference to the port-map and ingress-port, as today we only
>> have the "raw" struct bpf_prog pointer. I see several options:
>>
>> 1. Create a new xdp_prog struct that contains the existing bpf_prog,
>>    a port-map pointer and the ingress-port. (IMHO the easiest solution)
>>
>> 2. Just create a new pointer to the port-map and store it in the driver
>>    rx-ring struct (like the existing bpf_prog), but this creates a
>>    race-challenge when replacing (cmpxchg) the program (or perhaps it's
>>    not a problem, as it runs under RCU and the RTNL-lock).
>>
>> 3. Extend bpf_prog to store this port-map and ingress-port, and have a
>>    fast way to access it. I assume it would be accessible via
>>    bpf_prog->bpf_prog_aux->used_maps[X], but that will be too slow for
>>    XDP.
>
> I'm not sure I completely follow the 3 proposals.
> Are you suggesting to have only one netdev_array per program?

Yes, but I can see you have a more clever idea below.

> Why not allow any number, like we do for tailcall+prog_array, etc.?
> We can teach the verifier to allow a new helper
>   bpf_tx_port(netdev_array, port_num);
> to only be used with the netdev_array map type.
> It will fetch the netdevice pointer from netdev_array[port_num]
> and will tx the packet into it.

I love it.

I just don't like the "netdev" part of the name "netdev_array", as one of
the basic ideas of a port table is that a port can be anything that can
consume an XDP_buff packet. This generalization allows us to move code
out of the drivers. We might be on the same page, as I do imagine that
netdev_array or port_array is just a struct bpf_map pointer, and the
bpf_map->map_type will tell us that this bpf_map contains net_device
pointers. Thus, when later introducing a new type of redirect (like to
a socket or remote-CPU), we just add a new bpf_map_type for it, without
needing to change anything in the drivers, right?

Do you imagine that the bpf-side bpf_tx_port() returns XDP_REDIRECT?
Or does it return whether the call was successful (e.g. validating that
port_num exists in the map)?

On the kernel side, we need to receive this info "port_array" and
"port_num"; given you don't pass the call an xdp_buff/ctx, I assume you
want the per-CPU temp-store solution. Then during the XDP_REDIRECT
action we call a core redirect function that, based on the bpf_map_type,
does a lookup and finds the net_device ptr.

> We can make it similar to bpf_tail_call(), so that the program will
> finish on a successful bpf_tx_port(), or
> make it into a 'delayed' tx which will be executed when the program
> finishes. Not sure which approach is better.

I know you are talking about something slightly different, about
delaying TX.

But I want to mention (as I've done before) that it is important (for
me) that we get bulking working/integrated. I imagine the driver will
call a function that delays the TX/redirect action, and at the end of
the NAPI cycle a function that flushes the packets, bulked per
destination port.

I was wondering where to store these delayed TX packets, but now that
we have an associated bpf_map data-structure (netdev_array), I'm thinking
about storing packets (ordered by port) inside that.
And then have a bpf_tx_flush(netdev_array) call in the driver (for every
port-table-map seen, which will likely be small).

> We can also extend this netdev_array into broadcast/multicast. Like
>   bpf_tx_allports(netdev_array);
> call from the program will xmit the packet to all netdevices
> in that 'netdev_array' map type.

When broadcasting you often don't want to broadcast the packet out of
the incoming interface. How can you support this? Normally you would
know your ingress port, and then exclude that port in the broadcast.
But with many netdev_arrays, how does the program know its own ingress
port?

> The map-in-map support can be trivially extended to allow netdev_array;
> then the program can create N multicast groups of netdevices.
> Each multicast group == one netdev_array map.
> The user space will populate a hashmap with these netdev_arrays, and
> the bpf kernel side can select dynamically which multicast group to use
> to send the packets to. The bpf kernel side may look like:
>
>   struct bpf_netdev_array *netdev_array = bpf_map_lookup_elem(..., key);
>   if (!netdev_array)
>       ...
>   if (my_condition)
>       bpf_tx_allports(netdev_array);        /* broadcast to all netdevices */
>   else
>       bpf_tx_port(netdev_array, port_num);  /* tx into one netdevice */
>
> that's an artificial example. Just trying to point out
> that we shouldn't restrict the feature too soon.

I like how you solve the multicast problem. (But I do
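The exclude-the-ingress-port idea could be done by hand on the BPF side,
as in the sketch below. It assumes the delayed-tx variant of bpf_tx_port()
(the helper must not terminate the program) and a hypothetical
ingress_port field in struct xdp_md answering the question above; tx_ports
is the hypothetical netdev-array map declared in the earlier sketch.

  /* Sketch: exclude-ingress broadcast done manually from the BPF side.
   * ctx->ingress_port is a hypothetical field; bpf_tx_port() and the
   * tx_ports map are the hypothetical pieces from the earlier sketch. */
  #define MAX_PORTS 8

  SEC("xdp")
  int xdp_bcast(struct xdp_md *ctx)
  {
          __u32 port;

  #pragma unroll
          for (port = 0; port < MAX_PORTS; port++) {
                  if (port == ctx->ingress_port)  /* don't echo to sender */
                          continue;
                  bpf_tx_port(ctx, &tx_ports, port);
          }
          return XDP_REDIRECT;
  }

A bpf_tx_allports() variant taking an explicit exclude-port argument would
avoid the unrolled loop entirely.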
Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On 4/27/17 10:06 PM, John Fastabend wrote:
> That is more or less what I was thinking as well. The other question
> I have though is should we have a bpf_redirect() call for the simple
> case where I use the ifindex directly. This will be helpful for taking
> existing programs from tc_cls into xdp. I think it makes sense to have
> bpf_tx_allports(), bpf_tx_port(), and bpf_redirect().

I think so too.

Once a netdevice is stored into a netdev_array map the netdevice is
pinned, and we need to figure out what to do if somebody tries to delete
it. Should we add a new netlink notifier that this netdev's refcnt is
almost zero and it's only in netdev_array(s)?

Or should it be deleted from the array(s) automatically, and then user
space will be notified post-deletion? Both approaches have their pros
and cons.

Whereas the raw ifindex approach (via bpf_redirect) doesn't have these
caveats. It's clear to both the bpf prog and user space that the ifindex
can be stale, and user space needs to monitor netdevs and update
programs/maps.
Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On 17-04-27 04:31 PM, Alexei Starovoitov wrote:
> On 4/27/17 1:41 AM, Jesper Dangaard Brouer wrote:
>> When registering/attaching an XDP/bpf program, we would just send the
>> file-descriptor for this port-map along (like we do with the bpf_prog
>> FD), plus its own ingress-port number in the port-map.
>>
>> It is not clear to me in which data-structure on the kernel side we
>> store this reference to the port-map and ingress-port, as today we only
>> have the "raw" struct bpf_prog pointer. I see several options:
>>
>> 1. Create a new xdp_prog struct that contains the existing bpf_prog,
>>    a port-map pointer and the ingress-port. (IMHO the easiest solution)
>>
>> 2. Just create a new pointer to the port-map and store it in the driver
>>    rx-ring struct (like the existing bpf_prog), but this creates a
>>    race-challenge when replacing (cmpxchg) the program (or perhaps it's
>>    not a problem, as it runs under RCU and the RTNL-lock).
>>
>> 3. Extend bpf_prog to store this port-map and ingress-port, and have a
>>    fast way to access it. I assume it would be accessible via
>>    bpf_prog->bpf_prog_aux->used_maps[X], but that will be too slow for
>>    XDP.
>
> I'm not sure I completely follow the 3 proposals.
> Are you suggesting to have only one netdev_array per program?
> Why not allow any number, like we do for tailcall+prog_array, etc.?
> We can teach the verifier to allow a new helper
>   bpf_tx_port(netdev_array, port_num);
> to only be used with the netdev_array map type.
> It will fetch the netdevice pointer from netdev_array[port_num]
> and will tx the packet into it.
> We can make it similar to bpf_tail_call(), so that the program will
> finish on a successful bpf_tx_port(), or
> make it into a 'delayed' tx which will be executed when the program
> finishes. Not sure which approach is better.

My reaction would be to make it finish on success, but I would like to
write a few programs first and see. I can't think of any use for _not_
terminating, but maybe there is something I'm missing.

> We can also extend this netdev_array into broadcast/multicast. Like
>   bpf_tx_allports(netdev_array);
> call from the program will xmit the packet to all netdevices
> in that 'netdev_array' map type.

Yep, nice solution to the multicast problem.

> The map-in-map support can be trivially extended to allow netdev_array;
> then the program can create N multicast groups of netdevices.
> Each multicast group == one netdev_array map.
> The user space will populate a hashmap with these netdev_arrays, and
> the bpf kernel side can select dynamically which multicast group to use
> to send the packets to. The bpf kernel side may look like:
>
>   struct bpf_netdev_array *netdev_array = bpf_map_lookup_elem(..., key);
>   if (!netdev_array)
>       ...
>   if (my_condition)
>       bpf_tx_allports(netdev_array);        /* broadcast to all netdevices */
>   else
>       bpf_tx_port(netdev_array, port_num);  /* tx into one netdevice */
>
> that's an artificial example. Just trying to point out
> that we shouldn't restrict the feature too soon.

That is more or less what I was thinking as well. The other question
I have though is should we have a bpf_redirect() call for the simple
case where I use the ifindex directly. This will be helpful for taking
existing programs from tc_cls into xdp. I think it makes sense to have
bpf_tx_allports(), bpf_tx_port(), and bpf_redirect().

.John
Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On 4/27/17 1:41 AM, Jesper Dangaard Brouer wrote:
> When registering/attaching an XDP/bpf program, we would just send the
> file-descriptor for this port-map along (like we do with the bpf_prog
> FD), plus its own ingress-port number in the port-map.
>
> It is not clear to me in which data-structure on the kernel side we
> store this reference to the port-map and ingress-port, as today we only
> have the "raw" struct bpf_prog pointer. I see several options:
>
> 1. Create a new xdp_prog struct that contains the existing bpf_prog,
>    a port-map pointer and the ingress-port. (IMHO the easiest solution)
>
> 2. Just create a new pointer to the port-map and store it in the driver
>    rx-ring struct (like the existing bpf_prog), but this creates a
>    race-challenge when replacing (cmpxchg) the program (or perhaps it's
>    not a problem, as it runs under RCU and the RTNL-lock).
>
> 3. Extend bpf_prog to store this port-map and ingress-port, and have a
>    fast way to access it. I assume it would be accessible via
>    bpf_prog->bpf_prog_aux->used_maps[X], but that will be too slow for
>    XDP.

I'm not sure I completely follow the 3 proposals.
Are you suggesting to have only one netdev_array per program?
Why not allow any number, like we do for tailcall+prog_array, etc.?

We can teach the verifier to allow a new helper

  bpf_tx_port(netdev_array, port_num);

to only be used with the netdev_array map type. It will fetch the
netdevice pointer from netdev_array[port_num] and will tx the packet
into it.

We can make it similar to bpf_tail_call(), so that the program will
finish on a successful bpf_tx_port(), or make it into a 'delayed' tx
which will be executed when the program finishes. Not sure which
approach is better.

We can also extend this netdev_array into broadcast/multicast. Like

  bpf_tx_allports(netdev_array);

call from the program will xmit the packet to all netdevices in that
'netdev_array' map type.

The map-in-map support can be trivially extended to allow netdev_array;
then the program can create N multicast groups of netdevices.
Each multicast group == one netdev_array map.
The user space will populate a hashmap with these netdev_arrays, and
the bpf kernel side can select dynamically which multicast group to use
to send the packets to. The bpf kernel side may look like:

  struct bpf_netdev_array *netdev_array = bpf_map_lookup_elem(..., key);
  if (!netdev_array)
      ...
  if (my_condition)
      bpf_tx_allports(netdev_array);        /* broadcast to all netdevices */
  else
      bpf_tx_port(netdev_array, port_num);  /* tx into one netdevice */

that's an artificial example. Just trying to point out
that we shouldn't restrict the feature too soon.
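Filled out a little, that artificial example might read as below. The
outer map layout, select_group(), my_condition() and both bpf_tx_*
helpers remain hypothetical; only the map-in-map type itself exists.

  /* Filled-out version of the artificial example above. The outer map
   * layout, select_group(), my_condition() and both bpf_tx_* helpers
   * are hypothetical; only BPF_MAP_TYPE_HASH_OF_MAPS itself exists. */
  struct bpf_map_def SEC("maps") mcast_groups = {
          .type        = BPF_MAP_TYPE_HASH_OF_MAPS,
          .key_size    = sizeof(__u32),   /* multicast group id */
          .value_size  = sizeof(__u32),   /* fd of inner netdev_array map */
          .max_entries = 16,
  };

  SEC("xdp")
  int xdp_mcast(struct xdp_md *ctx)
  {
          __u32 group = select_group(ctx);        /* program policy, not shown */
          __u32 port_num = 0;
          struct bpf_netdev_array *netdev_array;

          netdev_array = bpf_map_lookup_elem(&mcast_groups, &group);
          if (!netdev_array)
                  return XDP_DROP;

          if (my_condition(ctx))
                  bpf_tx_allports(netdev_array);          /* broadcast group */
          else
                  bpf_tx_port(netdev_array, port_num);    /* one netdevice */

          return XDP_REDIRECT;
  }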
Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On Wed, 26 Apr 2017 16:55:44 -0400 Andy Gospodarek wrote:
> On Wed, Apr 26, 2017 at 10:58:45AM -0700, Alexei Starovoitov wrote:
>> On 4/26/17 9:35 AM, John Fastabend wrote:
>>>> As Alexei also mentioned before, ifindex vs port makes no real
>>>> difference seen from the bpf program side. It is userspace's
>>>> responsibility to add ifindex/port's to the bpf-maps, according to how
>>>> the bpf program "policy" wants to "connect" these ports. The
>>>> port-table system adds one extra step, of also adding this port to the
>>>> port-table (which lives inside the kernel).
>>>
>>> I'm not sure I understand the "lives inside the kernel" bit. I assumed
>>> the 'map' should be a bpf map and behave like any other bpf map.
>>>
>>> I wanted a new map to be defined, something like this from the bpf
>>> programmer side:
>>>
>>>   struct bpf_map_def SEC("maps") port_table = {
>>>       .type = BPF_MAP_TYPE_PORT_CONNECTION,
>>>       .key_size = sizeof(u32),
>>>       .value_size = BPF_PORT_CONNECTION_SIZE,
>>>       .max_entries = 256,
>>>   };
>>
>> I like the idea.
>> We have prog_array, perf_event_array, cgroup_array map specializations.
>> This one can be a new netdev_array with some new bpf_redirect-like
>> helper accessing it.
>>
>>>> When loading the XDP program, we also need to pass along a port table
>>>> "id" this XDP program is associated with (and if it doesn't exist you
>>>> create it). And your userspace "control-plane" application also needs
>>>> to know this port table "id" when adding a new port.
>>>
>>> So the user space application that is loading the program also needs
>>> to handle this map. This seems correct to me. But I don't see the
>>> value in making some new port table when we already have a well
>>> understood framework for maps.
>>
>> +1
>>
>>>> The concept of having multiple port tables is key. As this implies we
>>>> can have several simultaneous "data-planes" that are *isolated* from
>>>> each other. Think about how network-namespaces/containers want
>>>> isolation. A subtle thing I'm afraid to mention is that, opposed to the
>>>> ifindex model, a port table with a mapping to a net_device pointer
>>>> would allow (faster) delivery into the container's inner net_device,
>>>> which sort of violates the isolation, but I would argue it is not a
>>>> problem, as this net_device pointer could only be added from a process
>>>> within the namespace. I like this feature, but it could easily be
>>>> disallowed via port insertion-time validation.
>>>
>>> I think the above optimization should be allowed. And agree multiple
>>> port tables (maps?) are needed. Again all this points to using standard
>>> maps logic in my mind. For permissions and different domains, which I
>>> think you were starting to touch on, it looks like we could extend the
>>> pinning API. At the moment it does an inode_permission(inode, MAY_WRITE)
>>> check, but I presume this could be extended. None of this would be
>>> needed in v1 and could be added subsequently. read-only maps seem
>>> doable.
>>
>> this is a great idea. Once BPF_MAP_TYPE_NETDEV_ARRAY is populated
>> the user space can make it readonly to prevent further changes.
>>
>> From user space it can be done similar to perf_events/cgroups as well.
>>   bpf_map_update_elem(netdev_array, port_num, ...)
>> should work.
>> For bpf_map_lookup_elem() from such a netdev_array we can return the
>> ifindex back.
>> The bpf_map_show_fdinfo() can be customized as well to pretty print
>> the ifindexes of the netdevs stored in there.
>
> I agree with both of you on all of these points. Having the port
> redirection in a new type of map and/or array seems like the way to go.
>
> I understood Jesper's perspective when thinking about a way to pass a
> port-table id down, but I think the idea that the userspace loader code
> defining the maps is going to be the one making this link is the right
> idea, and handling things like ifindex changes (rather than identifiers
> that perform lookups in other tables) is going to have to be yet another
> exercise left up to the... user. :-)

I love this idea. Integrating the port table closer with the bpf-maps
infrastructure makes sense. This gives me a place to hook the code into,
instead of (re)inventing a new infrastructure for this port table, and
the interface will be more natural from a bpf-API point of view.

When registering/attaching an XDP/bpf program, we would just send the
file-descriptor for this port-map along (like we do with the bpf_prog
FD), plus its own ingress-port number in the port-map.

It is not clear to me in which data-structure on the kernel side we
store this reference to the port-map and ingress-port, as today we only
have the "raw" struct bpf_prog pointer. I see several options:

1. Create a new xdp_prog struct that contains the existing bpf_prog,
   a port-map pointer and the ingress-port. (IMHO the easiest solution)

2. Just create a new pointer to the port-map and store it in the driver
   rx-ring struct (like the existing bpf_prog), but this creates a
   race-challenge when replacing (cmpxchg) the program (or perhaps it's
   not a problem, as it runs under RCU and the RTNL-lock).

3. Extend bpf_prog to store this port-map and ingress-port, and have a
   fast way to access it. I assume it would be accessible via
   bpf_prog->bpf_prog_aux->used_maps[X], but that will be too slow for
   XDP.
Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On Wed, Apr 26, 2017 at 10:58:45AM -0700, Alexei Starovoitov wrote:
> On 4/26/17 9:35 AM, John Fastabend wrote:
>>> As Alexei also mentioned before, ifindex vs port makes no real
>>> difference seen from the bpf program side. It is userspace's
>>> responsibility to add ifindex/port's to the bpf-maps, according to how
>>> the bpf program "policy" wants to "connect" these ports. The
>>> port-table system adds one extra step, of also adding this port to the
>>> port-table (which lives inside the kernel).
>>
>> I'm not sure I understand the "lives inside the kernel" bit. I assumed
>> the 'map' should be a bpf map and behave like any other bpf map.
>>
>> I wanted a new map to be defined, something like this from the bpf
>> programmer side:
>>
>>   struct bpf_map_def SEC("maps") port_table = {
>>       .type = BPF_MAP_TYPE_PORT_CONNECTION,
>>       .key_size = sizeof(u32),
>>       .value_size = BPF_PORT_CONNECTION_SIZE,
>>       .max_entries = 256,
>>   };
>
> I like the idea.
> We have prog_array, perf_event_array, cgroup_array map specializations.
> This one can be a new netdev_array with some new bpf_redirect-like
> helper accessing it.
>
>>> When loading the XDP program, we also need to pass along a port table
>>> "id" this XDP program is associated with (and if it doesn't exist you
>>> create it). And your userspace "control-plane" application also needs
>>> to know this port table "id" when adding a new port.
>>
>> So the user space application that is loading the program also needs
>> to handle this map. This seems correct to me. But I don't see the
>> value in making some new port table when we already have a well
>> understood framework for maps.
>
> +1
>
>>> The concept of having multiple port tables is key. As this implies we
>>> can have several simultaneous "data-planes" that are *isolated* from
>>> each other. Think about how network-namespaces/containers want
>>> isolation. A subtle thing I'm afraid to mention is that, opposed to the
>>> ifindex model, a port table with a mapping to a net_device pointer
>>> would allow (faster) delivery into the container's inner net_device,
>>> which sort of violates the isolation, but I would argue it is not a
>>> problem, as this net_device pointer could only be added from a process
>>> within the namespace. I like this feature, but it could easily be
>>> disallowed via port insertion-time validation.
>>
>> I think the above optimization should be allowed. And agree multiple
>> port tables (maps?) are needed. Again all this points to using standard
>> maps logic in my mind. For permissions and different domains, which I
>> think you were starting to touch on, it looks like we could extend the
>> pinning API. At the moment it does an inode_permission(inode, MAY_WRITE)
>> check, but I presume this could be extended. None of this would be
>> needed in v1 and could be added subsequently. read-only maps seem
>> doable.
>
> this is a great idea. Once BPF_MAP_TYPE_NETDEV_ARRAY is populated
> the user space can make it readonly to prevent further changes.
>
> From user space it can be done similar to perf_events/cgroups as well.
>   bpf_map_update_elem(netdev_array, port_num, ...)
> should work.
> For bpf_map_lookup_elem() from such a netdev_array we can return the
> ifindex back.
> The bpf_map_show_fdinfo() can be customized as well to pretty print
> the ifindexes of the netdevs stored in there.

I agree with both of you on all of these points.
Having the port redirection in a new type of map and/or array seems like
the way to go.

I understood Jesper's perspective when thinking about a way to pass a
port-table id down, but I think the idea that the userspace loader code
defining the maps is going to be the one making this link is the right
idea, and handling things like ifindex changes (rather than identifiers
that perform lookups in other tables) is going to have to be yet another
exercise left up to the... user. :-)
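On the user-space side, populating such a map could look like the sketch
below, using the bpf(2) syscall wrappers from tools/lib/bpf. The
netdev-array map type itself, and any way of flipping the populated map
read-only afterwards, are still the proposed extensions discussed above.

  /* User-space sketch: fill a (hypothetical) netdev-array map with
   * virtual-port -> ifindex entries. bpf_map_update_elem() is the bpf(2)
   * syscall wrapper; making the map read-only afterwards is the proposed
   * extension, not existing API. */
  #include <linux/bpf.h>
  #include <net/if.h>
  #include <bpf/bpf.h>      /* tools/lib/bpf syscall wrappers */

  static int setup_ports(int map_fd)
  {
          struct { const char *name; __u32 port; } ports[] = {
                  { "eth0",  0 },
                  { "veth1", 1 },
          };

          for (unsigned int i = 0; i < sizeof(ports) / sizeof(ports[0]); i++) {
                  __u32 ifindex = if_nametoindex(ports[i].name);

                  if (!ifindex)
                          return -1;      /* netdev does not exist (yet) */
                  if (bpf_map_update_elem(map_fd, &ports[i].port,
                                          &ifindex, BPF_ANY))
                          return -1;
          }
          return 0;       /* map could now be made read-only (proposed) */
  }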
Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On 4/26/17 9:35 AM, John Fastabend wrote:
>> As Alexei also mentioned before, ifindex vs port makes no real
>> difference seen from the bpf program side. It is userspace's
>> responsibility to add ifindex/port's to the bpf-maps, according to how
>> the bpf program "policy" wants to "connect" these ports. The
>> port-table system adds one extra step, of also adding this port to the
>> port-table (which lives inside the kernel).
>
> I'm not sure I understand the "lives inside the kernel" bit. I assumed
> the 'map' should be a bpf map and behave like any other bpf map.
>
> I wanted a new map to be defined, something like this from the bpf
> programmer side:
>
>   struct bpf_map_def SEC("maps") port_table = {
>       .type = BPF_MAP_TYPE_PORT_CONNECTION,
>       .key_size = sizeof(u32),
>       .value_size = BPF_PORT_CONNECTION_SIZE,
>       .max_entries = 256,
>   };

I like the idea.
We have prog_array, perf_event_array, cgroup_array map specializations.
This one can be a new netdev_array with some new bpf_redirect-like
helper accessing it.

>> When loading the XDP program, we also need to pass along a port table
>> "id" this XDP program is associated with (and if it doesn't exist you
>> create it). And your userspace "control-plane" application also needs
>> to know this port table "id" when adding a new port.
>
> So the user space application that is loading the program also needs
> to handle this map. This seems correct to me. But I don't see the
> value in making some new port table when we already have a well
> understood framework for maps.

+1

>> The concept of having multiple port tables is key. As this implies we
>> can have several simultaneous "data-planes" that are *isolated* from
>> each other. Think about how network-namespaces/containers want
>> isolation. A subtle thing I'm afraid to mention is that, opposed to the
>> ifindex model, a port table with a mapping to a net_device pointer would
>> allow (faster) delivery into the container's inner net_device, which
>> sort of violates the isolation, but I would argue it is not a problem,
>> as this net_device pointer could only be added from a process within the
>> namespace. I like this feature, but it could easily be disallowed via
>> port insertion-time validation.
>
> I think the above optimization should be allowed. And agree multiple
> port tables (maps?) are needed. Again all this points to using standard
> maps logic in my mind. For permissions and different domains, which I
> think you were starting to touch on, it looks like we could extend the
> pinning API. At the moment it does an inode_permission(inode, MAY_WRITE)
> check, but I presume this could be extended. None of this would be
> needed in v1 and could be added subsequently. read-only maps seem
> doable.

this is a great idea. Once BPF_MAP_TYPE_NETDEV_ARRAY is populated
the user space can make it readonly to prevent further changes.

From user space it can be done similar to perf_events/cgroups as well.
  bpf_map_update_elem(netdev_array, port_num, ...)
should work.
For bpf_map_lookup_elem() from such a netdev_array we can return the
ifindex back.
The bpf_map_show_fdinfo() can be customized as well to pretty print
the ifindexes of the netdevs stored in there.
Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
[...]

>> Jesper, I was working up the code for the redirect piece for ixgbe and
>> virtio, please use this as a base for your virtual port number table.
>> I'll push an update onto github tomorrow. I think the table should drop
>> in fairly nicely.
>
> Cool, I will do that. Then I'll also have a redirect method to shape
> this around, and I would have to benchmark/test your ixgbe redirect.
>
> (John, please let me know what github tree we are talking about, and
> what branch)
>
>> One piece that isn't clear to me is how do you plan to instantiate and
>> program this table. Is it a new static bpf map that is created any
>> time we see the redirect command? I think this would be preferred.
>
> (This is difficult to explain without us misunderstanding each other)

Yep, and I'm not sure I follow :)

> As Alexei also mentioned before, ifindex vs port makes no real
> difference seen from the bpf program side. It is userspace's
> responsibility to add ifindex/port's to the bpf-maps, according to how
> the bpf program "policy" wants to "connect" these ports. The
> port-table system adds one extra step, of also adding this port to the
> port-table (which lives inside the kernel).

I'm not sure I understand the "lives inside the kernel" bit. I assumed
the 'map' should be a bpf map and behave like any other bpf map.

I wanted a new map to be defined, something like this from the bpf
programmer side:

  struct bpf_map_def SEC("maps") port_table = {
      .type = BPF_MAP_TYPE_PORT_CONNECTION,
      .key_size = sizeof(u32),
      .value_size = BPF_PORT_CONNECTION_SIZE,
      .max_entries = 256,
  };

> When loading the XDP program, we also need to pass along a port table
> "id" this XDP program is associated with (and if it doesn't exist you
> create it). And your userspace "control-plane" application also needs
> to know this port table "id" when adding a new port.

So the user space application that is loading the program also needs
to handle this map. This seems correct to me. But I don't see the
value in making some new port table when we already have a well
understood framework for maps.

> The concept of having multiple port tables is key. As this implies we
> can have several simultaneous "data-planes" that are *isolated* from
> each other. Think about how network-namespaces/containers want
> isolation. A subtle thing I'm afraid to mention is that, opposed to the
> ifindex model, a port table with a mapping to a net_device pointer would
> allow (faster) delivery into the container's inner net_device, which
> sort of violates the isolation, but I would argue it is not a problem,
> as this net_device pointer could only be added from a process within the
> namespace. I like this feature, but it could easily be disallowed via
> port insertion-time validation.

I think the above optimization should be allowed. And agree multiple
port tables (maps?) are needed. Again all this points to using standard
maps logic in my mind. For permissions and different domains, which I
think you were starting to touch on, it looks like we could extend the
pinning API. At the moment it does an inode_permission(inode, MAY_WRITE)
check, but I presume this could be extended. None of this would be
needed in v1 and could be added subsequently. read-only maps seem
doable.

> I'm not worried about the DROP case, I agree that is fine (as you also
> say). The problem is unintentionally sending a packet to a wrong
> ifindex. This is clearly an eBPF program error, BUT with XDP this
> becomes a very hard to debug program error.
> With TC-redirect/cls_bpf we can tcpdump the packets; with XDP there is
> no visibility into this happening (the NSA is going to love this
> "feature"). Maybe we could add yet-another tracepoint to allow
> debugging this. My proposal is that we simply remove the possibility
> for such program errors by, as you say, moving the validation from
> run-time into static insertion-time, via a port table.
>
>>> I think lack of tcpdump-like debugging in xdp is a separate issue.
>>> As I was saying in the other thread we have a trivial 'xdpdump'
>>> kern+user app that emits a pcap file, but it's too specific to how we
>>> use tail_calls+prog_array in our xdp setup. I'm working on the
>>> program chaining that will be generic and allow us to transparently
>>> add multiple xdp or tc progs to the same attachment point, and will
>>> allow us to do 'xdpdump' at any point of this pipeline, so
>>> debugging of what happened to the packet will be easier and done in
>>> the same way for both tc and xdp.
>>> btw in our experience working with both tc and xdp, the tc+bpf was
>>> actually harder to use and more bug prone.
>>
>> Nice, the tcpdump-like debugging looks interesting.
>
> Yes, this xdpdump sounds like a very useful tool.
Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On Tue, 25 Apr 2017 20:07:34 -0700 John Fastabend wrote:
> On 17-04-25 05:26 PM, Alexei Starovoitov wrote:
>> On Tue, Apr 25, 2017 at 11:34:53AM +0200, Jesper Dangaard Brouer wrote:
>>>> Note the very first bpf patchset years ago contained the port table
>>>> abstraction. ovs has the concept of vports as well. These two very
>>>> different projects needed a port table to provide a layer of
>>>> indirection between ifindex==netdev and a virtual port number.
>>>> This is still the case and I'd like to see this port table
>>>> implemented for both cls_bpf and xdp. In that sense xdp is not
>>>> special.
>>>
>>> Glad to hear you want to see this implemented, I will start coding on
>>> this then. Good point with cls_bpf, I was planning to make this port
>>> table strongly connected to XDP; guess I should also think of cls_bpf.
>>
>> perfect.
>> I think we should try to make all additions to the bpf networking world
>> usable for both tc and xdp, since both are actively used and
>> it wouldn't be great to have a cool feature for one but not the other.
>> I think the port table is an excellent candidate that applies to both.
>
> +1
>
> Jesper, I was working up the code for the redirect piece for ixgbe and
> virtio, please use this as a base for your virtual port number table.
> I'll push an update onto github tomorrow. I think the table should drop
> in fairly nicely.

Cool, I will do that. Then I'll also have a redirect method to shape
this around, and I would have to benchmark/test your ixgbe redirect.

(John, please let me know what github tree we are talking about, and
what branch)

> One piece that isn't clear to me is how do you plan to instantiate and
> program this table. Is it a new static bpf map that is created any
> time we see the redirect command? I think this would be preferred.

(This is difficult to explain without us misunderstanding each other)

As Alexei also mentioned before, ifindex vs port makes no real
difference seen from the bpf program side. It is userspace's
responsibility to add ifindex/port's to the bpf-maps, according to how
the bpf program "policy" wants to "connect" these ports. The
port-table system adds one extra step, of also adding this port to the
port-table (which lives inside the kernel).

When loading the XDP program, we also need to pass along a port table
"id" this XDP program is associated with (and if it doesn't exist you
create it). And your userspace "control-plane" application also needs
to know this port table "id" when adding a new port.

The concept of having multiple port tables is key. As this implies we
can have several simultaneous "data-planes" that are *isolated* from
each other. Think about how network-namespaces/containers want
isolation. A subtle thing I'm afraid to mention is that, opposed to the
ifindex model, a port table with a mapping to a net_device pointer would
allow (faster) delivery into the container's inner net_device, which
sort of violates the isolation, but I would argue it is not a problem,
as this net_device pointer could only be added from a process within the
namespace. I like this feature, but it could easily be disallowed via
port insertion-time validation.

>>> I'm not worried about the DROP case, I agree that is fine (as you
>>> also say). The problem is unintentionally sending a packet to a
>>> wrong ifindex. This is clearly an eBPF program error, BUT with
>>> XDP this becomes a very hard to debug program error.
>>> With TC-redirect/cls_bpf we can tcpdump the packets; with XDP there
>>> is no visibility into this happening (the NSA is going to love this
>>> "feature"). Maybe we could add yet-another tracepoint to allow
>>> debugging this. My proposal is that we simply remove the possibility
>>> for such program errors by, as you say, moving the validation from
>>> run-time into static insertion-time, via a port table.
>>
>> I think lack of tcpdump-like debugging in xdp is a separate issue.
>> As I was saying in the other thread we have a trivial 'xdpdump'
>> kern+user app that emits a pcap file, but it's too specific to how we
>> use tail_calls+prog_array in our xdp setup. I'm working on the
>> program chaining that will be generic and allow us to transparently
>> add multiple xdp or tc progs to the same attachment point, and will
>> allow us to do 'xdpdump' at any point of this pipeline, so
>> debugging of what happened to the packet will be easier and done in
>> the same way for both tc and xdp.
>> btw in our experience working with both tc and xdp, the tc+bpf was
>> actually harder to use and more bug prone.
>
> Nice, the tcpdump-like debugging looks interesting.

Yes, this xdpdump sounds like a very useful tool.

-- 
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On 17-04-25 05:26 PM, Alexei Starovoitov wrote:
> On Tue, Apr 25, 2017 at 11:34:53AM +0200, Jesper Dangaard Brouer wrote:
>>> Note the very first bpf patchset years ago contained the port table
>>> abstraction. ovs has the concept of vports as well. These two very
>>> different projects needed a port table to provide a layer of
>>> indirection between ifindex==netdev and a virtual port number.
>>> This is still the case and I'd like to see this port table
>>> implemented for both cls_bpf and xdp. In that sense xdp is not
>>> special.
>>
>> Glad to hear you want to see this implemented, I will start coding on
>> this then. Good point with cls_bpf, I was planning to make this port
>> table strongly connected to XDP; guess I should also think of cls_bpf.
>
> perfect.
> I think we should try to make all additions to the bpf networking world
> usable for both tc and xdp, since both are actively used and
> it wouldn't be great to have a cool feature for one but not the other.
> I think the port table is an excellent candidate that applies to both.

+1

Jesper, I was working up the code for the redirect piece for ixgbe and
virtio, please use this as a base for your virtual port number table.
I'll push an update onto github tomorrow. I think the table should drop
in fairly nicely.

One piece that isn't clear to me is how do you plan to instantiate and
program this table. Is it a new static bpf map that is created any
time we see the redirect command? I think this would be preferred.

>> I'm not worried about the DROP case, I agree that is fine (as you also
>> say). The problem is unintentionally sending a packet to a wrong
>> ifindex. This is clearly an eBPF program error, BUT with XDP this
>> becomes a very hard to debug program error. With TC-redirect/cls_bpf
>> we can tcpdump the packets; with XDP there is no visibility into this
>> happening (the NSA is going to love this "feature"). Maybe we could add
>> yet-another tracepoint to allow debugging this. My proposal is that we
>> simply remove the possibility for such program errors by, as you say,
>> moving the validation from run-time into static insertion-time, via a
>> port table.
>
> I think lack of tcpdump-like debugging in xdp is a separate issue.
> As I was saying in the other thread we have a trivial 'xdpdump'
> kern+user app that emits a pcap file, but it's too specific to how we
> use tail_calls+prog_array in our xdp setup. I'm working on the program
> chaining that will be generic and allow us to transparently add
> multiple xdp or tc progs to the same attachment point, and will allow
> us to do 'xdpdump' at any point of this pipeline, so debugging of what
> happened to the packet will be easier and done in the same way
> for both tc and xdp.
> btw in our experience working with both tc and xdp, the tc+bpf was
> actually harder to use and more bug prone.

Nice, the tcpdump-like debugging looks interesting.
Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On Tue, Apr 25, 2017 at 11:34:53AM +0200, Jesper Dangaard Brouer wrote:
>> Note the very first bpf patchset years ago contained the port table
>> abstraction. ovs has the concept of vports as well. These two very
>> different projects needed a port table to provide a layer of
>> indirection between ifindex==netdev and a virtual port number.
>> This is still the case and I'd like to see this port table
>> implemented for both cls_bpf and xdp. In that sense xdp is not
>> special.
>
> Glad to hear you want to see this implemented, I will start coding on
> this then. Good point with cls_bpf, I was planning to make this port
> table strongly connected to XDP; guess I should also think of cls_bpf.

perfect.
I think we should try to make all additions to the bpf networking world
usable for both tc and xdp, since both are actively used and
it wouldn't be great to have a cool feature for one but not the other.
I think the port table is an excellent candidate that applies to both.

> I'm not worried about the DROP case, I agree that is fine (as you also
> say). The problem is unintentionally sending a packet to a wrong
> ifindex. This is clearly an eBPF program error, BUT with XDP this
> becomes a very hard to debug program error. With TC-redirect/cls_bpf
> we can tcpdump the packets; with XDP there is no visibility into this
> happening (the NSA is going to love this "feature"). Maybe we could add
> yet-another tracepoint to allow debugging this. My proposal is that we
> simply remove the possibility for such program errors by, as you say,
> moving the validation from run-time into static insertion-time, via a
> port table.

I think lack of tcpdump-like debugging in xdp is a separate issue.
As I was saying in the other thread we have a trivial 'xdpdump'
kern+user app that emits a pcap file, but it's too specific to how we
use tail_calls+prog_array in our xdp setup. I'm working on the program
chaining that will be generic and allow us to transparently add
multiple xdp or tc progs to the same attachment point, and will allow
us to do 'xdpdump' at any point of this pipeline, so debugging of what
happened to the packet will be easier and done in the same way
for both tc and xdp.
btw in our experience working with both tc and xdp, the tc+bpf was
actually harder to use and more bug prone.
Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On Thu, 20 Apr 2017 10:10:08 -0700 Alexei Starovoitov wrote:
> On Thu, Apr 20, 2017 at 08:10:51AM +0200, Jesper Dangaard Brouer wrote:
>> On Wed, 19 Apr 2017 19:56:13 -0700 Alexei Starovoitov wrote:
>>> On Thu, Apr 20, 2017 at 12:51:31AM +0200, Daniel Borkmann wrote:
>>>>
>>>> Is there a concrete reason that all the proposed future cases like
>>>> sockets have to be handled within the very same XDP_REDIRECT return
>>>> code? F.e. why not XDP_TX_NIC that only assumes ifindex as proposed
>>>> in the patch, and future ones would get a different return code f.e.
>>>> XDP_TX_SK only handling sockets when we get there
>>>> implementation-wise?
>>>
>>> yeah. let's keep redirect to sockets, tunnels, crypto and exotic
>>> things out of this discussion.
>>> XDP_REDIRECT should assume an L2 raw packet is being redirected to
>>> another L2 netdev. If we make it too generic it will lose performance.
>>>
>>> For cls_bpf the ifindex concept is symmetric. The program can access
>>> it as skb->ifindex on receive and can redirect to another ifindex via
>>> the bpf_redirect() helper.
>>> Since the netdev is not locked, it's actually a big plus, since the
>>> container management control plane can simply delete netns+veth and
>>> it goes away. The program can have a dangling ifindex (if the control
>>> plane is buggy and didn't update the bpf side), but it's harmless.
>>> Packets that redirect to a non-existing ifindex are dropped.
>>> This approach is already understood and works well, so for XDP I
>>> suggest to use the same approach initially, before starting to
>>> reinvent the wheel. struct xdp_md needs an ifindex field and an
>>> xdp_redirect() helper that redirects to an L2 netdev only. That's it.
>>> Simple and easy.
>>> I think the main use case in John's and Jesper's minds is something
>>> like xdpswitch, where packets are coming from VMs and from physical
>>> eths and being redirected to either a physical eth or to a VM via the
>>> upcoming vhost+xdp support.
>>> This xdp_md->ifindex + xdp_redirect(to_ifindex) will solve it just
>>> fine.
>>>
>>> Once we have vhost+xdp and all other bits implemented, we must come
>>> back to this discussion about having a port mapping table. As I
>>> mentioned during netconf I think it's very useful, but I don't think
>>> we should gate the vhost+xdp and xdp_redirect work on this discussion.
>>> As far as this port mapping table goes, we would need a 'port' field
>>> in xdp_md as well, and xdp_redirect_port() done via a helper or via an
>>> extra 'out_port' field in xdp_md plus another XDP_REDIRECT_PORT action
>>> code.
>>
>> Guess it would be easier to talk about if we name it "ingress_port" and
>> "egress_port".
>>
>>> The actual port table (array) should be populated by user space with
>>> netdevs, and these netdevs will have their refcnt incremented. Then
>>> we'll have a discussion what to do with netdev_unregister notifiers,
>>> whether entries should be auto-removed from the port table or bpf
>>> should have a chance to be notified and act on it.
>>> Such a port mapping will allow us to optimize the inevitable call
>>>   dev_get_by_index_rcu(dev_net(skb->dev), ri->ifindex);
>>> away, since netdevs will be stored in the port table and a direct
>>> deref port_map_array[xdp_md->out_port] will give us the target netdev
>>> quickly.
>>
>> I agree with the above paragraph, and am happy that you can see that
>> this will actually be faster than using ifindexes.
> > > It's a nice optimization and there are other more powerful optimizations we can do with such a port table (since we will know in advance which netdevs the program will be redirecting to), but I still think we should do the ifindex-based xdp_redirect first and only add this port table later.
> >
> > No, we cannot first do an ifindex-based xdp_redirect. The point of the port table is to sandbox which ports XDP can use.
>
> hmm. the port table cannot sandbox the ports. The only thing it does from a 'safety' point of view is moving the checks from run-time into static insertion time. So the checks that we would do on the netdev after looking it up based on ifindex are the same checks we will do at insertion time into the port table. User space will insert/delete them live from that port table, so from the program's point of view it's exactly the same as an ifindex. The ports can disappear and can be added while the program is running.

I agree that, from the eBPF program's point of view, using an ifindex or a port number is the same. And I do like this model, that this is just a number seen from bpf. It provides a clean separation between the kernel and eBPF program world.

> Note the very first bpf patchset years ago contained the port table abstraction. ovs has the concept of vports as well. These two very different projects needed a port table to provide a layer of indirection between ifindex==netdev and virtual port number.
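To make the "just a number seen from bpf" model concrete, here is a minimal program-side sketch. The map definition, the xdp_redirect_port() helper and its helper id are assumptions for illustration; none of this is existing uapi:

#include <linux/bpf.h>
#include "bpf_helpers.h"	/* SEC(), bpf_map_def, as in samples/bpf */

/* hypothetical helper: not in the kernel, declared samples/bpf-style */
static int (*xdp_redirect_port)(void *port_map, __u32 port) =
	(void *) 99 /* made-up helper id */;

/* Port table: user space binds small port numbers to netdevs; the
 * program only ever sees the numbers. A plain array stands in for
 * the netdev-holding map type under discussion. */
struct bpf_map_def SEC("maps") port_table = {
	.type        = BPF_MAP_TYPE_ARRAY,
	.key_size    = sizeof(__u32),
	.value_size  = sizeof(__u32),
	.max_entries = 64,
};

SEC("xdp")
int xdp_port_fwd(struct xdp_md *ctx)
{
	__u32 egress_port = 1;	/* would come from real forwarding logic */

	/* the kernel would deref port_table[egress_port] to find the
	 * target netdev; no ifindex lookup on the fast path */
	return xdp_redirect_port(&port_table, egress_port);
}

char _license[] SEC("license") = "GPL";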
Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On Thu, Apr 20, 2017 at 08:10:51AM +0200, Jesper Dangaard Brouer wrote:
> On Wed, 19 Apr 2017 19:56:13 -0700 Alexei Starovoitov wrote:
> > On Thu, Apr 20, 2017 at 12:51:31AM +0200, Daniel Borkmann wrote:
> > >
> > > Is there a concrete reason that all the proposed future cases like sockets have to be handled within the very same XDP_REDIRECT return code? F.e. why not XDP_TX_NIC that only assumes ifindex as proposed in the patch, and future ones would get a different return code f.e. XDP_TX_SK only handling sockets when we get there implementation-wise?
> >
> > yeah. let's keep redirect to sockets, tunnels, crypto and exotic things out of this discussion. XDP_REDIRECT should assume an L2 raw packet is being redirected to another L2 netdev. If we make it too generic it will lose performance.
> >
> > For cls_bpf the ifindex concept is symmetric. The program can access it as skb->ifindex on receive and can redirect to another ifindex via the bpf_redirect() helper. Since the netdev is not locked, it's actually a big plus, since the container management control plane can simply delete netns+veth and it goes away. The program can have a dangling ifindex (if the control plane is buggy and didn't update the bpf side), but it's harmless. Packets that redirect to a non-existing ifindex are dropped. This approach is already understood and works well, so for XDP I suggest to use the same approach initially before starting to reinvent the wheel. struct xdp_md needs an ifindex field and an xdp_redirect() helper that redirects to an L2 netdev only. That's it. Simple and easy. I think the main use case in John's and Jesper's minds is something like xdpswitch, where packets are coming from VMs and from physical eths and being redirected to either a physical eth or to a VM via the upcoming vhost+xdp support. This xdp_md->ifindex + xdp_redirect(to_ifindex) will solve it just fine.
> >
> > Once we have vhost+xdp and all other bits implemented, we must come back to this discussion about having a port mapping table. As I mentioned during netconf I think it's very useful, but I don't think we should gate the vhost+xdp and xdp_redirect work on this discussion. As far as this port mapping table goes, we would need a 'port' field in xdp_md as well, and xdp_redirect_port() done via a helper or via an extra 'out_port' field in xdp_md plus another XDP_REDIRECT_PORT action code.
>
> Guess it would be easier to talk about if we name it "ingress_port" and "egress_port".
>
> > The actual port table (array) should be populated by user space with netdevs, and these netdevs will have their refcnt incremented. Then we'll have a discussion about what to do with netdev_unregister notifiers, whether they should be auto-removed from the port table or bpf should have a chance to be notified and act on it. Such a port mapping will allow us to optimize the inevitable call dev_get_by_index_rcu(dev_net(skb->dev), ri->ifindex); away, since netdevs will be stored in the port table and a direct deref of port_map_array[xdp_md->out_port] will give us the target netdev quickly.
>
> I agree with the above paragraph, and am happy that you can see that this will actually be faster than using ifindexes.
> > It's a nice optimization and there are other more powerful optimizations we can do with such a port table (since we will know in advance which netdevs the program will be redirecting to), but I still think we should do the ifindex-based xdp_redirect first and only add this port table later.
>
> No, we cannot first do an ifindex-based xdp_redirect. The point of the port table is to sandbox which ports XDP can use.

hmm. the port table cannot sandbox the ports. The only thing it does from a 'safety' point of view is moving the checks from run-time into static insertion time. So the checks that we would do on the netdev after looking it up based on ifindex are the same checks we will do at insertion time into the port table. User space will insert/delete them live from that port table, so from the program's point of view it's exactly the same as an ifindex. The ports can disappear and can be added while the program is running.

Note the very first bpf patchset years ago contained the port table abstraction. ovs has the concept of vports as well. These two very different projects needed a port table to provide a layer of indirection between ifindex==netdev and virtual port number. This is still the case, and I'd like to see this port table implemented for both cls_bpf and xdp. In that sense xdp is not special.

> XDP is different from TC/cls_bpf, as it does "bypass"; there is no other layer that can stop or inspect these packets. The TC hooks redirect into the network stack, which has all the usual facilities available for filtering, inspection and debugging what is going on (e.g. tcpdump works for TC redirect).
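The insertion-time checks can be illustrated with a rough user-space sketch. The port_table_add() function and the choice of a plain ifindex-valued array map are assumptions; the real map type holding netdev references is exactly what is being debated above:

#include <linux/types.h>
#include <net/if.h>	/* if_indextoname(), IF_NAMESIZE */
#include <bpf/bpf.h>	/* bpf_map_update_elem() */

/* Validate the target once, when the entry is inserted, instead of on
 * every packet. The kernel side would additionally take a refcnt on
 * the netdev and deal with netdev_unregister, as discussed above. */
int port_table_add(int map_fd, __u32 port, __u32 ifindex)
{
	char name[IF_NAMESIZE];

	if (!if_indextoname(ifindex, name))
		return -1;	/* reject a dangling ifindex up front */

	return bpf_map_update_elem(map_fd, &port, &ifindex, BPF_ANY);
}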
Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On Wed, 19 Apr 2017 19:56:13 -0700 Alexei Starovoitov wrote:
> On Thu, Apr 20, 2017 at 12:51:31AM +0200, Daniel Borkmann wrote:
> >
> > Is there a concrete reason that all the proposed future cases like sockets have to be handled within the very same XDP_REDIRECT return code? F.e. why not XDP_TX_NIC that only assumes ifindex as proposed in the patch, and future ones would get a different return code f.e. XDP_TX_SK only handling sockets when we get there implementation-wise?
>
> yeah. let's keep redirect to sockets, tunnels, crypto and exotic things out of this discussion. XDP_REDIRECT should assume an L2 raw packet is being redirected to another L2 netdev. If we make it too generic it will lose performance.
>
> For cls_bpf the ifindex concept is symmetric. The program can access it as skb->ifindex on receive and can redirect to another ifindex via the bpf_redirect() helper. Since the netdev is not locked, it's actually a big plus, since the container management control plane can simply delete netns+veth and it goes away. The program can have a dangling ifindex (if the control plane is buggy and didn't update the bpf side), but it's harmless. Packets that redirect to a non-existing ifindex are dropped. This approach is already understood and works well, so for XDP I suggest to use the same approach initially before starting to reinvent the wheel. struct xdp_md needs an ifindex field and an xdp_redirect() helper that redirects to an L2 netdev only. That's it. Simple and easy. I think the main use case in John's and Jesper's minds is something like xdpswitch, where packets are coming from VMs and from physical eths and being redirected to either a physical eth or to a VM via the upcoming vhost+xdp support. This xdp_md->ifindex + xdp_redirect(to_ifindex) will solve it just fine.
>
> Once we have vhost+xdp and all other bits implemented, we must come back to this discussion about having a port mapping table. As I mentioned during netconf I think it's very useful, but I don't think we should gate the vhost+xdp and xdp_redirect work on this discussion. As far as this port mapping table goes, we would need a 'port' field in xdp_md as well, and xdp_redirect_port() done via a helper or via an extra 'out_port' field in xdp_md plus another XDP_REDIRECT_PORT action code.

Guess it would be easier to talk about if we name it "ingress_port" and "egress_port".

> The actual port table (array) should be populated by user space with netdevs, and these netdevs will have their refcnt incremented. Then we'll have a discussion about what to do with netdev_unregister notifiers, whether they should be auto-removed from the port table or bpf should have a chance to be notified and act on it. Such a port mapping will allow us to optimize the inevitable call dev_get_by_index_rcu(dev_net(skb->dev), ri->ifindex); away, since netdevs will be stored in the port table and a direct deref of port_map_array[xdp_md->out_port] will give us the target netdev quickly.

I agree with the above paragraph, and am happy that you can see that this will actually be faster than using ifindexes.

> It's a nice optimization and there are other more powerful optimizations we can do with such a port table (since we will know in advance which netdevs the program will be redirecting to), but I still think we should do the ifindex-based xdp_redirect first and only add this port table later.

No, we cannot first do an ifindex-based xdp_redirect. The point of the port table is to sandbox which ports XDP can use.
XDP is different from TC/cls_bpf, as it does "bypass"; there is no other layer that can stop or inspect these packets. The TC hooks redirect into the network stack, which has all the usual facilities available for filtering, inspection and debugging what is going on (e.g. tcpdump works for TC redirect).

-- 
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On 17-04-19 09:58 PM, Alexei Starovoitov wrote:
> On Wed, Apr 19, 2017 at 09:38:44PM -0700, John Fastabend wrote:
>> On 17-04-19 07:56 PM, Alexei Starovoitov wrote:
>>> On Thu, Apr 20, 2017 at 12:51:31AM +0200, Daniel Borkmann wrote:
>>>>
>>>> Is there a concrete reason that all the proposed future cases like sockets have to be handled within the very same XDP_REDIRECT return code? F.e. why not XDP_TX_NIC that only assumes ifindex as proposed in the patch, and future ones would get a different return code f.e. XDP_TX_SK only handling sockets when we get there implementation-wise?
>>>
>>> yeah. let's keep redirect to sockets, tunnels, crypto and exotic things out of this discussion. XDP_REDIRECT should assume an L2 raw packet is being redirected to another L2 netdev. If we make it too generic it will lose performance.
>>>
>>> For cls_bpf the ifindex concept is symmetric. The program can access it as skb->ifindex on receive and can redirect to another ifindex via the bpf_redirect() helper. Since the netdev is not locked, it's actually a big plus, since the container management control plane can simply delete netns+veth and it goes away. The program can have a dangling ifindex (if the control plane is buggy and didn't update the bpf side), but it's harmless. Packets that redirect to a non-existing ifindex are dropped. This approach is already understood and works well, so for XDP I suggest to use the same approach initially before starting to reinvent the wheel. struct xdp_md needs an ifindex field and an xdp_redirect() helper that redirects to an L2 netdev only. That's it. Simple and easy. I think the main use case in John's and Jesper's minds is something like xdpswitch, where packets are coming from VMs and from physical eths and being redirected to either a physical eth or to a VM via the upcoming vhost+xdp support. This xdp_md->ifindex + xdp_redirect(to_ifindex) will solve it just fine.
>>
>> hmm, I must be missing something; the bpf_redirect() helper should be used as a return statement, e.g.
>>
>> return bpf_redirect(ifindex, flags);
>>
>> It's a bit awkward to use in any other way. You would have to ensure the program returns TC_ACT_REDIRECT in all cases, I presume. It seems incredibly fragile and gains nothing as far as I can tell. bpf_redirect() uses per-cpu data and expects the core to call skb_do_redirect() to push the packet into the correct netdev. bpf_redirect() does not modify the skb->ifindex value, looking at the code now.
>>
>> Are you suggesting using xdp_md to store the ifindex value instead of a per-cpu variable in the redirect helper?
>
> no. i'm suggesting to store it in a per-cpu scratch area, just like cls_bpf does.
>
>> Do you really mean the xdp_md struct in the uapi headers?
>
> yes. since 'ifindex' needs to be part of the xdp_md struct in a read-only way, just like cls_bpf does it. Otherwise, if we attach the same program to multiple taps, it won't know which tap the traffic is arriving on and won't be able to redirect properly.
>
>> I don't see why it needs to be in the UAPI at all. If we don't like per-cpu variables it could be pushed as part of xdp_buff I guess.
>
> It's not about liking or disliking the per-cpu scratch area. My main point: it works just fine for cls_bpf, and i'm suggesting to do the same for xdp_redirect, since no one ever complained about that bit of cls_bpf.
>
>> My suggestion is we could add an ifindex to the xdp_md structure and have the receiving NIC driver populate the value, assuming it is useful to programs. But, if we use cls_bpf as a model, then the xdp_md ifindex is more or less independent of the redirect helper.
>
> exactly. the arriving ifindex is independent of the xdp_redirect helper.
>
>> In my opinion we should avoid diverging cls bpf and xdp bpf in subtle ways like the handling of ifindex and redirect.
>
> exactly. I'm saying the same thing. I'm not sure which part of my proposal was so confusing. Sorry about that.

Aha, what you are suggesting is exactly what I prototyped on virtio_net, so I'm happy :) I just didn't manage to parse it for whatever reason; must be getting tired.

Thanks,
John
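To show what the model John and Alexei converge on here would mean for a program attached to several taps, a small sketch follows. The ifindex field in xdp_md and the xdp_redirect() helper are still proposals at this point, and the helper id and the two ifindex constants are made up for the example:

#include <linux/bpf.h>
#include "bpf_helpers.h"	/* SEC(), as in samples/bpf */

/* hypothetical helper, declared samples/bpf-style */
static int (*xdp_redirect)(__u32 ifindex, __u64 flags) =
	(void *) 98 /* made-up helper id */;

#define TAP_A_IFINDEX 4		/* made-up ifindexes */
#define TAP_B_IFINDEX 5

SEC("xdp")
int xdp_tap_bounce(struct xdp_md *ctx)
{
	/* proposed read-only field, populated by the receiving driver */
	__u32 in_if = ctx->ifindex;

	/* toy policy: forward traffic between a pair of taps */
	__u32 out_if = (in_if == TAP_A_IFINDEX) ? TAP_B_IFINDEX
						: TAP_A_IFINDEX;

	return xdp_redirect(out_if, 0);
}

char _license[] SEC("license") = "GPL";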
Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On Wed, Apr 19, 2017 at 09:38:44PM -0700, John Fastabend wrote:
> On 17-04-19 07:56 PM, Alexei Starovoitov wrote:
> > On Thu, Apr 20, 2017 at 12:51:31AM +0200, Daniel Borkmann wrote:
> > >
> > > Is there a concrete reason that all the proposed future cases like sockets have to be handled within the very same XDP_REDIRECT return code? F.e. why not XDP_TX_NIC that only assumes ifindex as proposed in the patch, and future ones would get a different return code f.e. XDP_TX_SK only handling sockets when we get there implementation-wise?
> >
> > yeah. let's keep redirect to sockets, tunnels, crypto and exotic things out of this discussion. XDP_REDIRECT should assume an L2 raw packet is being redirected to another L2 netdev. If we make it too generic it will lose performance.
> >
> > For cls_bpf the ifindex concept is symmetric. The program can access it as skb->ifindex on receive and can redirect to another ifindex via the bpf_redirect() helper. Since the netdev is not locked, it's actually a big plus, since the container management control plane can simply delete netns+veth and it goes away. The program can have a dangling ifindex (if the control plane is buggy and didn't update the bpf side), but it's harmless. Packets that redirect to a non-existing ifindex are dropped. This approach is already understood and works well, so for XDP I suggest to use the same approach initially before starting to reinvent the wheel. struct xdp_md needs an ifindex field and an xdp_redirect() helper that redirects to an L2 netdev only. That's it. Simple and easy. I think the main use case in John's and Jesper's minds is something like xdpswitch, where packets are coming from VMs and from physical eths and being redirected to either a physical eth or to a VM via the upcoming vhost+xdp support. This xdp_md->ifindex + xdp_redirect(to_ifindex) will solve it just fine.
>
> hmm, I must be missing something; the bpf_redirect() helper should be used as a return statement, e.g.
>
> return bpf_redirect(ifindex, flags);
>
> It's a bit awkward to use in any other way. You would have to ensure the program returns TC_ACT_REDIRECT in all cases, I presume. It seems incredibly fragile and gains nothing as far as I can tell. bpf_redirect() uses per-cpu data and expects the core to call skb_do_redirect() to push the packet into the correct netdev. bpf_redirect() does not modify the skb->ifindex value, looking at the code now.
>
> Are you suggesting using xdp_md to store the ifindex value instead of a per-cpu variable in the redirect helper?

no. i'm suggesting to store it in a per-cpu scratch area, just like cls_bpf does.

> Do you really mean the xdp_md struct in the uapi headers?

yes. since 'ifindex' needs to be part of the xdp_md struct in a read-only way, just like cls_bpf does it. Otherwise, if we attach the same program to multiple taps, it won't know which tap the traffic is arriving on and won't be able to redirect properly.

> I don't see why it needs to be in the UAPI at all. If we don't like per-cpu variables it could be pushed as part of xdp_buff I guess.

It's not about liking or disliking the per-cpu scratch area. My main point: it works just fine for cls_bpf, and i'm suggesting to do the same for xdp_redirect, since no one ever complained about that bit of cls_bpf.

> My suggestion is we could add an ifindex to the xdp_md structure and have the receiving NIC driver populate the value, assuming it is useful to programs.
> But, if we use cls_bpf as a model, then the xdp_md ifindex is more or less independent of the redirect helper.

exactly. the arriving ifindex is independent of the xdp_redirect helper.

> In my opinion we should avoid diverging cls bpf and xdp bpf in subtle ways like the handling of ifindex and redirect.

exactly. I'm saying the same thing. I'm not sure which part of my proposal was so confusing. Sorry about that.
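For reference, the per-cpu scratch mechanism both are pointing at looks roughly like the following, condensed and simplified from net/core/filter.c (flag validation and the egress/ingress split are elided, so treat it as a sketch rather than the exact code):

/* the helper only records the target in per-cpu scratch space... */
struct redirect_info {
	u32 ifindex;
	u32 flags;
};
static DEFINE_PER_CPU(struct redirect_info, redirect_info);

BPF_CALL_2(bpf_redirect, u32, ifindex, u64, flags)
{
	struct redirect_info *ri = this_cpu_ptr(&redirect_info);

	ri->ifindex = ifindex;
	ri->flags   = flags;
	return TC_ACT_REDIRECT;	/* ...and the action code tells the core to act */
}

int skb_do_redirect(struct sk_buff *skb)
{
	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
	struct net_device *dev;

	/* the run-time lookup a port table would optimize away */
	dev = dev_get_by_index_rcu(dev_net(skb->dev), ri->ifindex);
	ri->ifindex = 0;
	if (unlikely(!dev)) {
		kfree_skb(skb);	/* dangling ifindex: drop, harmless */
		return -EINVAL;
	}

	return __bpf_redirect(skb, dev, ri->flags);
}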
Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On 17-04-19 07:56 PM, Alexei Starovoitov wrote:
> On Thu, Apr 20, 2017 at 12:51:31AM +0200, Daniel Borkmann wrote:
>>
>> Is there a concrete reason that all the proposed future cases like sockets have to be handled within the very same XDP_REDIRECT return code? F.e. why not XDP_TX_NIC that only assumes ifindex as proposed in the patch, and future ones would get a different return code f.e. XDP_TX_SK only handling sockets when we get there implementation-wise?
>
> yeah. let's keep redirect to sockets, tunnels, crypto and exotic things out of this discussion. XDP_REDIRECT should assume an L2 raw packet is being redirected to another L2 netdev. If we make it too generic it will lose performance.
>
> For cls_bpf the ifindex concept is symmetric. The program can access it as skb->ifindex on receive and can redirect to another ifindex via the bpf_redirect() helper. Since the netdev is not locked, it's actually a big plus, since the container management control plane can simply delete netns+veth and it goes away. The program can have a dangling ifindex (if the control plane is buggy and didn't update the bpf side), but it's harmless. Packets that redirect to a non-existing ifindex are dropped. This approach is already understood and works well, so for XDP I suggest to use the same approach initially before starting to reinvent the wheel. struct xdp_md needs an ifindex field and an xdp_redirect() helper that redirects to an L2 netdev only. That's it. Simple and easy. I think the main use case in John's and Jesper's minds is something like xdpswitch, where packets are coming from VMs and from physical eths and being redirected to either a physical eth or to a VM via the upcoming vhost+xdp support. This xdp_md->ifindex + xdp_redirect(to_ifindex) will solve it just fine.

hmm, I must be missing something; the bpf_redirect() helper should be used as a return statement, e.g.

return bpf_redirect(ifindex, flags);

It's a bit awkward to use in any other way. You would have to ensure the program returns TC_ACT_REDIRECT in all cases, I presume. It seems incredibly fragile and gains nothing as far as I can tell. bpf_redirect() uses per-cpu data and expects the core to call skb_do_redirect() to push the packet into the correct netdev. bpf_redirect() does not modify the skb->ifindex value, looking at the code now.

Are you suggesting using xdp_md to store the ifindex value instead of a per-cpu variable in the redirect helper? Do you really mean the xdp_md struct in the uapi headers? I don't see why it needs to be in the UAPI at all. If we don't like per-cpu variables it could be pushed as part of xdp_buff I guess.

My suggestion is we could add an ifindex to the xdp_md structure and have the receiving NIC driver populate the value, assuming it is useful to programs. But, if we use cls_bpf as a model, then the xdp_md ifindex is more or less independent of the redirect helper. In my opinion we should avoid diverging cls bpf and xdp bpf in subtle ways like the handling of ifindex and redirect.

> Once we have vhost+xdp and all other bits implemented, we must come back to this discussion about having a port mapping table. As I mentioned during netconf I think it's very useful, but I don't think we should gate the vhost+xdp and xdp_redirect work on this discussion. As far as this port mapping table goes, we would need a 'port' field in xdp_md as well, and xdp_redirect_port() done via a helper or via an extra 'out_port' field in xdp_md plus another XDP_REDIRECT_PORT action code.
> The actual port table (array) should be populated by user space with netdevs, and these netdevs will have their refcnt incremented. Then we'll have a discussion about what to do with netdev_unregister notifiers, whether they should be auto-removed from the port table or bpf should have a chance to be notified and act on it. Such a port mapping will allow us to optimize the inevitable call dev_get_by_index_rcu(dev_net(skb->dev), ri->ifindex); away, since netdevs will be stored in the port table and a direct deref of port_map_array[xdp_md->out_port] will give us the target netdev quickly.
>
> It's a nice optimization and there are other more powerful optimizations we can do with such a port table (since we will know in advance which netdevs the program will be redirecting to), but I still think we should do the ifindex-based xdp_redirect first and only add this port table later.

Agreed on all this; further optimization would be welcome, of course, after the basic implementation is in place.
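Program-side, the bpf_redirect()-as-return-statement pattern John describes above looks like this in a minimal cls_bpf sketch (the ifindex constant is made up, and error handling is omitted):

#include <linux/bpf.h>
#include "bpf_helpers.h"	/* SEC(), bpf_redirect(), as in samples/bpf */

#define TARGET_IFINDEX 4	/* made-up egress ifindex */

SEC("classifier")
int tc_redir(struct __sk_buff *skb)
{
	/* records the target in the per-cpu scratch area and returns
	 * TC_ACT_REDIRECT, which makes the core call skb_do_redirect() */
	return bpf_redirect(TARGET_IFINDEX, 0);
}

char _license[] SEC("license") = "GPL";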
xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
On Thu, Apr 20, 2017 at 12:51:31AM +0200, Daniel Borkmann wrote:
>
> Is there a concrete reason that all the proposed future cases like sockets have to be handled within the very same XDP_REDIRECT return code? F.e. why not XDP_TX_NIC that only assumes ifindex as proposed in the patch, and future ones would get a different return code f.e. XDP_TX_SK only handling sockets when we get there implementation-wise?

yeah. let's keep redirect to sockets, tunnels, crypto and exotic things out of this discussion. XDP_REDIRECT should assume an L2 raw packet is being redirected to another L2 netdev. If we make it too generic it will lose performance.

For cls_bpf the ifindex concept is symmetric. The program can access it as skb->ifindex on receive and can redirect to another ifindex via the bpf_redirect() helper. Since the netdev is not locked, it's actually a big plus, since the container management control plane can simply delete netns+veth and it goes away. The program can have a dangling ifindex (if the control plane is buggy and didn't update the bpf side), but it's harmless. Packets that redirect to a non-existing ifindex are dropped. This approach is already understood and works well, so for XDP I suggest to use the same approach initially before starting to reinvent the wheel. struct xdp_md needs an ifindex field and an xdp_redirect() helper that redirects to an L2 netdev only. That's it. Simple and easy. I think the main use case in John's and Jesper's minds is something like xdpswitch, where packets are coming from VMs and from physical eths and being redirected to either a physical eth or to a VM via the upcoming vhost+xdp support. This xdp_md->ifindex + xdp_redirect(to_ifindex) will solve it just fine.

Once we have vhost+xdp and all other bits implemented, we must come back to this discussion about having a port mapping table. As I mentioned during netconf I think it's very useful, but I don't think we should gate the vhost+xdp and xdp_redirect work on this discussion. As far as this port mapping table goes, we would need a 'port' field in xdp_md as well, and xdp_redirect_port() done via a helper or via an extra 'out_port' field in xdp_md plus another XDP_REDIRECT_PORT action code.

The actual port table (array) should be populated by user space with netdevs, and these netdevs will have their refcnt incremented. Then we'll have a discussion about what to do with netdev_unregister notifiers, whether they should be auto-removed from the port table or bpf should have a chance to be notified and act on it. Such a port mapping will allow us to optimize the inevitable call dev_get_by_index_rcu(dev_net(skb->dev), ri->ifindex); away, since netdevs will be stored in the port table and a direct deref of port_map_array[xdp_md->out_port] will give us the target netdev quickly.

It's a nice optimization and there are other more powerful optimizations we can do with such a port table (since we will know in advance which netdevs the program will be redirecting to), but I still think we should do the ifindex-based xdp_redirect first and only add this port table later.
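A sketch of what the "extra out_port field" variant of this proposal could look like from the program side; the out_port field, the XDP_REDIRECT_PORT action code and the egress port value are all hypothetical here:

#include <linux/bpf.h>
#include "bpf_helpers.h"	/* SEC(), as in samples/bpf */

#define XDP_REDIRECT_PORT 5	/* proposed new action code, value made up */

SEC("xdp")
int xdp_switch(struct xdp_md *ctx)
{
	__u32 egress_port = 2;		/* would come from real switching logic */

	ctx->out_port = egress_port;	/* proposed extra field in xdp_md */
	return XDP_REDIRECT_PORT;
}

char _license[] SEC("license") = "GPL";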