Re: [PATCH net-next v3 3/4] bpf: BPF for lightweight tunnel infrastructure

2016-11-30 Thread John Fastabend
On 16-11-29 09:37 PM, Alexei Starovoitov wrote:
> On Tue, Nov 29, 2016 at 06:52:36PM -0800, John Fastabend wrote:
>> On 16-11-29 04:15 PM, Alexei Starovoitov wrote:
>>> On Tue, Nov 29, 2016 at 02:21:22PM +0100, Thomas Graf wrote:
 Registers new BPF program types which correspond to the LWT hooks:
   - BPF_PROG_TYPE_LWT_IN   => dst_input()
   - BPF_PROG_TYPE_LWT_OUT  => dst_output()
   - BPF_PROG_TYPE_LWT_XMIT => lwtunnel_xmit()

 The separate program types are required to differentiate between the
 capabilities each LWT hook allows:

  * Programs attached to dst_input() or dst_output() are restricted and
may only read the data of an skb. This prevent modification and
possible invalidation of already validated packet headers on receive
and the construction of illegal headers while the IP headers are
still being assembled.

  * Programs attached to lwtunnel_xmit() are allowed to modify packet
content as well as prepending an L2 header via a newly introduced
helper bpf_skb_push(). This is safe as lwtunnel_xmit() is invoked
after the IP header has been assembled completely.

 All BPF programs receive an skb with L3 headers attached and may return
 one of the following error codes:

  BPF_OK - Continue routing as per nexthop
  BPF_DROP - Drop skb and return EPERM
  BPF_REDIRECT - Redirect skb to device as per redirect() helper.
 (Only valid in lwtunnel_xmit() context)

 The return codes are binary compatible with their TC_ACT_
 relatives to ease compatibility.

 Signed-off-by: Thomas Graf 
>>> ...
 +#define LWT_BPF_MAX_HEADROOM 128
>>>
>>> why 128?
>>> btw I'm thinking for XDP to use 256, so metadata can be stored in there.
>>>
>>
>> hopefully not too off-topic but for XDP I would like to see this get
> 
> definitely off-topic. lwt->headroom is existing concept. Too late
> to do anything about it.
> 
>> passed down with the program. It would be more generic and drivers could
>> configure the headroom on demand and more importantly verify that a
>> program pushing data is not going to fail at runtime.
> 
> For xdp I think it will be problematic, since we'd have to check for
> that value at prog array access to make sure tailcalls are not broken.
> Mix and match won't be possible.
> So what does 'configure the headroom on demand' buys us?
> Isn't it much easier to tell all drivers "always reserve this much" ?
> We burn the page anyway.
> If it's configurable per driver, then we'd need an api
> to retrieve it. Yet the program author doesn't care what the value is.
> If program needs to do udp encap, it will try do it. No matter what.
> If xdp_adjust_head() helper fails, the program will likely decide
> to drop the packet. In some cases it may decide to punt to stack
> for further processing, but for high performance dataplane code
> it's highly unlikely.
> If it's configurable to something that is not cache line boundary
> hw dma performance may suffer and so on.
> So I see only cons in such 'configurable headroom' and propose
> to have fixed 256 bytes headroom for XDP
> which is enough for any sensible encap and metadata.
> 

OK I'm convinced let it be fixed at some conservative value.


Re: [PATCH net-next v3 3/4] bpf: BPF for lightweight tunnel infrastructure

2016-11-30 Thread Thomas Graf
On 11/29/16 at 11:01pm, Alexei Starovoitov wrote:
> On Wed, Nov 30, 2016 at 07:48:51AM +0100, Thomas Graf wrote:
> > Should we check in __bpf_redirect_common() whether mac_header <
> > nework_header then or add it to lwt-bpf conditional on
> > dev_is_mac_header_xmit()?
> 
> may be only extra 'if' in lwt-bpf is all we need?

Agreed, I will add a mac_header < network_header check to lwt-bpf if we
redirect to an l2 device.

> I'm still missing what will happen if we 'forget' to do
> bpf_skb_push() inside the lwt-bpf program, but still do redirect
> in lwt_xmit stage to l2 netdev...

The same as for a AF_PACKET socket not providing an actual L2 header.
I will add a test case to cover this scenario as well.


Re: [PATCH net-next v3 3/4] bpf: BPF for lightweight tunnel infrastructure

2016-11-29 Thread Alexei Starovoitov
On Wed, Nov 30, 2016 at 07:48:51AM +0100, Thomas Graf wrote:
> On 11/29/16 at 04:15pm, Alexei Starovoitov wrote:
> > On Tue, Nov 29, 2016 at 02:21:22PM +0100, Thomas Graf wrote:
> > ...
> > > +#define LWT_BPF_MAX_HEADROOM 128
> > 
> > why 128?
> > btw I'm thinking for XDP to use 256, so metadata can be stored in there.
> 
> It's an arbitrary limit to catch obvious misconfiguration. I'm absolutely
> fine with bumping it to 256.
> 
> > > +static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
> > > +struct dst_entry *dst, bool can_redirect)
> > > +{
> > > + int ret;
> > > +
> > > + /* Preempt disable is needed to protect per-cpu redirect_info between
> > > +  * BPF prog and skb_do_redirect(). The call_rcu in bpf_prog_put() and
> > > +  * access to maps strictly require a rcu_read_lock() for protection,
> > > +  * mixing with BH RCU lock doesn't work.
> > > +  */
> > > + preempt_disable();
> > > + rcu_read_lock();
> > > + bpf_compute_data_end(skb);
> > > + ret = BPF_PROG_RUN(lwt->prog, skb);
> > > + rcu_read_unlock();
> > > +
> > > + switch (ret) {
> > > + case BPF_OK:
> > > + break;
> > > +
> > > + case BPF_REDIRECT:
> > > + if (!can_redirect) {
> > > + WARN_ONCE(1, "Illegal redirect return code in prog 
> > > %s\n",
> > > +   lwt->name ? : "");
> > > + ret = BPF_OK;
> > > + } else {
> > > + ret = skb_do_redirect(skb);
> > 
> > I think this assumes that program did bpf_skb_push and L2 header is present.
> > Would it make sense to check that mac_header < network_header here to make
> > sure that it actually happened? I think the cost of single 'if' isn't much.
> > Also skb_do_redirect() can redirect to l3 tunnels like ipip ;)
> > so program shouldn't be doing bpf_skb_push in such case...
> 
> We are currently guaranteeing mac_header <= network_header given that
> bpf_skb_push() is calling skb_reset_mac_header() unconditionally.
> 
> Even if a program were to push an L2 header and then redirect to an l3
> tunnel, __bpf_redirect_no_mac will pull it off again and correct the
> mac_header offset.

yes. that part is fine.

> Should we check in __bpf_redirect_common() whether mac_header <
> nework_header then or add it to lwt-bpf conditional on
> dev_is_mac_header_xmit()?

may be only extra 'if' in lwt-bpf is all we need?
I'm still missing what will happen if we 'forget' to do
bpf_skb_push() inside the lwt-bpf program, but still do redirect
in lwt_xmit stage to l2 netdev...



Re: [PATCH net-next v3 3/4] bpf: BPF for lightweight tunnel infrastructure

2016-11-29 Thread Thomas Graf
On 11/29/16 at 04:15pm, Alexei Starovoitov wrote:
> On Tue, Nov 29, 2016 at 02:21:22PM +0100, Thomas Graf wrote:
> ...
> > +#define LWT_BPF_MAX_HEADROOM 128
> 
> why 128?
> btw I'm thinking for XDP to use 256, so metadata can be stored in there.

It's an arbitrary limit to catch obvious misconfiguration. I'm absolutely
fine with bumping it to 256.

> > +static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
> > +  struct dst_entry *dst, bool can_redirect)
> > +{
> > +   int ret;
> > +
> > +   /* Preempt disable is needed to protect per-cpu redirect_info between
> > +* BPF prog and skb_do_redirect(). The call_rcu in bpf_prog_put() and
> > +* access to maps strictly require a rcu_read_lock() for protection,
> > +* mixing with BH RCU lock doesn't work.
> > +*/
> > +   preempt_disable();
> > +   rcu_read_lock();
> > +   bpf_compute_data_end(skb);
> > +   ret = BPF_PROG_RUN(lwt->prog, skb);
> > +   rcu_read_unlock();
> > +
> > +   switch (ret) {
> > +   case BPF_OK:
> > +   break;
> > +
> > +   case BPF_REDIRECT:
> > +   if (!can_redirect) {
> > +   WARN_ONCE(1, "Illegal redirect return code in prog 
> > %s\n",
> > + lwt->name ? : "");
> > +   ret = BPF_OK;
> > +   } else {
> > +   ret = skb_do_redirect(skb);
> 
> I think this assumes that program did bpf_skb_push and L2 header is present.
> Would it make sense to check that mac_header < network_header here to make
> sure that it actually happened? I think the cost of single 'if' isn't much.
> Also skb_do_redirect() can redirect to l3 tunnels like ipip ;)
> so program shouldn't be doing bpf_skb_push in such case...

We are currently guaranteeing mac_header <= network_header given that
bpf_skb_push() is calling skb_reset_mac_header() unconditionally.

Even if a program were to push an L2 header and then redirect to an l3
tunnel, __bpf_redirect_no_mac will pull it off again and correct the
mac_header offset.

Should we check in __bpf_redirect_common() whether mac_header <
nework_header then or add it to lwt-bpf conditional on
dev_is_mac_header_xmit()?

> May be rename bpf_skb_push to bpf_skb_push_l2 ?
> since it's doing skb_reset_mac_header(skb); at the end of it?
> Or it's probably better to use 'flags' argument to tell whether
> bpf_skb_push() should set mac_header or not ?

I added the flags for this purpose but left it unused for now. This
will allow to add a flag later to extend the l3 header prior to
pushing an l2 header, hence the current helper name. But even then,
we would always reset the mac header as well to ensure we never
leave an skb with mac_header > network_header. Are you seeing a
scenario where we would want to omit the mac header reset?

> Then this bit:
> 
> > +   case BPF_OK:
> > +   /* If the L3 header was expanded, headroom might be too
> > +* small for L2 header now, expand as needed.
> > +*/
> > +   ret = xmit_check_hhlen(skb);
> 
> will work fine as well...
> which probably needs "mac_header wasn't set" check? or it's fine?

Right. The check is not strictly required right now but is a safety net
to ensure that the skb passed to dst_neigh_output() which will add the
l2 header in the non-redirect case to always have sufficient headroom.
It will currently be a NOP as we are not allowing to extend the l3 header
in bpf_skb_push() yet. I left this in there to ensure that we are not
missing to add this later.


Re: [PATCH net-next v3 3/4] bpf: BPF for lightweight tunnel infrastructure

2016-11-29 Thread Alexei Starovoitov
On Tue, Nov 29, 2016 at 06:52:36PM -0800, John Fastabend wrote:
> On 16-11-29 04:15 PM, Alexei Starovoitov wrote:
> > On Tue, Nov 29, 2016 at 02:21:22PM +0100, Thomas Graf wrote:
> >> Registers new BPF program types which correspond to the LWT hooks:
> >>   - BPF_PROG_TYPE_LWT_IN   => dst_input()
> >>   - BPF_PROG_TYPE_LWT_OUT  => dst_output()
> >>   - BPF_PROG_TYPE_LWT_XMIT => lwtunnel_xmit()
> >>
> >> The separate program types are required to differentiate between the
> >> capabilities each LWT hook allows:
> >>
> >>  * Programs attached to dst_input() or dst_output() are restricted and
> >>may only read the data of an skb. This prevent modification and
> >>possible invalidation of already validated packet headers on receive
> >>and the construction of illegal headers while the IP headers are
> >>still being assembled.
> >>
> >>  * Programs attached to lwtunnel_xmit() are allowed to modify packet
> >>content as well as prepending an L2 header via a newly introduced
> >>helper bpf_skb_push(). This is safe as lwtunnel_xmit() is invoked
> >>after the IP header has been assembled completely.
> >>
> >> All BPF programs receive an skb with L3 headers attached and may return
> >> one of the following error codes:
> >>
> >>  BPF_OK - Continue routing as per nexthop
> >>  BPF_DROP - Drop skb and return EPERM
> >>  BPF_REDIRECT - Redirect skb to device as per redirect() helper.
> >> (Only valid in lwtunnel_xmit() context)
> >>
> >> The return codes are binary compatible with their TC_ACT_
> >> relatives to ease compatibility.
> >>
> >> Signed-off-by: Thomas Graf 
> > ...
> >> +#define LWT_BPF_MAX_HEADROOM 128
> > 
> > why 128?
> > btw I'm thinking for XDP to use 256, so metadata can be stored in there.
> > 
> 
> hopefully not too off-topic but for XDP I would like to see this get

definitely off-topic. lwt->headroom is existing concept. Too late
to do anything about it.

> passed down with the program. It would be more generic and drivers could
> configure the headroom on demand and more importantly verify that a
> program pushing data is not going to fail at runtime.

For xdp I think it will be problematic, since we'd have to check for
that value at prog array access to make sure tailcalls are not broken.
Mix and match won't be possible.
So what does 'configure the headroom on demand' buys us?
Isn't it much easier to tell all drivers "always reserve this much" ?
We burn the page anyway.
If it's configurable per driver, then we'd need an api
to retrieve it. Yet the program author doesn't care what the value is.
If program needs to do udp encap, it will try do it. No matter what.
If xdp_adjust_head() helper fails, the program will likely decide
to drop the packet. In some cases it may decide to punt to stack
for further processing, but for high performance dataplane code
it's highly unlikely.
If it's configurable to something that is not cache line boundary
hw dma performance may suffer and so on.
So I see only cons in such 'configurable headroom' and propose
to have fixed 256 bytes headroom for XDP
which is enough for any sensible encap and metadata.



Re: [PATCH net-next v3 3/4] bpf: BPF for lightweight tunnel infrastructure

2016-11-29 Thread John Fastabend
On 16-11-29 04:15 PM, Alexei Starovoitov wrote:
> On Tue, Nov 29, 2016 at 02:21:22PM +0100, Thomas Graf wrote:
>> Registers new BPF program types which correspond to the LWT hooks:
>>   - BPF_PROG_TYPE_LWT_IN   => dst_input()
>>   - BPF_PROG_TYPE_LWT_OUT  => dst_output()
>>   - BPF_PROG_TYPE_LWT_XMIT => lwtunnel_xmit()
>>
>> The separate program types are required to differentiate between the
>> capabilities each LWT hook allows:
>>
>>  * Programs attached to dst_input() or dst_output() are restricted and
>>may only read the data of an skb. This prevent modification and
>>possible invalidation of already validated packet headers on receive
>>and the construction of illegal headers while the IP headers are
>>still being assembled.
>>
>>  * Programs attached to lwtunnel_xmit() are allowed to modify packet
>>content as well as prepending an L2 header via a newly introduced
>>helper bpf_skb_push(). This is safe as lwtunnel_xmit() is invoked
>>after the IP header has been assembled completely.
>>
>> All BPF programs receive an skb with L3 headers attached and may return
>> one of the following error codes:
>>
>>  BPF_OK - Continue routing as per nexthop
>>  BPF_DROP - Drop skb and return EPERM
>>  BPF_REDIRECT - Redirect skb to device as per redirect() helper.
>> (Only valid in lwtunnel_xmit() context)
>>
>> The return codes are binary compatible with their TC_ACT_
>> relatives to ease compatibility.
>>
>> Signed-off-by: Thomas Graf 
> ...
>> +#define LWT_BPF_MAX_HEADROOM 128
> 
> why 128?
> btw I'm thinking for XDP to use 256, so metadata can be stored in there.
> 

hopefully not too off-topic but for XDP I would like to see this get
passed down with the program. It would be more generic and drivers could
configure the headroom on demand and more importantly verify that a
program pushing data is not going to fail at runtime.

.John


Re: [PATCH net-next v3 3/4] bpf: BPF for lightweight tunnel infrastructure

2016-11-29 Thread Alexei Starovoitov
On Tue, Nov 29, 2016 at 02:21:22PM +0100, Thomas Graf wrote:
> Registers new BPF program types which correspond to the LWT hooks:
>   - BPF_PROG_TYPE_LWT_IN   => dst_input()
>   - BPF_PROG_TYPE_LWT_OUT  => dst_output()
>   - BPF_PROG_TYPE_LWT_XMIT => lwtunnel_xmit()
> 
> The separate program types are required to differentiate between the
> capabilities each LWT hook allows:
> 
>  * Programs attached to dst_input() or dst_output() are restricted and
>may only read the data of an skb. This prevent modification and
>possible invalidation of already validated packet headers on receive
>and the construction of illegal headers while the IP headers are
>still being assembled.
> 
>  * Programs attached to lwtunnel_xmit() are allowed to modify packet
>content as well as prepending an L2 header via a newly introduced
>helper bpf_skb_push(). This is safe as lwtunnel_xmit() is invoked
>after the IP header has been assembled completely.
> 
> All BPF programs receive an skb with L3 headers attached and may return
> one of the following error codes:
> 
>  BPF_OK - Continue routing as per nexthop
>  BPF_DROP - Drop skb and return EPERM
>  BPF_REDIRECT - Redirect skb to device as per redirect() helper.
> (Only valid in lwtunnel_xmit() context)
> 
> The return codes are binary compatible with their TC_ACT_
> relatives to ease compatibility.
> 
> Signed-off-by: Thomas Graf 
...
> +#define LWT_BPF_MAX_HEADROOM 128

why 128?
btw I'm thinking for XDP to use 256, so metadata can be stored in there.

> +static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
> +struct dst_entry *dst, bool can_redirect)
> +{
> + int ret;
> +
> + /* Preempt disable is needed to protect per-cpu redirect_info between
> +  * BPF prog and skb_do_redirect(). The call_rcu in bpf_prog_put() and
> +  * access to maps strictly require a rcu_read_lock() for protection,
> +  * mixing with BH RCU lock doesn't work.
> +  */
> + preempt_disable();
> + rcu_read_lock();
> + bpf_compute_data_end(skb);
> + ret = BPF_PROG_RUN(lwt->prog, skb);
> + rcu_read_unlock();
> +
> + switch (ret) {
> + case BPF_OK:
> + break;
> +
> + case BPF_REDIRECT:
> + if (!can_redirect) {
> + WARN_ONCE(1, "Illegal redirect return code in prog 
> %s\n",
> +   lwt->name ? : "");
> + ret = BPF_OK;
> + } else {
> + ret = skb_do_redirect(skb);

I think this assumes that program did bpf_skb_push and L2 header is present.
Would it make sense to check that mac_header < network_header here to make
sure that it actually happened? I think the cost of single 'if' isn't much.
Also skb_do_redirect() can redirect to l3 tunnels like ipip ;)
so program shouldn't be doing bpf_skb_push in such case...
May be rename bpf_skb_push to bpf_skb_push_l2 ?
since it's doing skb_reset_mac_header(skb); at the end of it?
Or it's probably better to use 'flags' argument to tell whether
bpf_skb_push() should set mac_header or not ? Then this bit:

> + case BPF_OK:
> + /* If the L3 header was expanded, headroom might be too
> +  * small for L2 header now, expand as needed.
> +  */
> + ret = xmit_check_hhlen(skb);

will work fine as well...
which probably needs "mac_header wasn't set" check? or it's fine?

All bpf bits look great. Thanks!