Re: [iovisor-dev] bpf_redirect_map not working after tail call

2018-06-04 Thread Daniel Borkmann via iovisor-dev
On 06/04/2018 01:04 PM, Jesper Dangaard Brouer via iovisor-dev wrote:
> On Fri, 1 Jun 2018 14:15:58 +0200
> Sebastiano Miano via iovisor-dev  wrote:
> 
>> Dear all,
>>
>> We have noticed that bpf_redirect_map returns an error when it is
>> called after a tail call.
>> The xdp_redirect_map program (under samples/bpf) works fine, but if we
>> modify it as shown in the following diff, it doesn't work anymore.
>> I have debugged it with the xdp_monitor application and the error
>> returned is EFAULT.
>> Is this a known issue? Am I doing something wrong?
> 
> Argh, this is likely an issue/bug due to the check xdp_map_invalid(),
> that was introduced in commit 7c3001313396 ("bpf: fix ri->map_owner
> pointer on bpf_prog_realloc").
> 
> To Daniel, I don't know how to solve this, could you give some advice?
> 
> 
> 
>  static inline bool xdp_map_invalid(const struct bpf_prog *xdp_prog,
>  unsigned long aux)
>  {
>   return (unsigned long)xdp_prog->aux != aux;
>  }
> 
>  static int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp,
>  struct bpf_prog *xdp_prog)
>  {
>   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
>   unsigned long map_owner = ri->map_owner;
>   struct bpf_map *map = ri->map;
>   u32 index = ri->ifindex;
>   void *fwd = NULL;
>   int err;
> 
>   [...]
>   if (unlikely(xdp_map_invalid(xdp_prog, map_owner))) {
>   err = -EFAULT;
>   map = NULL;
>   goto err;
>   }
>   [...]

Argh, I see the issue. Working on a fix after checking the syzkaller reports.

Thanks for the report!
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] [PATCH bpf-next v5 0/2] bpf: allow map helpers access to map values directly

2018-04-24 Thread Daniel Borkmann via iovisor-dev
On 04/24/2018 03:06 PM, Paul Chaignon wrote:
> Currently, helpers that expect ARG_PTR_TO_MAP_KEY and ARG_PTR_TO_MAP_VALUE
> can only access stack and packet memory.  This patchset allows these
> helpers to directly access map values by passing registers of type
> PTR_TO_MAP_VALUE.
> 
> The first patch changes the verifier; the second adds new test cases.
> 
> The first three versions of this patchset were sent on the iovisor-dev
> mailing list only.
> 
> Changelogs:
>   Changes in v5:
> - Refactor using check_helper_mem_access.
>   Changes in v4:
> - Rebase.
>   Changes in v3:
> - Bug fixes.
> - Negative test cases.
>   Changes in v2:
> - Additional test cases for adjusted maps.

Applied to bpf-next, thanks Paul!



Re: [iovisor-dev] [PATCH bpf-next v4 1/2] bpf: allow map helpers access to map values directly

2018-04-23 Thread Daniel Borkmann via iovisor-dev
On 04/22/2018 11:52 PM, Paul Chaignon wrote:
> Helpers that expect ARG_PTR_TO_MAP_KEY and ARG_PTR_TO_MAP_VALUE can only
> access stack and packet memory.  Allow these helpers to directly access
> map values by passing registers of type PTR_TO_MAP_VALUE.
> 
> This change removes the need for an extra copy to the stack when using a
> map value to perform a second map lookup, as in the following:
> 
> struct bpf_map_def SEC("maps") infobyreq = {
> .type = BPF_MAP_TYPE_HASH,
> .key_size = sizeof(struct request *),
> .value_size = sizeof(struct info_t),
> .max_entries = 1024,
> };
> struct bpf_map_def SEC("maps") counts = {
> .type = BPF_MAP_TYPE_HASH,
> .key_size = sizeof(struct info_t),
> .value_size = sizeof(u64),
> .max_entries = 1024,
> };
> SEC("kprobe/blk_account_io_start")
> int bpf_blk_account_io_start(struct pt_regs *ctx)
> {
> struct info_t *info = bpf_map_lookup_elem(&infobyreq, &ctx->di);
> u64 *count = bpf_map_lookup_elem(&counts, info);
> (*count)++;
> }
> 
> Signed-off-by: Paul Chaignon 
> ---
>  kernel/bpf/verifier.c | 9 -
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 5dd1dcb902bf..70e00beade03 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -1914,7 +1914,7 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
>   if (arg_type == ARG_PTR_TO_MAP_KEY ||
>   arg_type == ARG_PTR_TO_MAP_VALUE) {
>   expected_type = PTR_TO_STACK;
> - if (!type_is_pkt_pointer(type) &&
> + if (!type_is_pkt_pointer(type) && type != PTR_TO_MAP_VALUE &&
>   type != expected_type)
>   goto err_type;
>   } else if (arg_type == ARG_CONST_SIZE ||
> @@ -1970,6 +1970,9 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
>   err = check_packet_access(env, regno, reg->off,
> meta->map_ptr->key_size,
> false);
> + else if (type == PTR_TO_MAP_VALUE)
> + err = check_map_access(env, regno, reg->off,
> +        meta->map_ptr->key_size, false);
>   else
>   err = check_stack_boundary(env, regno,
>  meta->map_ptr->key_size,

We should reuse check_helper_mem_access() here which covers all three cases
from above already and simplifies the code a bit.

> @@ -1987,6 +1990,10 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
>   err = check_packet_access(env, regno, reg->off,
> meta->map_ptr->value_size,
> false);
> + else if (type == PTR_TO_MAP_VALUE)
> + err = check_map_access(env, regno, reg->off,
> +        meta->map_ptr->value_size,
> +        false);
>   else
>   err = check_stack_boundary(env, regno,
>  meta->map_ptr->value_size,
> 

Ditto.

Thanks,
Daniel


Re: [iovisor-dev] bpftool binary size

2018-04-18 Thread Daniel Borkmann via iovisor-dev
On 04/18/2018 10:03 PM, Jesper Dangaard Brouer via iovisor-dev wrote:
[...]
> What does bpftool use BFD for?

It's used for the prog JIT dump in order to disassemble the instructions.


Re: [iovisor-dev] Best userspace programming API for XDP features query to kernel?

2018-04-10 Thread Daniel Borkmann via iovisor-dev
On 04/09/2018 01:26 PM, Jesper Dangaard Brouer wrote:
> On Fri, 6 Apr 2018 12:36:18 +0200
> Daniel Borkmann  wrote:
[...]
>> [...]
>> The underlying problem is effectively the decoupling of program
>> verification that doesn't have/know the context of where it is being
>> attached to in this case. 
> 
> Yes, exactly, that's the underlying problem. (Plus the first XDP prog
> gets verified and driver-attached, and subsequently added bpf tail calls
> cannot be "rejected" (at "driver attachment time") as it's too late.)
> 
>> Thinking out loud for a sec on couple of other options aside
>> from feature bits, what about i) providing the target ifindex to the
>> verifier for XDP programs, such that at verification time you have the
>> full context similar to nfp offload case today, 
> 
> I do like that we could provide the target ifindex earlier.  But
> userspace still needs some way to express that it e.g. needs the
> XDP_REDIRECT feature (as the verifier cannot reliably detect the action
> return codes used, as discussed before, due to bpf tail calls and maps
> used as return values).

(See below on the detection of it.)

>> or ii) populating some
>> XDP specific auxiliary data to the BPF program at verification time such
>> that the driver can check at program attach time whether the requested
>> features are possible and if not it will reject and respond with netlink
>> extack message to the user (as we do in various XDP attach cases already
>> through XDP_SETUP_PROG{,_HW}).
> 
> I like proposal ii) better.  But how do I specify/input that I need
> e.g. the XDP_REDIRECT feature, such that it is available to "the BPF
> program at verification time"?
> 
> My proposal iii): what about, at XDP attachment time, creating a new
> netlink attribute that describes the XDP action codes the program
> needs/wants? If the info is provided, the ndo_bpf call can check and
> reject, and respond with a netlink extack message.
>   If I want to query for available action codes, then I can use the same
> netlink attribute format, and the kernel will return it "populated" with
> the info.
> 
> It is very useful that the program gets rejected at attachment time (to
> avoid loading an XDP prog that silently drops packets). BUT I also
> want a query option/functionality (reuse netlink attr format).
> 
> Specifically for Suricata the same "bypass" features can be implemented
> at different levels (XDP, XDP-generic or clsbpf), and the XDP program
> contains multiple features.  Thus, depending on what NIC driver
> supports, we want to load different XDP and/or clsbpf/TC BPF-programs.
> Thus, still support the same user requested feature/functionality, even
> if XDP_REDIRECT is not avail, just slightly slower.

Makes sense to have such fallbacks.

>> This would, for example, avoid the need for feature bits, and do actual
>> rejection of the program while retaining flexibility (and avoiding to
>> expose bits that over time hopefully will deprecate anyway due to all
>> XDP aware drivers implementing them). For both cases i) and ii), it
>> would mean we make the verifier a bit smarter with regards to keeping
>> track of driver related (XDP) requirements. Program return code checking
>> is already present in the verifier's check_return_code() and we could
>> extend it for XDP as well, for example. Seems cleaner and more extensible
>> than feature bits, imho.
> 
> I thought the verifier's check_return_code() couldn't see/verify if the
> XDP return action code is provided as a map lookup result(?). How is
> that handled?

For the latter, I would just add a BPF_F_STRICT_CONST_VERDICT load flag
which e.g. enforces a constant return code in all prog types. It also needs
to check for helpers like bpf_xdp_redirect() and track R0 from there, so
that it contains either XDP_REDIRECT or XDP_ABORTED.

For the bpf_prog_aux, we need a prog-dependent void *private_data area
pointer that the verifier populates; e.g. we could migrate
already some of the prog type specific fields into that like kprobe_override,
cb_access, dst_needed that are non-generic anyway. For XDP, verifier would
in your case record all seen return codes into private_data. When the flag
BPF_F_STRICT_CONST_VERDICT is not used and we noticed there were cases
where the verdict was not a verifiable constant, then we e.g. mark all
XDP return codes as seen. Potentially the same is needed for tail calls.

We can add an ndo_bpf query to drivers that returns their supported XDP return
codes and compare them in e.g. dev_change_xdp_fd() out of generic code and
reject with extack.

For tail calls, the only way that comes to mind right now where you could
lift that requirement with having to mark all return codes as seen is that
you'd need to pass the ifindex as in the offload case at load time, such that
the program becomes tied to the device. Then you also need to record the
dev pointer in the private_data such that you can add a new callback to
bpf_prog_ops for prog dependent compare in 

[iovisor-dev] Best userspace programming API for XDP features query to kernel?

2018-04-06 Thread Daniel Borkmann via iovisor-dev
On 04/05/2018 10:51 PM, Jesper Dangaard Brouer wrote:
> On Thu, 5 Apr 2018 12:37:19 +0200
> Daniel Borkmann  wrote:
> 
>> On 04/04/2018 02:28 PM, Jesper Dangaard Brouer via iovisor-dev wrote:
>>> Hi Suricata people,
>>>
>>> When Eric Leblond (and I helped) integrated XDP in Suricata, we ran
>>> into the issue, that at Suricata load/start time, we cannot determine
>>> if the chosen XDP config options, like xdp-cpu-redirect[1], is valid on
>>> this HW (e.g require driver XDP_REDIRECT support and bpf cpumap).
>>>
>>> We would have liked a way to report that suricata.yaml config was
>>> invalid for this hardware/setup.  Now, it just loads, and packets get
>>> silently dropped by XDP (well, a WARN_ONCE, and catchable via tracepoints).
>>>
>>> My question to suricata developers: (Q1) Do you already have code that
>>> queries the kernel or drivers for features?
>>>
>>> At the IOvisor call (2 weeks ago), we discussed two options of exposing
>>> XDP features avail in a given driver.
>>>
>>> Option#1: Extend existing ethtool -k/-K "offload and other features"
>>> with some XDP features, that userspace can query. (Do you already query
>>> offloads, regarding Q1)
>>>
>>> Option#2: Invent a new 'ip link set xdp' netlink msg with a query option.  
>>
>> I don't really mind if you go via ethtool, as long as we handle this
>> generically from there and e.g. call the dev's ndo_bpf handler such that
>> we keep all the information in one place. This can be a new ndo_bpf command
>> e.g. XDP_QUERY_FEATURES or such.
> 
> Just to be clear: notice, as Victor points out[2], they are programming it
> going through the IOCTL (SIOCETHTOOL) and not using cmdline tools.

Sure, that was perfectly clear. (But at the same time if you extend the
ioctl, it's obvious to also add support to actual ethtool cmdline tool.)

> [2] https://github.com/OISF/suricata/blob/master/src/util-ioctl.c#L326
> 
> If you want everything to go through the drivers ndo_bpf call anyway
> (whose userspace API is netlink based) then at what point do you

Not really, that's the front end. ndo_bpf itself is a plain netdev op
and has no real tie to netlink.

> want drivers to call their own ndo_bpf, when activated though their
> ethtool_ops ? (Sorry, but I don't follow the flow you are proposing)
> 
> Notice, I'm not directly against using the drivers ndo_bpf call.  I can
> see it does provide kernel more flexibility than the ethtool IOCTL.

What I was saying is that even if you go via ethtool ioctl api, where
you end up in dev_ethtool() and have some new ETHTOOL_* query command,
then instead of adding a new ethtool_ops callback, we can and should
reuse ndo_bpf from there.

[...]
> Here, I want to discuss how drivers expose/tell userspace that they
> support a given feature: Specifically a bit for: XDP_REDIRECT action
> support.
> 
>> Same for meta data,
> 
> Well, not really.  It would be a "nice-to-have", but not strictly
> needed as a feature bit.  XDP meta-data is controlled via a helper,
> and the BPF-prog can detect/see at runtime that the helper bpf_xdp_adjust_meta
> returns -ENOTSUPP (and needs to check the ret value anyhow).  Thus,
> there is not that much gained by exposing this to be detected at setup
> time, as all drivers should eventually support this, and we can detect
> it at runtime.
> 
> The missing XDP_REDIRECT action feature bit is different, as the
> BPF-prog cannot detect at runtime that this is an unsupported action.
> Plus, at setup time we cannot query the driver for supported XDP actions.

Ok, so with the example of meta data, you're arguing that it's okay
to load a native XDP program onto a driver, and run actual traffic on
the NIC in order to probe for the availability of the feature, since you're
saying that it "can detect/see [at] runtime". I totally agree with you
that all drivers should eventually support this (same with XDP_REDIRECT),
but today there are even differences in drivers on bpf_xdp_adjust_meta()/
bpf_xdp_adjust_head() with regards to how much headroom they have available,
etc (e.g. some of them have none), so right now you can either go and
read the code or do a runtime test with running actual traffic through
the NIC to check whether your BPF prog is supported or not. Theoretically,
you can do the same runtime test with XDP_REDIRECT (taking the warn in
bpf_warn_invalid_xdp_action() aside for a moment), but you do have the
trace_xdp_exception() tracepoint to figure it out, yes, it's a painful
hassle, but overall, it's not that different as you were trying to argue
here. For /both/ cases it would be nice to know at setup time whether
this would be supported or not. Hence, such a query is not just limited to
XDP_REDIRECT alone. Eventually, once such an interface is agreed upon,
undoubtedly the list of feature bits will grow, is what I'm trying to say;
arguing only on XDP_REDIRECT here would be short-term.

[...]
>> What about keeping this high level to users? E.g. say you have 2 options
>> that drivers can expose as 

Re: [iovisor-dev] Best userspace programming API for XDP features query to kernel?

2018-04-05 Thread Daniel Borkmann via iovisor-dev
On 04/04/2018 02:28 PM, Jesper Dangaard Brouer via iovisor-dev wrote:
> Hi Suricata people,
> 
> When Eric Leblond (and I helped) integrated XDP in Suricata, we ran
> into the issue, that at Suricata load/start time, we cannot determine
> if the chosen XDP config options, like xdp-cpu-redirect[1], is valid on
> this HW (e.g require driver XDP_REDIRECT support and bpf cpumap).
> 
> We would have liked a way to report that suricata.yaml config was
> invalid for this hardware/setup.  Now, it just loads, and packets get
> silently dropped by XDP (well, a WARN_ONCE, and catchable via tracepoints).
> 
> My question to suricata developers: (Q1) Do you already have code that
> queries the kernel or drivers for features?
> 
> At the IOvisor call (2 weeks ago), we discussed two options of exposing
> XDP features avail in a given driver.
> 
> Option#1: Extend existing ethtool -k/-K "offload and other features"
> with some XDP features, that userspace can query. (Do you already query
> offloads, regarding Q1)
> 
> Option#2: Invent a new 'ip link set xdp' netlink msg with a query option.

I don't really mind if you go via ethtool, as long as we handle this
generically from there and e.g. call the dev's ndo_bpf handler such that
we keep all the information in one place. This can be a new ndo_bpf command
e.g. XDP_QUERY_FEATURES or such.

More specifically, what would such a feature mask look like? How fine-grained
would this be? When you add a new minor feature to, say, cpumap that not
all drivers support yet, we'd need a new flag each time, no? Same for meta data,
then potentially for the redirect memory return work, or the af_xdp bits, the
xdp_rxq_info would have needed it, etc. What about nfp in terms of XDP
offload capabilities, should they be included as well or is probing to load
the program and see if it loads/JITs as we do today just fine (e.g. you'd
otherwise end up with extra flags on a per BPF helper basis)? To make a
somewhat reliable assertion whether feature xyz would work, this would
explode in new feature bits long term. Additionally, if we end up with a
lot of feature flags, it will be very hard for users to determine whether
this particular set of features a driver supports actually represents a
fully supported native XDP driver.

What about keeping this high level to users? E.g. say you have 2 options
that drivers can expose as netdev_features_strings 'xdp-native-full' or
'xdp-native-partial'. If a driver truly supports all XDP features for a
given kernel e.g. v4.16, then a query like 'ethtool -k foo' will report
'xdp-native-full'; if at least one feature is missing for it to be feature-complete
from e.g. the above list, then ethtool will report 'xdp-native-partial'; and if
no ndo_bpf callback exists at all, then no 'xdp-native-*' is reported.

A side-effect might be that it would give incentive to keep drivers in the
'xdp-native-full' state instead of being downgraded to 'xdp-native-partial'.
Potentially, in the 'xdp-native-partial' state, we can expose a high-level
list of missing features that the driver does not support yet, which would
over time converge towards 'zero' and thus 'xdp-native-full' again. ethtool
itself could get a new XDP specific query option that, based on this info,
can then dump the full list of supported and not supported features. In order
for this to not explode, such features would need to be kept on a high-level
basis, meaning if e.g. cpumap gets extended along with support for a number
of drivers, then those that missed out would need to be temporarily
re-flagged with e.g. 'cpumap not supported' until it gets also implemented
there. That way, we don't explode in adding too fine-grained feature bit
combinations long term and make it easier to tell whether a driver supports
the full set in native XDP or not. Thoughts?

> (Q2) Do Suricata devs have any preference (or other options/ideas) for
> the way the kernel expose this info to userspace?
> 
> [1] 
> http://suricata.readthedocs.io/en/latest/capture-hardware/ebpf-xdp.html#the-xdp-cpu-redirect-case


Re: [iovisor-dev] Kernel crash when running bcc's test_xlate

2018-01-15 Thread Daniel Borkmann via iovisor-dev
On 01/15/2018 09:33 PM, Daniel Borkmann via iovisor-dev wrote:
> On 01/15/2018 09:13 PM, David Miller wrote:
>> From: Daniel Borkmann via iovisor-dev <iovisor-dev@lists.iovisor.org>
>> Date: Mon, 15 Jan 2018 21:11:06 +0100
>>
>>> On 01/15/2018 06:24 PM, Daniel Borkmann via iovisor-dev wrote:
>>>> On 01/11/2018 07:05 AM, Sandipan Das via iovisor-dev wrote:
>>>> [...]
>>>>> Any ideas about why this is happening? Also, it would be really helpful
>>>>> if someone can translate the Pyroute2 calls in the test script to the
>>>>> corresponding tc commands.
>>>>
>>>> I can trigger it as well, so I should have something very soon; currently
>>>> looking into it.
>>>
>>> Here's the fix for your panic ... cooking up proper patch with commit
>>> description for netdev now:
>>>
>>> From 126dd0c1ecb7c207f79976c6c3ef13be08afe6b5 Mon Sep 17 00:00:00 2001
>>> From: Daniel Borkmann <dan...@iogearbox.net>
>>> Date: Mon, 15 Jan 2018 19:48:28 +
>>> Subject: [PATCH net] net, sched: fix miniq {b,q}stats handling
>>>
>>> Signed-off-by: Daniel Borkmann <dan...@iogearbox.net>
>>
>> Is this related to:
>>
>> http://patchwork.ozlabs.org/patch/860647/
>>
>> ?
> 
> Ahh, yes, I didn't see that one. The fix is needed in net tree however, since
> it's broken there already (v4.15-rc1 onwards). I was using an arm64 machine
> with simple clsact and some filter attached to it, so as soon as this hits
> traffic, we get the panic since {b,q}stats are allocated /after/ setting them
> up in miniq init.
> 
> We could use some form of d59f5ffa59d8 ("net: sched: a dflt qdisc may be
> used with per cpu stats") for net and get rid of the per-cpu alloc from
> qdisc_create(), since it's not needed there anymore.

Basically like this for net to reduce merge churn later on:

 include/net/sch_generic.h |  2 ++
 net/sched/sch_api.c   | 15 +--
 net/sched/sch_generic.c   | 18 +-
 net/sched/sch_ingress.c   | 19 ---
 4 files changed, 24 insertions(+), 30 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 83a3e47..becf86a 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -179,6 +179,7 @@ struct Qdisc_ops {
    const struct Qdisc_class_ops *cl_ops;
    char         id[IFNAMSIZ];
    int          priv_size;
+   unsigned int static_flags;

int (*enqueue)(struct sk_buff *skb,
   struct Qdisc *sch,
@@ -444,6 +445,7 @@ void qdisc_tree_reduce_backlog(struct Qdisc *qdisc, unsigned int n,
   unsigned int len);
 struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
  const struct Qdisc_ops *ops);
+void qdisc_free(struct Qdisc *qdisc);
 struct Qdisc *qdisc_create_dflt(struct netdev_queue *dev_queue,
const struct Qdisc_ops *ops, u32 parentid);
 void __qdisc_calculate_pkt_len(struct sk_buff *skb,
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 0f1eab9..52529b7 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1063,17 +1063,6 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
}

if (!ops->init || (err = ops->init(sch, tca[TCA_OPTIONS])) == 0) {
-   if (qdisc_is_percpu_stats(sch)) {
-   sch->cpu_bstats =
-   netdev_alloc_pcpu_stats(struct gnet_stats_basic_cpu);
-   if (!sch->cpu_bstats)
-   goto err_out4;
-
-   sch->cpu_qstats = alloc_percpu(struct gnet_stats_queue);
-   if (!sch->cpu_qstats)
-   goto err_out4;
-   }
-
if (tca[TCA_STAB]) {
stab = qdisc_get_stab(tca[TCA_STAB]);
if (IS_ERR(stab)) {
@@ -1115,7 +1104,7 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
ops->destroy(sch);
 err_out3:
dev_put(dev);
-   kfree((char *) sch - sch->padded);
+   qdisc_free(sch);
 err_out2:
module_put(ops->owner);
 err_out:
@@ -1123,8 +1112,6 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
return NULL;

 err_out4:
-   free_percpu(sch->cpu_bstats);
-   free_percpu(sch->cpu_qstats);
/*
 * Any broken qdiscs that would require a ops->reset() here?
 * The qdisc was never in action so it shouldn't be necessary.
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 661c714..cac003f 100644
--- a/net/

Re: [iovisor-dev] Kernel crash when running bcc's test_xlate

2018-01-15 Thread Daniel Borkmann via iovisor-dev
On 01/15/2018 09:13 PM, David Miller wrote:
> From: Daniel Borkmann via iovisor-dev <iovisor-dev@lists.iovisor.org>
> Date: Mon, 15 Jan 2018 21:11:06 +0100
> 
>> On 01/15/2018 06:24 PM, Daniel Borkmann via iovisor-dev wrote:
>>> On 01/11/2018 07:05 AM, Sandipan Das via iovisor-dev wrote:
>>> [...]
>>>> Any ideas about why this is happening? Also, it would be really helpful
>>>> if someone can translate the Pyroute2 calls in the test script to the
>>>> corresponding tc commands.
>>>
>>> I can trigger it as well, so I should have something very soon; currently
>>> looking into it.
>>
>> Here's the fix for your panic ... cooking up proper patch with commit
>> description for netdev now:
>>
>> From 126dd0c1ecb7c207f79976c6c3ef13be08afe6b5 Mon Sep 17 00:00:00 2001
>> From: Daniel Borkmann <dan...@iogearbox.net>
>> Date: Mon, 15 Jan 2018 19:48:28 +
>> Subject: [PATCH net] net, sched: fix miniq {b,q}stats handling
>>
>> Signed-off-by: Daniel Borkmann <dan...@iogearbox.net>
> 
> Is this related to:
> 
> http://patchwork.ozlabs.org/patch/860647/
> 
> ?

Ahh, yes, I didn't see that one. The fix is needed in net tree however, since
it's broken there already (v4.15-rc1 onwards). I was using an arm64 machine
with simple clsact and some filter attached to it, so as soon as this hits
traffic, we get the panic since {b,q}stats are allocated /after/ setting them
up in miniq init.

We could use some form of d59f5ffa59d8 ("net: sched: a dflt qdisc may be
used with per cpu stats") for net and get rid of the per-cpu alloc from
qdisc_create(), since it's not needed there anymore.


Re: [iovisor-dev] Kernel crash when running bcc's test_xlate

2018-01-15 Thread Daniel Borkmann via iovisor-dev
On 01/15/2018 06:24 PM, Daniel Borkmann via iovisor-dev wrote:
> On 01/11/2018 07:05 AM, Sandipan Das via iovisor-dev wrote:
> [...]
>> Any ideas about why this is happening? Also, it would be really helpful
>> if someone can translate the Pyroute2 calls in the test script to the
>> corresponding tc commands.
> 
> I can trigger it as well, so I should have something very soon; currently
> looking into it.

Here's the fix for your panic ... cooking up proper patch with commit
description for netdev now:

From 126dd0c1ecb7c207f79976c6c3ef13be08afe6b5 Mon Sep 17 00:00:00 2001
From: Daniel Borkmann <dan...@iogearbox.net>
Date: Mon, 15 Jan 2018 19:48:28 +
Subject: [PATCH net] net, sched: fix miniq {b,q}stats handling

Signed-off-by: Daniel Borkmann <dan...@iogearbox.net>
---
 net/sched/sch_api.c | 38 +++---
 net/sched/sch_ingress.c | 17 ++---
 2 files changed, 21 insertions(+), 34 deletions(-)

diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 0f1eab9..1593f81 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1034,7 +1034,7 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
sch->parent = parent;

if (handle == TC_H_INGRESS) {
-   sch->flags |= TCQ_F_INGRESS;
+   sch->flags |= TCQ_F_INGRESS | TCQ_F_CPUSTATS;
handle = TC_H_MAKE(TC_H_INGRESS, 0);
lockdep_set_class(qdisc_lock(sch), _rx_lock);
} else {
@@ -1062,23 +1062,22 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
netdev_info(dev, "Caught tx_queue_len zero misconfig\n");
}

-   if (!ops->init || (err = ops->init(sch, tca[TCA_OPTIONS])) == 0) {
-   if (qdisc_is_percpu_stats(sch)) {
-   sch->cpu_bstats =
-   netdev_alloc_pcpu_stats(struct gnet_stats_basic_cpu);
-   if (!sch->cpu_bstats)
-   goto err_out4;
-
-   sch->cpu_qstats = alloc_percpu(struct gnet_stats_queue);
-   if (!sch->cpu_qstats)
-   goto err_out4;
-   }
+   if (qdisc_is_percpu_stats(sch)) {
+   sch->cpu_bstats = netdev_alloc_pcpu_stats(struct gnet_stats_basic_cpu);
+   if (!sch->cpu_bstats)
+   goto err_out4;
+
+   sch->cpu_qstats = alloc_percpu(struct gnet_stats_queue);
+   if (!sch->cpu_qstats)
+   goto err_out4;
+   }

+   if (!ops->init || (err = ops->init(sch, tca[TCA_OPTIONS])) == 0) {
if (tca[TCA_STAB]) {
stab = qdisc_get_stab(tca[TCA_STAB]);
if (IS_ERR(stab)) {
err = PTR_ERR(stab);
-   goto err_out4;
+   goto err_out5;
}
rcu_assign_pointer(sch->stab, stab);
}
@@ -1087,7 +1086,7 @@ static struct Qdisc *qdisc_create(struct net_device *dev,

err = -EOPNOTSUPP;
if (sch->flags & TCQ_F_MQROOT)
-   goto err_out4;
+   goto err_out5;

if ((sch->parent != TC_H_ROOT) &&
!(sch->flags & TCQ_F_INGRESS) &&
@@ -1103,7 +1102,7 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
running,
tca[TCA_RATE]);
if (err)
-   goto err_out4;
+   goto err_out5;
}

qdisc_hash_add(sch, false);
@@ -1113,6 +1112,9 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
/* ops->init() failed, we call ->destroy() like qdisc_create_dflt() */
if (ops->destroy)
ops->destroy(sch);
+err_out4:
+   free_percpu(sch->cpu_bstats);
+   free_percpu(sch->cpu_qstats);
 err_out3:
dev_put(dev);
kfree((char *) sch - sch->padded);
@@ -1122,9 +1124,7 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
*errp = err;
return NULL;

-err_out4:
-   free_percpu(sch->cpu_bstats);
-   free_percpu(sch->cpu_qstats);
+err_out5:
/*
 * Any broken qdiscs that would require a ops->reset() here?
 * The qdisc was never in action so it shouldn't be necessary.
@@ -1132,7 +1132,7 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
qdisc_put_stab(rtnl_dereference(sch->stab));
if (ops->destroy)
ops->destroy(sch);
-   goto err_out3;
+   goto err_out4;

Re: [iovisor-dev] Kernel crash when running bcc's test_xlate

2018-01-15 Thread Daniel Borkmann via iovisor-dev
On 01/11/2018 07:05 AM, Sandipan Das via iovisor-dev wrote:
[...]
> Any ideas about why this is happening? Also, it would be really helpful
> if someone can translate the Pyroute2 calls in the test script to the
> corresponding tc commands.

I can trigger it as well, so I should have something very soon; currently
looking into it.

Cheers,
Daniel


Re: [iovisor-dev] eBPF program to forward pkt from if-a to if-b

2017-12-20 Thread Daniel Borkmann via iovisor-dev
On 12/20/2017 02:32 AM, Y Song via iovisor-dev wrote:
> There are a few examples under examples/networking with
> bpf_clone_redirect helper to redirect a packet from one
> interface to another.

Right, and there's also bpf_redirect() directly out of cls_bpf where you
don't need to go and clone the skb.

> Right, socket_filter cannot modify packets; you may need to use a
> different program type, e.g. cls_act.
> 
> On Mon, Dec 18, 2017 at 6:23 AM, Avi Cohen (A) via iovisor-dev
>  wrote:
>> Hello,
>> 1. I don't find under bcc/examples a reference for a forwarding function,
>> e.g. capture a pkt on if-a, look for any match, and then forward to if-b.
>> Can you send a reference to a similar function?
>>
>> 2. In the module examples/networking/http-filter-simple.c - when I try to
>> write to the packet I receive an exception - do I have to set the socket to
>> something other than SOCKET_FILTER?
>>
>> Best Regards
>> Avi
> 



Re: [iovisor-dev] [PATCH RFC 0/4] Initial 32-bit eBPF encoding support

2017-09-21 Thread Daniel Borkmann via iovisor-dev

On 09/21/2017 08:56 PM, Alexei Starovoitov wrote:

On Wed, Sep 20, 2017 at 12:20:40AM +0100, Jiong Wang via iovisor-dev wrote:

On 18/09/2017 22:29, Daniel Borkmann wrote:

On 09/18/2017 10:47 PM, Jiong Wang wrote:

Hi,

Currently, the LLVM eBPF backend always generates code in 64-bit mode;
this may cause trouble when JITing to 32-bit targets.

For example, it is quite common for an XDP eBPF program to access
packet fields through base + offset. By default eBPF generates
BPF_ALU64 for the address formation, so when JITing to 32-bit hardware
BPF_ALU64 needs to be expanded into 32-bit ALU sequences even though
the address space is 32-bit and the high bits are not significant.

While a complete 32-bit mode implementation may need a new ABI
(something like -target-abi=ilp32), this patch set first adds some
initial code so we can construct 32-bit eBPF tests through
hand-written assembly.

A new 32-bit register set is introduced; its names use a "w" prefix,
and the LLVM assembler will encode statements like "w1 += w2" with the
following 8-bit opcode field:

  BPF_ADD | BPF_X | BPF_ALU

BPF_ALU will be used instead of BPF_ALU64.

NOTE: currently you can only use "w" registers with ALU statements,
not with others like branches, as those don't have a different
encoding for 32-bit targets.


Great to see work in this direction! Can we also enable the use / emission of
all the 32-bit BPF_ALU instructions whenever possible for the currently
available bpf targets while at it (which only use BPF_ALU64 right now)?


Hi Daniel,

Thanks for the feedback.

I think we could also enable the use of all the 32-bit BPF_ALU
instructions under the currently available bpf targets.  As we now have
32-bit register set support, we could make the i32 type legal to prevent
it from being promoted to i64, then hook it up with the i32 ALU
patterns; I will look into this.


I don't think we need to gate 32-bit ALU generation with a flag.
Though the interpreter and JITs have supported 32-bit since day one,
the verifier has never seen such programs before, so some valid
programs may get rejected. After some time passes and we're sure that
all progs still work fine when they're optimized with 32-bit ALU, we
can flip the switch in llvm and make it the default.


Sounds good to me! Could this be done for bpf target via -mattr=+alu32
feature flag or similar in llc?


Re: [iovisor-dev] [PATCH RFC 0/4] Initial 32-bit eBPF encoding support

2017-09-18 Thread Daniel Borkmann via iovisor-dev

On 09/18/2017 10:47 PM, Jiong Wang wrote:

Hi,

   Currently, the LLVM eBPF backend always generates code in 64-bit mode; this may
cause trouble when JITing to 32-bit targets.

   For example, it is quite common for an XDP eBPF program to access packet
fields through base + offset. By default eBPF generates BPF_ALU64 for the
address formation, so when JITing to 32-bit hardware BPF_ALU64 needs to be
expanded into 32-bit ALU sequences even though the address space is 32-bit
and the high bits are not significant.

   While a complete 32-bit mode implementation may need a new ABI (something like
-target-abi=ilp32), this patch set first adds some initial code so we can
construct 32-bit eBPF tests through hand-written assembly.

   A new 32-bit register set is introduced; its names use a "w" prefix, and the LLVM
assembler will encode statements like "w1 += w2" with the following 8-bit opcode
field:

 BPF_ADD | BPF_X | BPF_ALU

BPF_ALU will be used instead of BPF_ALU64.

   NOTE: currently you can only use "w" registers with ALU statements, not with
others like branches, as those don't have a different encoding for 32-bit
targets.


Great to see work in this direction! Can we also enable the use / emission of
all the 32-bit BPF_ALU instructions whenever possible for the currently
available bpf targets while at it (which only use BPF_ALU64 right now)?

Thanks,
Daniel



Jiong Wang (4):
   Improve instruction encoding descriptions
   Improve class inheritance in instruction patterns
   New 32-bit register set
   Initial 32-bit ALU (BPF_ALU) encoding support in assembler

  lib/Target/BPF/BPFInstrFormats.td   |  84 +++-
  lib/Target/BPF/BPFInstrInfo.td  | 506 +++-
  lib/Target/BPF/BPFRegisterInfo.td   |  74 +++-
  lib/Target/BPF/Disassembler/BPFDisassembler.cpp |  15 +
  test/MC/BPF/insn-unit-32.s  |  53 +++
  5 files changed, 427 insertions(+), 305 deletions(-)
  create mode 100644 test/MC/BPF/insn-unit-32.s





Re: [iovisor-dev] [PATCH RFC] bpf: add connection tracking helper functions

2017-09-04 Thread Daniel Borkmann via iovisor-dev

On 09/01/2017 01:30 PM, William Tu wrote:

This patch adds two BPF conntrack helper functions, bpf_ct_lookup()
and bpf_ct_commit(), to enable the possibility of BPF stateful firewall.

There are two ways to implement BPF conntrack.  One way is to not
rely on helpers but implement the conntrack state table using BPF
maps.  So conntrack is basically another BPF program extracting
the tuples and looking up/updating its map.  Currently the Cilium
project has implemented it this way.

Instead, this patch proposes to reuse the existing netfilter
conntrack.  By creating two helpers, bpf_skb_ct_lookup() and
bpf_skb_ct_commit(), a BPF program can simply look up the conntrack
state based on the conntrack zone, or commit a new connection into the
netfilter conntrack table for a later lookup hit.  In the case where
a fragmented packet is received, the helper handles it by submitting
the packet to ip_defrag() and returning TC_ACT_STOLEN if it is queued.

A simple example is to allow PING packets to your host only when
your host initiates the ping.  The first ping from your host to the
outside will be tracked in the conntrack table as new, and the echo
reply coming into your host will make it established.  So the
subsequent pings will pass the firewall. (see tcbpf3_kern.c and
test_conntrack_bpf.h)

Now the helper is attached at TC and requires an skb to interact with
netfilter conntrack, so XDP programs do not apply here.
In the future, I'd like to test/implement IP defragmentation, ALG,
and NAT.  But I'd like to have some early feedback first. Thanks!

Signed-off-by: William Tu 
Cc: Joe Stringer 
Cc: Daniel Borkmann 
Cc: Alexei Starovoitov 

[...]

Just curious, were you able to do some performance testing with
the helper?

[...]

FN(unspec), \
@@ -638,6 +650,8 @@ union bpf_attr {
FN(redirect_map),   \
FN(sk_redirect_map),\
FN(sock_map_update),\
+   FN(skb_ct_lookup),  \
+   FN(skb_ct_commit),  \

  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
   * function eBPF program intends to call
@@ -737,6 +751,23 @@ struct bpf_tunnel_key {
__u32 tunnel_label;
  };

+#define BPF_CT_LABELS_LEN 16
+#define BPF_F_CT_DEFRAG (1ULL << 0)
+
+struct bpf_conntrack_info {
+   __u32 ct_state;
+   __u16 family;
+   __u16 zone_id;


Have you considered also supporting zone directions, so you can have
overlapping tuples? Presumably if you have some other interaction
with CT where they are used and you do the lookup here with dir as
NF_CT_DEFAULT_ZONE_DIR, then CT won't find them if I remember correctly.


+   /* for conntrack mark */
+   __u32 mark_value;
+   __u32 mark_mask;
+
+   /* for conntrack label */
+   __u8 ct_label_value[16];
+   __u8 ct_label_mask[16];
+};
+
  /* Generic BPF return codes which all BPF program types may support.
   * The values are binary compatible with their TC_ACT_* counter-part to
   * provide backwards compatibility with existing SCHED_CLS and SCHED_ACT
diff --git a/net/core/filter.c b/net/core/filter.c
index c6a37fe0285b..906518043f19 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -56,6 +56,11 @@
  #include 
  #include 
  #include 
+#include 
+#include 
+#include 
+#include 
+#include 

  /**
   *sk_filter_trim_cap - run a packet through a socket filter
@@ -2674,6 +2679,221 @@ static unsigned long bpf_skb_copy(void *dst_buff, const 
void *skb,
return 0;
  }

+/* Lookup the conntrack with bpf_conntrack_info.
+ * If found, receive the conntrack state, mark, and label
+ */
+BPF_CALL_3(bpf_skb_ct_lookup, struct sk_buff *, skb,
+  struct bpf_conntrack_info *, info, u64, flags)
+{
+   struct net *net = dev_net(skb->dev);
+   enum ip_conntrack_info ctinfo;
+   struct nf_conntrack_zone zone;
+   struct nf_conn_labels *cl;
+   struct nf_conn *ct, *tmpl;
+   int err, nh_ofs;
+   struct iphdr *iph;


Please err out on invalid flags, otherwise new flags cannot
be added anymore since they would break existing programs.


+   /* Conntrack expects L3 packet */
+   nh_ofs = skb_network_offset(skb);
+   skb_pull_rcsum(skb, nh_ofs);
+   iph = ip_hdr(skb);


What happens if you call this function with an IPv6 or some
other non-IPv4 pkt and go below into IPv4 defrag for example?

We would also need to handle this for IPv6 in general, and
bail out if skb->protocol is neither IPv4 nor IPv6.


+   if (ip_is_fragment(iph) && (flags & BPF_F_CT_DEFRAG)) {
+   /* XXX: not test yet */
+   enum ip_defrag_users user =
+   IP_DEFRAG_CONNTRACK_IN + info->zone_id;
+
+   memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));


The cb[] is already used in this layer by tc and BPF, you would
somewhat need to save / restore the content of the cb[] before
returning 

Re: [iovisor-dev] [PATCH v2 net-next 5/5] bpf/verifier: document liveness analysis

2017-08-23 Thread Daniel Borkmann via iovisor-dev

On 08/23/2017 04:11 PM, Edward Cree wrote:

The liveness tracking algorithm is quite subtle; add comments to explain it.

Signed-off-by: Edward Cree 


Acked-by: Daniel Borkmann 


Re: [iovisor-dev] [PATCH v2 net-next 2/5] bpf/verifier: when pruning a branch, ignore its write marks

2017-08-23 Thread Daniel Borkmann via iovisor-dev

On 08/23/2017 04:10 PM, Edward Cree wrote:

The fact that writes occurred in reaching the continuation state does
  not screen off its reads from us, because we're not really its parent.
So detect 'not really the parent' in do_propagate_liveness, and ignore
  write marks in that case.

Fixes: dc503a8ad984 ("bpf/verifier: track liveness for pruning")
Signed-off-by: Edward Cree 


Acked-by: Daniel Borkmann 


Re: [iovisor-dev] [PATCH v2 net-next 3/5] selftests/bpf: add a test for a pruning bug in the verifier

2017-08-23 Thread Daniel Borkmann via iovisor-dev

On 08/23/2017 04:10 PM, Edward Cree wrote:

From: Alexei Starovoitov 

The test makes a read through a map value pointer, then considers pruning
  a branch where the register holds an adjusted map value pointer.  It
  should not prune, but currently it does.

Signed-off-by: Alexei Starovoitov 
[ec...@solarflare.com: added test-name and patch description]
Signed-off-by: Edward Cree 


Acked-by: Daniel Borkmann 


Re: [iovisor-dev] [PATCH v2 net-next 1/5] selftests/bpf: add a test for a bug in liveness-based pruning

2017-08-23 Thread Daniel Borkmann via iovisor-dev

On 08/23/2017 04:09 PM, Edward Cree wrote:

Writes in straight-line code should not prevent reads from propagating
  along jumps.  With current verifier code, the jump from 3 to 5 does not
  add a read mark on 3:R0 (because 5:R0 has a write mark), meaning that
  the jump from 1 to 3 gets pruned as safe even though R0 is NOT_INIT.

Verifier output:
0: (61) r2 = *(u32 *)(r1 +0)
1: (35) if r2 >= 0x0 goto pc+1
  R1=ctx(id=0,off=0,imm=0) R2=inv(id=0,umax_value=4294967295,var_off=(0x0; 
0x)) R10=fp0
2: (b7) r0 = 0
3: (35) if r2 >= 0x0 goto pc+1
  R0=inv0 R1=ctx(id=0,off=0,imm=0) 
R2=inv(id=0,umax_value=4294967295,var_off=(0x0; 0x)) R10=fp0
4: (b7) r0 = 0
5: (95) exit

from 3 to 5: safe

from 1 to 3: safe
processed 8 insns, stack depth 0

Signed-off-by: Edward Cree 


Acked-by: Daniel Borkmann 


Re: [iovisor-dev] [PATCH v3 net-next] bpf/verifier: track liveness for pruning

2017-08-21 Thread Daniel Borkmann via iovisor-dev

On 08/21/2017 10:44 PM, Edward Cree wrote:

On 21/08/17 21:27, Daniel Borkmann wrote:

On 08/21/2017 08:36 PM, Edward Cree wrote:

On 19/08/17 00:37, Alexei Starovoitov wrote:

[...]

I'm tempted to just rip out env->varlen_map_value_access and always check
   the whole thing, because honestly I don't know what it was meant to do
   originally or how it can ever do any useful pruning.  While drastic, it
   does cause your test case to pass.


Original intention from 484611357c19 ("bpf: allow access into map
value arrays") was that it wouldn't potentially make pruning worse
if PTR_TO_MAP_VALUE_ADJ was not used, meaning that we wouldn't need
to take reg state's min_value and max_value into account for state
checking; this was basically due to min_value / max_value is being
adjusted/tracked on every alu/jmp ops for involved regs (e.g.
adjust_reg_min_max_vals() and others that mangle them) even if we
have the case that no actual dynamic map access is used throughout
the program. To give an example on net tree, the bpf_lxc.o prog's
section increases from 36,386 to 68,226 when env->varlen_map_value_access
is always true, so it does have an effect. Did you do some checks
on this on net-next?

I tested with the cilium progs and saw no change in insn count.  I
  suspect that for the normal case I already killed this optimisation
  when I did my unification patch, it was previously about ignoring
  min/max values on all regs (including scalars), whereas on net-next
  it only ignores them on map_value pointers; in practice this is
  useless because we tend to still have the offset scalar sitting in
  a register somewhere.  (Come to think of it, this may have been
  behind a large chunk of the #insn increase that my patches caused.)


Yeah, this would seem plausible.


Since we use umax_value in find_good_pkt_pointers() now (to check
  against MAX_PACKET_OFF and ensure our reg->range is really ok), we
  can't just stop caring about all min/max values just because we
  haven't done any variable map accesses.
I don't see a way around this.


Agree, was thinking the same. If there's not really a regression in
terms of complexity, then lets kill the flag.


Re: [iovisor-dev] [PATCH v3 net-next] bpf/verifier: track liveness for pruning

2017-08-15 Thread Daniel Borkmann via iovisor-dev

On 08/15/2017 09:34 PM, Edward Cree wrote:

State of a register doesn't matter if it wasn't read in reaching an exit;
  a write screens off all reads downstream of it from all explored_states
  upstream of it.
This allows us to prune many more branches; here are some processed insn
  counts for some Cilium programs:
Program  before  after
bpf_lb_opt_-DLB_L3.o   6515   3361
bpf_lb_opt_-DLB_L4.o   8976   5176
bpf_lb_opt_-DUNKNOWN.o 2960   1137
bpf_lxc_opt_-DDROP_ALL.o  95412  48537
bpf_lxc_opt_-DUNKNOWN.o  141706  78718
bpf_netdev.o  24251  17995
bpf_overlay.o 10999   9385

The runtime is also improved; here are 'time' results in ms:
Program  before  after
bpf_lb_opt_-DLB_L3.o 24  6
bpf_lb_opt_-DLB_L4.o 26 11
bpf_lb_opt_-DUNKNOWN.o   11  2
bpf_lxc_opt_-DDROP_ALL.o   1288    139
bpf_lxc_opt_-DUNKNOWN.o    1768    234
bpf_netdev.o 62 31
bpf_overlay.o15 13

Signed-off-by: Edward Cree 


Acked-by: Daniel Borkmann 


Re: [iovisor-dev] [PATCH v5 net-next 00/12] bpf: rewrite value tracking in verifier

2017-08-07 Thread Daniel Borkmann via iovisor-dev

On 08/07/2017 04:21 PM, Edward Cree wrote:

This series simplifies alignment tracking, generalises bounds tracking and
  fixes some bounds-tracking bugs in the BPF verifier.  Pointer arithmetic on
  packet pointers, stack pointers, map value pointers and context pointers has
  been unified, and bounds on these pointers are only checked when the pointer
  is dereferenced.
Operations on pointers which destroy all relation to the original pointer
  (such as multiplies and shifts) are disallowed if !env->allow_ptr_leaks,
  otherwise they convert the pointer to an unknown scalar and feed it to the
  normal scalar arithmetic handling.
Pointer types have been unified with the corresponding adjusted-pointer types
  where those existed (e.g. PTR_TO_MAP_VALUE[_ADJ] or FRAME_PTR vs
  PTR_TO_STACK); similarly, CONST_IMM and UNKNOWN_VALUE have been unified into
  SCALAR_VALUE.
Pointer types (except CONST_PTR_TO_MAP, PTR_TO_MAP_VALUE_OR_NULL and
  PTR_TO_PACKET_END, which do not allow arithmetic) have a 'fixed offset' and
  a 'variable offset'; the former is used when e.g. adding an immediate or a
  known-constant register, as long as it does not overflow.  Otherwise the
  latter is used, and any operation creating a new variable offset creates a
  new 'id' (and, for PTR_TO_PACKET, clears the 'range').
SCALAR_VALUEs use the 'variable offset' fields to track the range of possible
  values; the 'fixed offset' should never be set on a scalar.


Been testing and reviewing the series over the last several days, looks
reasonable to me as far as I can tell. Thanks for all the hard work on
unifying this, Edward!

Acked-by: Daniel Borkmann 


Re: [iovisor-dev] [PATCH v3 net-next 00/12] bpf: rewrite value tracking in verifier

2017-07-07 Thread Daniel Borkmann via iovisor-dev

On 07/07/2017 02:50 PM, Edward Cree wrote:

On 07/07/17 10:14, Daniel Borkmann wrote:

But this means the bpf_lxc_* cases increase quite significantly,
arguably one of them is pretty close already, but the other one not
so much, meaning while 142k would shoot over the 128k target quite a
bit, the 95k is quite close to the point that it wouldn't take much,
say, few different optimizations from compiler, to hit the limit as
well eventually, something like 156k for the time being would seem a
more adequate raise perhaps that needs to be evaluated carefully
given the situation.

Note that the numbers in my table are the _sum_ of all the progs in the
  object file, not the #insns for a single program.  (Hence the awk
  invocation in my pipeline.)  For instance in bpf_lxc_opt_-DUNKNOWN.o
  on net-next there were (iirc) a couple of 30k progs and then some
  smaller ones, not a single 93k prog.


Okay, sorry, seems I misread in that case.


Re: [iovisor-dev] [PATCH v3 net-next 00/12] bpf: rewrite value tracking in verifier

2017-07-04 Thread Daniel Borkmann via iovisor-dev

On 07/04/2017 09:22 PM, Edward Cree wrote:

On 30/06/17 19:15, Alexei Starovoitov wrote:

On 6/30/17 9:44 AM, Edward Cree wrote:

I haven't measured the test_progs ones, because I *still* haven't gotten
  around to actually setting up a BPF toolchain (it doesn't help that I'm
  building everything on a test server that gets reimaged every night to
  run our nightly tests...).


then you're missing a lot of tests then...
installing llvm is trivial. On x86 there are plenty of pre-built
packages that you can apt-get or yum.
Dave had to compile llvm and gcc from source on sparc, so volatile test
server isn't really an excuse to miss all these tests ;)
especially for such large verifier change.


After two days' wrestling with clang's build system, I'm finally able to
  run test_progs, and all its tests pass as of the full patch series.


(Hmm, usually with major distros LLVM comes with BPF targets enabled
by default these days, so there's less need to compile it from scratch
actually, just installation via yum/apt/... would suffice then.)


Here are the processed insn counts:

Program             net-next  short  full
test_pkt_access           78     79    79
test_xdp                 386    411   407
test_l4lb               6438   4154  4154
test_tcp_estats          435    436   435
test_bpf_obj_id            8      8     8
test_pkt_md_access        41     42    42

"short" is the first 3 patches plus the 'roll back ptr' patch I
  posted on Friday.  "full" is the full 12-patch series.  "Program" is
  the function in test_progs.c.
I don't know why test_l4lb has to process _fewer_ insns with my patches;
  if anything I'm worrying that I may be incorrectly pruning branches.
(I've spotted a possible bug in that I'm not looking at 'id' which,
  although it doesn't have to match, if two regs in the old state had the
  same id as each other, then those regs in the new state have to have
  the same id as each other too.)
Also interesting is that going from "short" to "full" only decreases the
  counts, suggesting that the ptr and full negative/positive
  tracking isn't (at least for these test cases) costly.


Have you tried with cilium's BPF code? The kernel selftests are quite small,
so not really pushing processed insns too far. I can send you a BPF obj file
if that's easier for testing.

Thanks,
Daniel


Re: [iovisor-dev] [PATCH v3 net-next 00/12] bpf: rewrite value tracking in verifier

2017-06-28 Thread Daniel Borkmann via iovisor-dev

On 06/28/2017 04:11 PM, Edward Cree wrote:

On 28/06/17 14:50, Daniel Borkmann wrote:

Hi Edward,

Did you also have a chance in the meantime to look at reducing complexity
along with your unification? I did run the cilium test suite with your
latest set from here and current # worst case processed insns that
verifier has to go through for cilium progs increases from ~53k we have
right now to ~76k. I'm a bit worried that this quickly gets us close to
the upper ~98k max limit starting to reject programs again. Alternative
is to bump the complexity limit again in near future once run into it,
but preferably there's a way to optimize it along with the rewrite? Do
you see any possibilities worth exploring?

The trouble, I think, is that as we're now tracking more information about
  each register value, we're less able to prune branches.  But often that
  information is not actually being used in reaching the exit state.  So it


Agree.


  seems like the way to tackle this would be to track what information is
  used — or at least, which registers are read from (including e.g. writing
  through them or passing them to helper calls) — in reaching a safe state.
  Then only registers which are used are required to match for pruning.
But that tracking would presumably have to propagate backwards through the
  verifier stack, and I'm not sure how easily that could be done.  Someone
  (was it you?) was talking about replacing the current DAG walking and
  pruning with some kind of basic-block thing, which would help with this.
Summary: I think it could be done, but I haven't looked into the details
  of implementation yet; if it's not actually breaking your programs (yet),
  maybe leave it for a followup patch series?


Could we adapt the limit to 128k perhaps as part of this set
given we know that we're tracking more meta data here anyway?
Then we could potentially avoid going via -stable later on,
biggest pain point is usually tracking differences in LLVM
code generation (e.g. differences in optimizations) along with
verifier changes to make sure that programs still keep loading
on older kernels with e.g. newer LLVM; one of the issues is that
pruning can be quite fragile. E.g. worst case adding a simple
var in a branch that LLVM assigns a stack slot that was otherwise
not used throughout the prog can cause a significant increase of
verifier work (run into this multiple times in the past and
is a bit of a pain to track down actually). If we could keep
some buffer in BPF_COMPLEXITY_LIMIT_INSNS at least when we know
that more work is needed anyway from that point onward, that
would be good.


Re: [iovisor-dev] [PATCH v3 net-next 02/12] bpf/verifier: rework value tracking

2017-06-28 Thread Daniel Borkmann via iovisor-dev

On 06/28/2017 06:07 PM, Edward Cree wrote:

On 28/06/17 16:15, Daniel Borkmann wrote:

On 06/27/2017 02:56 PM, Edward Cree wrote:

Tracks value alignment by means of tracking known & unknown bits.
Tightens some min/max value checks and fixes a couple of bugs therein.


You mean the one in relation to patch 1/12? Would be good to elaborate
here since otherwise this gets forgotten few weeks later.

That wasn't the only one; there were also some in the new min/max value
  calculation for ALU ops.  For instance, in subtraction we were taking
  the new bounds as [min-min, max-max] instead of [min-max, max-min].
I can't remember what else there was and there might also have been some
  that I missed but that got incidentally fixed by the rewrite.  But I
  guess I should change "checks" to "checks and updates" in the above?


Ok. Would be good though to have them all covered in the selftests
part of your series if possible, so we can make sure to keep track
of these cases.


Could you also document all the changes that verifier will then start
allowing for after the patch?

Maybe not the changes, because the old verifier had a lot of special
  cases, but I could, and probably should, document the new behaviour
  (maybe in Documentation/networking/filter.txt, that already has a bit
  of description of the verifier).


Yeah, that would definitely help; filter.txt should be fine.


[...]

   /* check whether memory at (regno + off) is accessible for t = (read | write)
@@ -899,52 +965,79 @@ static int check_mem_access(struct bpf_verifier_env *env, 
int insn_idx, u32 regn
struct bpf_reg_state *reg = &state->regs[regno];
   int size, err = 0;

-if (reg->type == PTR_TO_STACK)
-off += reg->imm;
-
   size = bpf_size_to_bytes(bpf_size);
   if (size < 0)
   return size;


[...]

-if (reg->type == PTR_TO_MAP_VALUE ||
-reg->type == PTR_TO_MAP_VALUE_ADJ) {
+/* for access checks, reg->off is just part of off */
+off += reg->off;


Could you elaborate on why removing the reg->type == PTR_TO_STACK?

Previously bpf_reg_state had a member 'imm' which, for PTR_TO_STACK, was
  a fixed offset, so we had to add it in to the offset.  Now we instead
  have reg->off and it's generic to all pointerish types, so we don't need
  special handling of PTR_TO_STACK here.

Also in context of below PTR_TO_CTX.

[...]

   } else if (reg->type == PTR_TO_CTX) {
-enum bpf_reg_type reg_type = UNKNOWN_VALUE;
+enum bpf_reg_type reg_type = SCALAR_VALUE;

   if (t == BPF_WRITE && value_regno >= 0 &&
   is_pointer_value(env, value_regno)) {
   verbose("R%d leaks addr into ctx\n", value_regno);
   return -EACCES;
   }
+/* ctx accesses must be at a fixed offset, so that we can
+ * determine what type of data were returned.
+ */
+if (!tnum_is_const(reg->var_off)) {
+char tn_buf[48];
+
+tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off);
+verbose("variable ctx access var_off=%s off=%d size=%d",
+tn_buf, off, size);
+return -EACCES;
+}
+off += reg->var_off.value;


... f.e. in PTR_TO_CTX case the only access that is currently
allowed is LDX/STX with fixed offset from insn->off, which is
passed as off param to check_mem_access(). Can you elaborate on
off += reg->var_off.value? Meaning we make this more dynamic
as long as access is known const?

So, I can't actually figure out how to construct a pointer with a known
  variable offset, but future changes to the verifier (like learning from
  comparing two pointers with the same base) could make it possible.  The
  situation we're handling here is where our register holds ctx + x,
  where x is also known to be some constant value k, and currently I don't
  know if that's possible except for the trivial case of k==0, and the edge
  case where k is too big to fit in the s32 reg->off (in which case the
  check_ctx_access will presumably reject it).
Stepping back a bit, each register holding a pointer type has two offsets,
  reg->off and reg->var_off, and the latter is a tnum representing
  knowledge about a value that's not necessarily exactly known.  But
  tnum_is_const checks that it _is_ exactly known.


Right, I was reviewing this with the thought in mind where we could
run into a pruning situation where in the first path we either add
a scalar or offset to the ctx ptr that is then spilled to stack, later
filled to a reg again with eventual successful exit. And the second path
would prune on the spilled reg, but even if scalar, we require that it's
a _known_ const whereas reading back from stack marks it unknown, so that
is not possible. So all is fine; including your below example since it
all has to be a _known_ scalar.


There is another case that we allow now through the reg->off handling:
  adding a constant to a pointer and then dereferencing it.
So, with r1=ctx, instead of r2 = 

Re: [iovisor-dev] [PATCH v3 net-next 02/12] bpf/verifier: rework value tracking

2017-06-28 Thread Daniel Borkmann via iovisor-dev

On 06/27/2017 02:56 PM, Edward Cree wrote:

Tracks value alignment by means of tracking known & unknown bits.
Tightens some min/max value checks and fixes a couple of bugs therein.


You mean the one in relation to patch 1/12? Would be good to elaborate
here since otherwise this gets forgotten few weeks later.

Could you also document all the changes that verifier will then start
allowing for after the patch?


If pointer leaks are allowed, and adjust_ptr_min_max_vals returns -EACCES,
  treat the pointer as an unknown scalar and try again, because we might be
  able to conclude something about the result (e.g. pointer & 0x40 is either
  0 or 0x40).

Signed-off-by: Edward Cree 

[...]

  /* check whether memory at (regno + off) is accessible for t = (read | write)
@@ -899,52 +965,79 @@ static int check_mem_access(struct bpf_verifier_env *env, 
int insn_idx, u32 regn
	struct bpf_reg_state *reg = &state->regs[regno];
int size, err = 0;

-   if (reg->type == PTR_TO_STACK)
-   off += reg->imm;
-
size = bpf_size_to_bytes(bpf_size);
if (size < 0)
return size;


[...]

-   if (reg->type == PTR_TO_MAP_VALUE ||
-   reg->type == PTR_TO_MAP_VALUE_ADJ) {
+   /* for access checks, reg->off is just part of off */
+   off += reg->off;


Could you elaborate on why removing the reg->type == PTR_TO_STACK?
Also in context of below PTR_TO_CTX.

[...]

} else if (reg->type == PTR_TO_CTX) {
-   enum bpf_reg_type reg_type = UNKNOWN_VALUE;
+   enum bpf_reg_type reg_type = SCALAR_VALUE;

if (t == BPF_WRITE && value_regno >= 0 &&
is_pointer_value(env, value_regno)) {
verbose("R%d leaks addr into ctx\n", value_regno);
return -EACCES;
}
+   /* ctx accesses must be at a fixed offset, so that we can
+* determine what type of data were returned.
+*/
+   if (!tnum_is_const(reg->var_off)) {
+   char tn_buf[48];
+
+   tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off);
+   verbose("variable ctx access var_off=%s off=%d size=%d",
+   tn_buf, off, size);
+   return -EACCES;
+   }
+   off += reg->var_off.value;


... f.e. in PTR_TO_CTX case the only access that is currently
allowed is LDX/STX with fixed offset from insn->off, which is
passed as off param to check_mem_access(). Can you elaborate on
off += reg->var_off.value? Meaning we make this more dynamic
as long as access is known const?


		err = check_ctx_access(env, insn_idx, off, size, t, &reg_type);
if (!err && t == BPF_READ && value_regno >= 0) {
-   mark_reg_unknown_value_and_range(state->regs,
-value_regno);
-   /* note that reg.[id|off|range] == 0 */
+   /* ctx access returns either a scalar, or a
+* PTR_TO_PACKET[_END].  In the latter case, we know
+* the offset is zero.
+*/
+   if (reg_type == SCALAR_VALUE)
+   mark_reg_unknown(state->regs, value_regno);
+   else
+   mark_reg_known_zero(state->regs, value_regno);
+   state->regs[value_regno].id = 0;
+   state->regs[value_regno].off = 0;
+   state->regs[value_regno].range = 0;
state->regs[value_regno].type = reg_type;
-   state->regs[value_regno].aux_off = 0;
-   state->regs[value_regno].aux_off_align = 0;
}

-   } else if (reg->type == FRAME_PTR || reg->type == PTR_TO_STACK) {
+   } else if (reg->type == PTR_TO_STACK) {

[...]
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] [PATCH v3 net-next 01/12] selftests/bpf: add test for mixed signed and unsigned bounds checks

2017-06-28 Thread Daniel Borkmann via iovisor-dev

On 06/27/2017 02:56 PM, Edward Cree wrote:

Currently fails due to bug in verifier bounds handling.

Signed-off-by: Edward Cree 


Acked-by: Daniel Borkmann 
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] How can I know XDP supporting NICs?

2017-04-05 Thread Daniel Borkmann via iovisor-dev

On 04/05/2017 02:09 PM, Paul Chaignon via iovisor-dev wrote:
[...]

Although it's not yet listed on the bcc repository, the Netronome nfp
driver does have support for XDP:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/
linux.git/commit/?id=ecd63a0217d5f1e8a92f7516f5586d1177b95de2


If you have a kernel git tree at hand, you can also check current in-tree
drivers that implement ndo_xdp callback:

$ git grep -n ndo_xdp drivers/
drivers/net/ethernet/broadcom/bnxt/bnxt.c:7303: .ndo_xdp= 
bnxt_xdp,
drivers/net/ethernet/mellanox/mlx4/en_netdev.c:2870:.ndo_xdp
= mlx4_xdp,
drivers/net/ethernet/mellanox/mlx4/en_netdev.c:2907:.ndo_xdp
= mlx4_xdp,
drivers/net/ethernet/mellanox/mlx5/core/en_main.c:3593: .ndo_xdp
 = mlx5e_xdp,
drivers/net/ethernet/mellanox/mlx5/core/en_main.c:3629: .ndo_xdp
 = mlx5e_xdp,
drivers/net/ethernet/netronome/nfp/nfp_net_common.c:2957:   .ndo_xdp
= nfp_net_xdp,
drivers/net/ethernet/qlogic/qede/qede_main.c:556:   .ndo_xdp = qede_xdp,
drivers/net/virtio_net.c:1906:  .ndo_xdp= virtnet_xdp,
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] How to attach a BPF program to TC using C/C++

2017-04-03 Thread Daniel Borkmann via iovisor-dev

On 04/03/2017 03:40 PM, Mauricio Vasquez via iovisor-dev wrote:

Dear All,

I am trying to inject a BPF_PROG_TYPE_SCHED_ACT program and attach it to the TC 
using C++.


Out of curiosity, why not .._SCHED_CLS program in direct-action mode
(same functionality wrt BPF and would save you additional layers of
indirection in tc at least), or are you required to run something
different as a classifier entity?
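For reference, the direct-action attach Daniel refers to can be done from the shell with iproute2; this is only a sketch, and "eth0", "prog.o" and the section name "classifier" are placeholder names:

```shell
# Sketch: attach a SCHED_CLS program in direct-action ("da") mode with tc.
# "eth0", "prog.o" and the section name "classifier" are assumed names.
tc qdisc add dev eth0 clsact                     # one-time ingress/egress hook
tc filter add dev eth0 ingress bpf da obj prog.o sec classifier
tc filter show dev eth0 ingress                  # verify the attachment
```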


I had no particular problems compiling and injecting the program, however I 
haven't been able to attach it to TC, I know that in go I can use the netlink 
library [1] or in Python pyroute2 [2]; unfortunately there appears to be no 
similar library in C++.

Is there a library that implements this functionality for C or C++? in case of 
negative, does some body have a example code of how to achieve this using the 
netlink protocol?


Not aware of a C++ lib (but I also haven't looked much). You could
check the iproute2 code, f.e. tc/f_bpf.c to set up the nl message.


Thanks in Advance,

Mauricio

[1] https://github.com/vishvananda/netlink

[2] http://docs.pyroute2.org/

___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev




Re: [iovisor-dev] Reading src/dst addresses in reuseport programs

2017-03-30 Thread Daniel Borkmann via iovisor-dev

On 03/29/2017 09:00 PM, Alexei Starovoitov via iovisor-dev wrote:

On Tue, Mar 28, 2017 at 7:08 PM, Marek Vavruša via iovisor-dev
 wrote:

Hi,

so I was tinkering with programs attached to reuseport groups again, and
simple decisions work as expected.
The problem comes when I want to make decisions based on source or
destination IP address. Since the sockets
in the socket group are already UDP or TCP, I can only see the payloads. In
userspace, I can use IP(V6)_PKTINFO and get this over ancillary data, but I
couldn't find anything like this in helpers, so I'm wondering what to do.
Things I'm considering:

1. I'm probably missing something. I expected to be able to see the whole
packet in reuseport programs (unlike in socket filters bound to sockets).
2. I need to make a helper, or something like SKF_AD_PROTOCOL (or direct
access through skb), but for PKTINFO.
3. I should look at the addresses in tc, tag messages based on
source/destination address (assuming I can convey the tag through some skb
field somehow), and then make decisions based on tags in the socket filter.

Any ideas?


ldabs insn works with negative offset.
unfortunately skb_load_bytes() doesn't.
Or use BPF_NET_OFF ugliness, but it's slower.


Hm, probably makes sense to implement a version of skb_load_bytes() for
socket filter types explicitly that could also deal with such offsets.
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] BPF verifier: cannot add integer value with 0 upper zero bits to ptr_to_packet

2017-02-24 Thread Daniel Borkmann via iovisor-dev

On 02/23/2017 11:00 PM, William Tu wrote:

Hi Daniel,

Thanks for the reply! I understand the workaround, can you suggest a
way to try it? Should I compile the C source into LLVM IR bitcode, add
the "r0 &= 0x", then assemble it into a BPF target binary?


Hmm, problem with C code is that we might have no control over when llvm
decides to spill to stack where this workaround would then be needed. Your
suggestion could work, maybe an llvm expert can jump in here if we have
some other, easier options at hand?

Second case is to preserve the type in verifier when spilled, but will make
pruning ineffective to the point that we could see regressions of programs
not loading anymore (due to running over BPF_COMPLEXITY_LIMIT_INSNS).
states_equal() wrt probing stack space doesn't have much optimizations
thus far either; certainly this path would need a careful evaluation.


On Wed, Feb 22, 2017 at 9:08 AM, Daniel Borkmann  wrote:

On 02/17/2017 06:07 PM, William Tu via iovisor-dev wrote:


Hi,

I encountered another BPF verifier issue related to my previous one
https://lists.iovisor.org/pipermail/iovisor-dev/2017-January/000603.html

My guess is that when a register is spilled to the stack and restored, the
state of the imm upper zero bits does not get restored? Any comments are
appreciated!

This time the verifier shows:
---
   R0=imm0,min_value=0,max_value=0 R1=imm58,min_value=58,max_value=58
R3=pkt(id=0,off=58,r=58) R4=inv61 R5=pkt_end
R6=imm144,min_value=144,max_value=144 R7=imm0,min_value=0,max_value=0
R8=ctx R9=pkt(id=0,off=0,r=58) R10=fp
260: (bf) r5 = r6
261: (47) r5 |= 12
262: (bf) r1 = r5
263: (07) r1 += 44
264: (77) r1 >>= 3
265: (7b) *(u64 *)(r10 -64) = r1
266: (bf) r7 = r5
267: (07) r7 += 36
268: (77) r7 >>= 3
269: (bf) r0 = r5
270: (07) r0 += 20
271: (77) r0 >>= 3
272: (bf) r1 = r5
273: (07) r1 += 52
274: (77) r1 >>= 3
275: (77) r6 >>= 3
276: (79) r2 = *(u64 *)(r10 -24)
277: (bf) r2 = r5
278: (77) r2 >>= 3
279: (7b) *(u64 *)(r10 -184) = r2
280: (07) r5 += 180
281: (77) r5 >>= 3
282: (bf) r4 = r9
283: (0f) r4 += r5
284: (47) r5 |= 1
285: (bf) r3 = r9
286: (0f) r3 += r6
287: (bf) r6 = r9
288: (0f) r6 += r1
289: (47) r1 |= 1
290: (bf) r2 = r9
291: (0f) r2 += r0
292: (7b) *(u64 *)(r10 -312) = r2
293: (bf) r2 = r9
294: (79) r0 = *(u64 *)(r10 -184)
295: (0f) r2 += r0
cannot add integer value with 0 upper zero bits to ptr_to_packet


Right, so lines 260, 261 r5 is imm type, which in 277 is transferred
to r2 and eventually in 279 spilled to stack. check_stack_write() would
see that type is not spillable, so it will be stored as regular data
to stack (STACK_MISC). Later read in 294 will restore that to r0.
check_stack_read() sees that stack was marked as STACK_MISC and marks
reg as unknown, where you'll later get the error when added to ptr_to_packet
as no overflow protection can be guaranteed (imm is 0). Afaik (Alexei,
might correct me if I'm wrong), if we would support spilling imm, this
could have a non-trivial increase in complexity, where verifier would
need to do a lot more work due to pruning not taking effect. If you
look at evaluate_reg_alu(), you could try to work around it for the time
being by doing r0 &= 0x right before the r2 += r0.

Thanks,
Daniel

___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] bpf invalid stack off=-528

2017-02-09 Thread Daniel Borkmann via iovisor-dev

On 02/08/2017 09:22 PM, William Tu via iovisor-dev wrote:

Hi,

I have a program which uses at most around 300 bytes of local stack, as shown
below. The largest struct is "struct Headers", which is around 80 bytes.
However, when loading it into the verifier, it says

393: (7b) *(u64 *)(r10 -56) = r1
394: (05) goto pc+56
451: (7b) *(u64 *)(r10 -528) = r0
invalid stack off=-528 size=8

I don't think I'm using more than 512 bytes. It seems that LLVM generates
code which uses more stack memory than I thought. Any idea how to debug it?
Or how to dump the LLVM IR to see how it allocates stack? Thanks

snippet of the code:
SEC("prog")
int ebpf_filter(struct xdp_md* skb){
 struct Headers hd;
 unsigned ebpf_packetOffsetInBits = 0;
 enum ebpf_errorCodes ebpf_errorCode = NoError;
 void* ebpf_packetStart = ((void*)(long)skb->data);
 void* ebpf_packetEnd = ((void*)(long)skb->data_end);
 u32 ebpf_zero = 0;
 u8 ebpf_byte = 0;
 u32 ebpf_outHeaderLength = 0;
 struct xdp_output xout;
 /* TODO: this should be initialized by the environment. HOW? */
 struct xdp_input xin;

 goto start;
 start: {
 /* extract(hd.ethernet


Looks like a code generation issue perhaps? It seems your prog parses
and fills a huge struct on the stack, f.e. eth dest is filled from
direct packet access byte by byte (inefficient, but fair enough). The
below annotation somehow seems to be off slightly (?), but it's always
patterns like:

  r1 = *(u8 *)(r2 + 11)  ; load byte 1 from pkt into reg (u8)
  *(u64 *)(r10 - 408) = r1   ; store byte 1 into stack (u64)

  r1 = *(u8 *)(r2 + 10)  ; load byte 2 from pkt into reg (u8)
  *(u64 *)(r10 - 416) = r1   ; store byte 2 into stack (u64)

[...]
; void* ebpf_packetStart = ((void*)(long)skb->data);
   2:   r2 = *(u32 *)(r6 + 0)
[...]
; hd.ethernet.destination[5] = (u8)((load_byte(ebpf_packetStart, 
BYTES(ebpf_packetOffsetInBits) + 5) >> 0));
   9:   r1 = *(u8 *)(r2 + 11)
; hd.ethernet.destination[4] = (u8)((load_byte(ebpf_packetStart, 
BYTES(ebpf_packetOffsetInBits) + 4) >> 0));
  10:   *(u64 *)(r10 - 408) = r1
  11:   r1 = *(u8 *)(r2 + 10)
; hd.ethernet.destination[3] = (u8)((load_byte(ebpf_packetStart, 
BYTES(ebpf_packetOffsetInBits) + 3) >> 0));
  12:   *(u64 *)(r10 - 416) = r1
  13:   r1 = *(u8 *)(r2 + 9)
; hd.ethernet.destination[2] = (u8)((load_byte(ebpf_packetStart, 
BYTES(ebpf_packetOffsetInBits) + 2) >> 0));
  14:   *(u64 *)(r10 - 424) = r1
  15:   r1 = *(u8 *)(r2 + 8)
; hd.ethernet.destination[1] = (u8)((load_byte(ebpf_packetStart, 
BYTES(ebpf_packetOffsetInBits) + 1) >> 0));
  16:   *(u64 *)(r10 - 432) = r1
  17:   r1 = *(u8 *)(r2 + 7)
; hd.ethernet.destination[0] = (u8)((load_byte(ebpf_packetStart, 
BYTES(ebpf_packetOffsetInBits) + 0) >> 0));
  18:   *(u64 *)(r10 - 440) = r1
  19:   r1 = *(u8 *)(r2 + 6)
[...]

Despite a struct of:

struct Ethernet {
char source[6]; /* bit<48> */
char destination[6]; /* bit<48> */
u16 protocol; /* bit<16> */
u8 ebpf_valid;
};

Maybe packing structs helps a bit (but still shouldn't be such waste,
hmm ...).


full C, objdump
https://gist.github.com/williamtu/5a09b60a951ee5fc062328766403ab4b
thanks
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev





Re: [iovisor-dev] Documentation on eBPF map types?

2017-02-02 Thread Daniel Borkmann via iovisor-dev

On 02/02/2017 06:00 PM, Jesper Dangaard Brouer via iovisor-dev wrote:

On Thu, 2 Feb 2017 11:56:19 +0100
Jesper Dangaard Brouer  wrote:


On Tue, 31 Jan 2017 20:54:10 -0800
Alexei Starovoitov  wrote:


On Tue, Jan 31, 2017 at 9:54 AM, Jesper Dangaard Brouer via
iovisor-dev  wrote:


On Sat, 28 Jan 2017 10:14:58 +1100 Brendan Gregg  
wrote:


I did some in the bcc ref guide, but it's incomplete, and the bcc versions:
https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md


Thanks! - this seems to be rather BCC-specific syntax, and I'm looking for
documentation close for the kernel (samples/bpf/).

The best documentation I found was the man-page for the syscall bpf(2):
  http://man7.org/linux/man-pages/man2/bpf.2.html

In lack of a better place, I've started documenting eBPF here:
  https://prototype-kernel.readthedocs.io/en/latest/bpf/index.html

This doc is compatible with the kernels doc format, and I hope we can
push this into the kernel tree, if it turns out to be valuable?
(https://www.kernel.org/doc/html/latest/)


yeah. definitely would be great to add map descriptions to the kernel docs.
So far most of it is in commit logs.
git log kernel/bpf/arraymap.c|tail -33
git log kernel/bpf/hashtab.c|tail -33
will give an overview of key hash and array map principles.


Thanks, I'm using that to write some doc.
  http://prototype-kernel.readthedocs.io/en/latest/bpf/ebpf_maps.html
Gotten to BPF_MAP_TYPE_ARRAY
  
http://prototype-kernel.readthedocs.io/en/latest/bpf/ebpf_maps.html#bpf-map-type-array

Can you explain the difference between the kernel and userspace side of
the call bpf_map_lookup_elem() ?

Kernel side:
   long *value = bpf_map_lookup_elem(&my_map, &key);

Userspace side:
   long long value;
   bpf_map_lookup_elem(map_fd[0], &key, &value)

Looks like userspace gets a copy of the memory...
If so, how can userspace then increment the value safely?


I documented this myself, please correct me.
  
http://prototype-kernel.readthedocs.io/en/latest/bpf/ebpf_maps.html#creating-a-map
  
http://prototype-kernel.readthedocs.io/en/latest/bpf/ebpf_maps.html#interacting-with-maps

Commit:
  https://github.com/netoptimizer/prototype-kernel/commit/ffbcdc453f3c

Inlined below:
  -

[PATCH] doc: how interacting with eBPF maps works

Documented by reading the code.

I hope someone more knowledgeable will review this and
correct me where I misunderstood things.

Signed-off-by: Jesper Dangaard Brouer 


Would be great if you could submit some of the missing pieces to
linux-...@vger.kernel.org for bpf(2) man-page inclusion while at
it, so this doesn't need to be done twice. Thanks for documenting.

Note that struct bpf_map_def can vary based on the bpf loader
implementation f.e. iproute2 has a different layout and also makes
use of map/prog pinning (see iproute2's lib/bpf.c).

Thanks,
Daniel
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] Howto debug eBPF programs and bpf_trace_printk

2017-01-24 Thread Daniel Borkmann via iovisor-dev

On 01/24/2017 10:52 AM, Jesper Dangaard Brouer via iovisor-dev wrote:

Hi IOvisor-group,

I'm playing with the kernel's samples/bpf, and want some advice on debugging
eBPF programs.

First question: I noticed bpf_trace_printk(), but nothing shows up when
I'm using it... what did I miss/forget?


Depends where you look. F.e. tc has 'tc exec bpf dbg' to find the mount
and dump the trace pipe.


Second question: When I make some mistake the eBPF validator does not
like, I get a dump of eBPF asm code which I cannot correlate to line
numbers in the pseudo-C code. Is there some debug option I can enable
to get some more help?  (Saw William had some additional ";" output)


Yep, see discussion from netdev:

  https://www.spinics.net/lists/netdev/msg406926.html
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] BPF verifier: cannot add integer value with 0 upper zero bits to ptr_to_packet

2017-01-19 Thread Daniel Borkmann via iovisor-dev

On 01/19/2017 04:53 PM, William Tu wrote:

thanks for the patch, it works.


Thanks a lot, I'll see to cook a proper patch then.


from 52 to 80: R0=imm1,min_value=1,max_value=1 R1=pkt(id=0,off=0,r=34)
R2=pkt_end R3=inv R4=imm272 R5=inv56,min_value=17,max_value=17
R6=pkt(id=0,off=26,r=34) R10=fp
80: (07) r4 += 71
81: (18) r5 = 0xfff8
83: (5f) r4 &= r5
84: (77) r4 >>= 3
85: (0f) r1 += r4
cannot add integer value with 3 upper zero bits to ptr_to_packet

I have no idea why this time the compiler wants to do "and" then
"right shift", instead of directly shit right 3 bit. R5's type is
UNKNOWN_VALUE, So in addition to your patch, I add



But what the compiler is doing here is still correct eventually I assume?


Yes, it zeroes out the least significant 3 bits first, then shifts right by 3 bits.


We could also add BPF_AND, but that r5's type is UNKNOWN_VALUE doesn't
seem right if I'm not mistaken. That r5 = 0xfff8 should be a dw imm
load, print_bpf_insn() seems only to print first part. So next r4 &= r5
should be purely imm operation as well and thus trackable. Only that we
need to teach verifier to keep the dw imm when not used with analyzer_ops
(nfp related).

Could you try the below (only compile tested here) to see whether the
verifier can now analyze that part?

Thanks,
Daniel

  kernel/bpf/verifier.c   | 35
++---
  tools/testing/selftests/bpf/test_verifier.c | 24 
  2 files changed, 51 insertions(+), 8 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index d60e12c..777ef0a 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1567,19 +1567,44 @@ static int evaluate_reg_imm_alu(struct
bpf_verifier_env *env,

	struct bpf_reg_state *src_reg = &regs[insn->src_reg];
 u8 opcode = BPF_OP(insn->code);

-   /* dst_reg->type == CONST_IMM here, simulate execution of 'add'/'or'
-* insn. Don't care about overflow or negative values, just add them
+   /* dst_reg->type == CONST_IMM here, simulate execution of
'add'/'or'/...
+* insn. Don't care about overflow or negative values, just add
them.
  */
 if (opcode == BPF_ADD && BPF_SRC(insn->code) == BPF_K)
 dst_reg->imm += insn->imm;
 else if (opcode == BPF_ADD && BPF_SRC(insn->code) == BPF_X &&
  src_reg->type == CONST_IMM)
 dst_reg->imm += src_reg->imm;
+   else if (opcode == BPF_SUB && BPF_SRC(insn->code) == BPF_K)
+   dst_reg->imm -= insn->imm;
+   else if (opcode == BPF_SUB && BPF_SRC(insn->code) == BPF_X &&
+src_reg->type == CONST_IMM)
+   dst_reg->imm -= src_reg->imm;
+   else if (opcode == BPF_MUL && BPF_SRC(insn->code) == BPF_K)
+   dst_reg->imm *= insn->imm;
+   else if (opcode == BPF_MUL && BPF_SRC(insn->code) == BPF_X &&
+src_reg->type == CONST_IMM)
+   dst_reg->imm *= src_reg->imm;
 else if (opcode == BPF_OR && BPF_SRC(insn->code) == BPF_K)
 dst_reg->imm |= insn->imm;
 else if (opcode == BPF_OR && BPF_SRC(insn->code) == BPF_X &&
  src_reg->type == CONST_IMM)
 dst_reg->imm |= src_reg->imm;
+   else if (opcode == BPF_AND && BPF_SRC(insn->code) == BPF_K)
+   dst_reg->imm &= insn->imm;
+   else if (opcode == BPF_AND && BPF_SRC(insn->code) == BPF_X &&
+src_reg->type == CONST_IMM)
+   dst_reg->imm &= src_reg->imm;


Thanks, with this patch r5 is now CONST_IMM, not UNKNOWN_VALUE.

Regards,
William



+   else if (opcode == BPF_RSH && BPF_SRC(insn->code) == BPF_K)
+   dst_reg->imm >>= insn->imm;
+   else if (opcode == BPF_RSH && BPF_SRC(insn->code) == BPF_X &&
+src_reg->type == CONST_IMM)
+   dst_reg->imm >>= src_reg->imm;
+   else if (opcode == BPF_LSH && BPF_SRC(insn->code) == BPF_K)
+   dst_reg->imm <<= insn->imm;
+   else if (opcode == BPF_LSH && BPF_SRC(insn->code) == BPF_X &&
+src_reg->type == CONST_IMM)
+   dst_reg->imm <<= src_reg->imm;
 else
 mark_reg_unknown_value(regs, insn->dst_reg);
 return 0;
@@ -2225,14 +2250,8 @@ static int check_ld_imm(struct bpf_verifier_env *env,
struct bpf_insn *insn)
 return err;

 if (insn->src_reg == 0) {
-   /* generic move 64-bit immediate into a register,
-* only analyzer needs to collect the ld_imm value.
-*/
 u64 imm = ((u64)(insn + 1)->imm << 32) | (u32)insn->imm;

-   if (!env->analyzer_ops)
-   return 0;
-
 regs[insn->dst_reg].type = CONST_IMM;
 regs[insn->dst_reg].imm = imm;

 return 0;
diff --git a/tools/testing/selftests/bpf/test_verifier.c
b/tools/testing/selftests/bpf/test_verifier.c
index 1aa7324..d60bd07 100644
--- 

Re: [iovisor-dev] BPF verifier: cannot add integer value with 0 upper zero bits to ptr_to_packet

2017-01-18 Thread Daniel Borkmann via iovisor-dev

On 01/18/2017 06:29 PM, William Tu wrote:

Hi Daniel,
Thanks for the patch. Yes, it solves my issue. Then there is another
related one due to BPF_AND

LBB0_12:
; if (ebpf_packetEnd < ebpf_packetStart + BYTES(ebpf_packetOffsetInBits + 64)) {

from 52 to 80: R0=imm1,min_value=1,max_value=1 R1=pkt(id=0,off=0,r=34)
R2=pkt_end R3=inv R4=imm272 R5=inv56,min_value=17,max_value=17
R6=pkt(id=0,off=26,r=34) R10=fp
80: (07) r4 += 71
81: (18) r5 = 0xfff8
83: (5f) r4 &= r5
84: (77) r4 >>= 3
85: (0f) r1 += r4
cannot add integer value with 3 upper zero bits to ptr_to_packet

I have no idea why this time the compiler wants to do "and" then
"right shift", instead of directly shit right 3 bit. R5's type is
UNKNOWN_VALUE, So in addition to your patch, I add


But what the compiler is doing here is still correct eventually I assume?
We could also add BPF_AND, but that r5's type is UNKNOWN_VALUE doesn't
seem right if I'm not mistaken. That r5 = 0xfff8 should be a dw imm
load, print_bpf_insn() seems only to print first part. So next r4 &= r5
should be purely imm operation as well and thus trackable. Only that we
need to teach verifier to keep the dw imm when not used with analyzer_ops
(nfp related).

Could you try the below (only compile tested here) to see whether the
verifier can now analyze that part?

Thanks,
Daniel

 kernel/bpf/verifier.c   | 35 ++---
 tools/testing/selftests/bpf/test_verifier.c | 24 
 2 files changed, 51 insertions(+), 8 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index d60e12c..777ef0a 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1567,19 +1567,44 @@ static int evaluate_reg_imm_alu(struct bpf_verifier_env 
*env,
	struct bpf_reg_state *src_reg = &regs[insn->src_reg];
u8 opcode = BPF_OP(insn->code);

-   /* dst_reg->type == CONST_IMM here, simulate execution of 'add'/'or'
-* insn. Don't care about overflow or negative values, just add them
+   /* dst_reg->type == CONST_IMM here, simulate execution of 'add'/'or'/...
+* insn. Don't care about overflow or negative values, just add them.
 */
if (opcode == BPF_ADD && BPF_SRC(insn->code) == BPF_K)
dst_reg->imm += insn->imm;
else if (opcode == BPF_ADD && BPF_SRC(insn->code) == BPF_X &&
 src_reg->type == CONST_IMM)
dst_reg->imm += src_reg->imm;
+   else if (opcode == BPF_SUB && BPF_SRC(insn->code) == BPF_K)
+   dst_reg->imm -= insn->imm;
+   else if (opcode == BPF_SUB && BPF_SRC(insn->code) == BPF_X &&
+src_reg->type == CONST_IMM)
+   dst_reg->imm -= src_reg->imm;
+   else if (opcode == BPF_MUL && BPF_SRC(insn->code) == BPF_K)
+   dst_reg->imm *= insn->imm;
+   else if (opcode == BPF_MUL && BPF_SRC(insn->code) == BPF_X &&
+src_reg->type == CONST_IMM)
+   dst_reg->imm *= src_reg->imm;
else if (opcode == BPF_OR && BPF_SRC(insn->code) == BPF_K)
dst_reg->imm |= insn->imm;
else if (opcode == BPF_OR && BPF_SRC(insn->code) == BPF_X &&
 src_reg->type == CONST_IMM)
dst_reg->imm |= src_reg->imm;
+   else if (opcode == BPF_AND && BPF_SRC(insn->code) == BPF_K)
+   dst_reg->imm &= insn->imm;
+   else if (opcode == BPF_AND && BPF_SRC(insn->code) == BPF_X &&
+src_reg->type == CONST_IMM)
+   dst_reg->imm &= src_reg->imm;
+   else if (opcode == BPF_RSH && BPF_SRC(insn->code) == BPF_K)
+   dst_reg->imm >>= insn->imm;
+   else if (opcode == BPF_RSH && BPF_SRC(insn->code) == BPF_X &&
+src_reg->type == CONST_IMM)
+   dst_reg->imm >>= src_reg->imm;
+   else if (opcode == BPF_LSH && BPF_SRC(insn->code) == BPF_K)
+   dst_reg->imm <<= insn->imm;
+   else if (opcode == BPF_LSH && BPF_SRC(insn->code) == BPF_X &&
+src_reg->type == CONST_IMM)
+   dst_reg->imm <<= src_reg->imm;
else
mark_reg_unknown_value(regs, insn->dst_reg);
return 0;
@@ -2225,14 +2250,8 @@ static int check_ld_imm(struct bpf_verifier_env *env, 
struct bpf_insn *insn)
return err;

if (insn->src_reg == 0) {
-   /* generic move 64-bit immediate into a register,
-* only analyzer needs to collect the ld_imm value.
-*/
u64 imm = ((u64)(insn + 1)->imm << 32) | (u32)insn->imm;

-   if (!env->analyzer_ops)
-   return 0;
-
regs[insn->dst_reg].type = CONST_IMM;
regs[insn->dst_reg].imm = imm;
return 0;
diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index 1aa7324..d60bd07 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ 

Re: [iovisor-dev] BPF verifier: cannot add integer value with 0 upper zero bits to ptr_to_packet

2017-01-17 Thread Daniel Borkmann via iovisor-dev

Hi William,

On 01/12/2017 07:32 PM, William Tu via iovisor-dev wrote:

Hi,
I encounter the following BPF verifier error:

from 28 to 30: R0=imm1,min_value=1,max_value=1 R1=pkt(id=0,off=0,r=22)
R2=pkt_end R3=imm144,min_value=144,max_value=144
R4=imm0,min_value=0,max_value=0 R5=inv48,min_value=2054,max_value=2054
R10=fp
30: (bf) r5 = r3
31: (07) r5 += 23
32: (77) r5 >>= 3
33: (bf) r6 = r1
34: (0f) r6 += r5
cannot add integer value with 0 upper zero bits to ptr_to_packet
---
Shouldn't r5 be an imm value, since it comes from r3=144 and we're just
adding 23 and shifting right by 3 bits?
---
the C code is doing data and data_end checking:
ebpf_filter:
; int ebpf_filter(struct xdp_md* skb){
0: r2 = *(u32 *)(r1 + 4)
; void *data = (void *)(long)skb->data;
1: r1 = *(u32 *)(r1 + 0)
; u32 ebpf_zero = 0;
2: r3 = 0
3: *(u32 *)(r10 - 4) = r3
4: r0 = 1
; if (data_end < data + BYTES(ebpf_packetOffsetInBits + 48)) {
5: r3 = r1
6: r3 += 6
7: if r3 > r2 goto 1
8: goto 1
...
LBB0_1:
; if (data_end < data + BYTES(ebpf_packetOffsetInBits + 16)) {
   30: r5 = r3
   31: r5 += 23
   32: r5 >>= 3
   33: r6 = r1
   34: r6 += r5
   35: if r6 > r2 goto 65509
; if (data_end < data + BYTES(ebpf_packetOffsetInBits + 16)) {
...
In which I was checking the packet's boundary. BYTE() is a simple macro doing
#define BYTES(w) ((w + 7) >> 3)


Can you try out the below patch?

evaluate_reg_imm_alu() is invoked on ALU ops where the dst reg is
CONST_IMM and the src is either a constant from the insn or a reg
with type CONST_IMM. Currently we only simulate BPF_ADD / BPF_OR
and if we encounter something else (like in your case shifts), then
we mark the reg from imm to unknown, and as soon as you try to add
that to the pkt pointer, then check_packet_ptr_add() will bark due
to src_reg->imm being 0 from prior mark_reg_unknown_value(). Given
we don't care about overflows or negative values on tracking pure
imm ops, we might as well let evaluate_reg_imm_alu() simulate more
ops than just these two. Does that fix it for you?

Signed-off-by: Daniel Borkmann 
---
 kernel/bpf/verifier.c   | 24 ++--
 tools/testing/selftests/bpf/test_verifier.c | 24 
 2 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index d60e12c..2ac50ae 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1567,19 +1567,39 @@ static int evaluate_reg_imm_alu(struct bpf_verifier_env 
*env,
	struct bpf_reg_state *src_reg = &regs[insn->src_reg];
u8 opcode = BPF_OP(insn->code);

-   /* dst_reg->type == CONST_IMM here, simulate execution of 'add'/'or'
-* insn. Don't care about overflow or negative values, just add them
+   /* dst_reg->type == CONST_IMM here, simulate execution of 'add'/'or'/...
+* insn. Don't care about overflow or negative values, just add them.
 */
if (opcode == BPF_ADD && BPF_SRC(insn->code) == BPF_K)
dst_reg->imm += insn->imm;
else if (opcode == BPF_ADD && BPF_SRC(insn->code) == BPF_X &&
 src_reg->type == CONST_IMM)
dst_reg->imm += src_reg->imm;
+   else if (opcode == BPF_SUB && BPF_SRC(insn->code) == BPF_K)
+   dst_reg->imm -= insn->imm;
+   else if (opcode == BPF_SUB && BPF_SRC(insn->code) == BPF_X &&
+src_reg->type == CONST_IMM)
+   dst_reg->imm -= src_reg->imm;
+   else if (opcode == BPF_MUL && BPF_SRC(insn->code) == BPF_K)
+   dst_reg->imm *= insn->imm;
+   else if (opcode == BPF_MUL && BPF_SRC(insn->code) == BPF_X &&
+src_reg->type == CONST_IMM)
+   dst_reg->imm *= src_reg->imm;
else if (opcode == BPF_OR && BPF_SRC(insn->code) == BPF_K)
dst_reg->imm |= insn->imm;
else if (opcode == BPF_OR && BPF_SRC(insn->code) == BPF_X &&
 src_reg->type == CONST_IMM)
dst_reg->imm |= src_reg->imm;
+   else if (opcode == BPF_RSH && BPF_SRC(insn->code) == BPF_K)
+   dst_reg->imm >>= insn->imm;
+   else if (opcode == BPF_RSH && BPF_SRC(insn->code) == BPF_X &&
+src_reg->type == CONST_IMM)
+   dst_reg->imm >>= src_reg->imm;
+   else if (opcode == BPF_LSH && BPF_SRC(insn->code) == BPF_K)
+   dst_reg->imm <<= insn->imm;
+   else if (opcode == BPF_LSH && BPF_SRC(insn->code) == BPF_X &&
+src_reg->type == CONST_IMM)
+   dst_reg->imm <<= src_reg->imm;
else
mark_reg_unknown_value(regs, insn->dst_reg);
return 0;
diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index 1aa7324..d60bd07 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -2326,6 +2326,30 @@ 

Re: [iovisor-dev] How to update the csum of a trimmed UDP packet?

2017-01-11 Thread Daniel Borkmann via iovisor-dev

On 01/11/2017 10:51 PM, Mauricio Vasquez wrote:

Hi Daniel,

On 01/10/2017 04:49 AM, Daniel Borkmann wrote:

On 01/09/2017 04:48 PM, Mauricio Vasquez via iovisor-dev wrote:

Hello,

I am trying to implement an eBPF program that trims a UDP packet to a predefined 
length. I am able to trim the packet (using the bpf_change_tail helper) and also to 
update the ip->len and udp->length fields. After changing the those fields I 
update the checksums using the bpf_[l3,l4]_csum_replace helpers.

Due to the fact that the UDP checksum also covers the data itself, it is necessary to 
update it. I tried to use the bpf_csum_diff together with the bpf_l4_csum_replace 
helpers, however I found that is not possible to pass a variable size to the 
bpf_csum_diff, as it is marked as ARG_CONST_STACK_SIZE_OR_ZERO in the function 
prototype. This limitation makes it impossible to update the checksum on different 
packets size as the quantity of bytes to be trimmed depends on the length of the packet.

The code is available at [1] (please note that it is based on hover). The error 
that I get when I tried to load the program is:

31: (61) r9 = *(u32 *)(r6 +48)
32: (61) r2 = *(u32 *)(r6 +0)
33: (07) r2 += -100
34: (67) r2 <<= 32
35: (77) r2 >>= 32
36: (b7) r8 = 0
37: (b7) r3 = 0
38: (b7) r4 = 0
39: (b7) r5 = 0
40: (85) call 28
R2 type=inv expected=imm

My questions are.

1. Am I missing another way to update the UDP checksum?

2. In case 1 is false, how could it be implemented? I think having a function 
to trim the packet without having a way to update the checksum is, in some way, 
useless.


What's your specific use-case on this? Reason why I'm asking is that
the helper needs to linearize skb and is thus only suitable for slow
path [1].

We are implementing a DHCP server, DHCP messages are quite unusual so slow path 
helpers are not a problem in this case.

F.e. we use it for rewriting an skb as time exceeded reply [2].
How far from the header start are you trimming? Perhaps it's feasible to
to calc the csum up to the point of trimming via bpf_csum_diff()?

Thanks to it I realized that I was doing it the difficult way: I was trying to 
"remove" the trimmed bytes from the checksum.

I am now trying to implement it in a way similar to [2], calculating the checksum 
from scratch.
[...]
 /* calculate checksum of the new udp packet */
 u32 src_ip =  bpf_htonl(ip->src);
 u32 dst_ip =  bpf_htonl(ip->dst);
 u32 len_udp_n = ((u32)ip->nextp<<8) | (bpf_htons(ip->tlen)<<16);


On just a very quick glance, don't you need UDP length here (aka UDP header
and data)?


 int csum = bpf_csum_diff(NULL, 0, data + sizeof(struct ethernet_t) + sizeof(struct ip_t), udp_len, 0);
 csum = bpf_csum_diff(NULL, 0, &src_ip, 4, csum);
 csum = bpf_csum_diff(NULL, 0, &dst_ip, 4, csum);
 csum = bpf_csum_diff(NULL, 0, &len_udp_n, 4, csum);
 bpf_trace_printk("[dhcp] csum: %x\n", csum);
 err = bpf_l4_csum_replace(skb, UDP_CRC_OFF, 0, csum, BPF_F_PSEUDO_HDR);
[...]
The full code of this test is available at [1].

I am calculating the csum of the UDP packet itself; for that reason I skip the 
Ethernet and IP headers on the first call to bpf_csum_diff. Then, since the UDP 
checksum includes an IPv4 pseudo-header [2], I update the csum to include those 
fields.

This code compiles and passes verification; however, it is not calculating the 
checksum correctly. According to Wireshark there is always a difference of 
0x14 between the calculated and the expected checksums.

I am pretty sure that I am missing something about checksums, but I cannot 
figure it out.
Can you spot some mistake on that code?

Thank you very much,
Mauricio V.

[1] https://gist.github.com/mvbpolito/9697247eba1c412cd4a6c8efe6fab55d
[2] https://www.ietf.org/rfc/rfc768.txt


Did you check whether [3] could help in that?

  [1] 
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=5293efe62df81908f2e90c9820c7edcc8e61f5e9
  [2] https://github.com/cilium/cilium/blob/master/bpf/lib/icmp6.h
  [3] 
https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=06c1c049721a995dee2829ad13b24aaf5d7c5cce


Thanks in advance.

Mauricio V.

[1] https://gist.github.com/mvbpolito/77edb3b4e65cc03c92862b3f17452286

___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev

Re: [iovisor-dev] How to update the csum of a trimmed UDP packet?

2017-01-10 Thread Daniel Borkmann via iovisor-dev


Re: [iovisor-dev] Direct packet access validation across tail calls

2016-07-04 Thread Daniel Borkmann via iovisor-dev

On 07/02/2016 11:29 PM, Thomas Graf wrote:

Hi

When using direct packet access I noticed that the verifier cannot
carry the packet length validation check across tail calls. This is
mainly a burden for L4, where L3 may require some more expensive logic
to handle variable-length headers.


Yeah, checks cannot be carried over on two occasions: i) when calling helpers
that change skb->data (and therefore prior checks become invalid), and
ii) tail calls. For tail calls the verifier doesn't know how such programs
will be used or shared (e.g. they could be part of one or multiple tail
call maps but at the same time attached somewhere directly), so they strictly
need to be validated and assumed to be stand-alone at that point in time.

So, I think by carrying checks across tail calls you mean that you
store some non-constant start offset/immediate, e.g. into skb->cb[0], from
the program that executes the tail call, and would like to do the 'skb->data +
skb->cb[0] < skb->data_end' check once in the new program to give the verifier
some context? Since it is non-constant, this would currently fail with "cannot
add integer value with 0 upper zero bits to ptr_to_packet", since the
unknown value from cb[0] has no tracking (imm==0) of upper zero bits (which
prevents/checks for overflows wrt skb->data when we do skb->data +
skb->cb[0]). Meaning we cannot safely perform such a test against skb->data_end
before doing anything further.

So I think this might then require either some context field with semantics
similar to skb->data that you could set safely (since it can be verified)
via a new helper function from the main program that does the tail call, or
you could perhaps advance skb->data from ctx to a safe offset, so the tail-called
program thinks it's the original skb->data; but this would obviously disallow
access before skb->data. But then you also need to redo the checks
when calling helpers that change the skb's data, since we can potentially call
into pskb_expand_head() and thus pointers to skb->data/data_end can change.
You need the r+w case, right?


Any objections to adding this?

Thomas


