[PATCH v2 net] bpf: add bpf_sk_netns_id() helper

2017-02-03 Thread Alexei Starovoitov
in cases where bpf programs are looking at sockets and packets
that belong to different netns, it could be useful to get an id
that uniquely identifies a netns within the whole system.

Therefore introduce a 'u64 bpf_sk_netns_id(sk);' helper. It returns
a unique value that identifies the netns of the given socket or
dev_net(skb->dev). The upper 32-bits of the return value contain
the device id where the namespace filesystem resides and the lower
32-bits contain the inode number within that filesystem.
It's the same as
 struct stat st;
 stat("/proc/pid/ns/net", &st);
 return (st.st_dev << 32) | st.st_ino;

For example, to disallow raw sockets in all non-init netns,
a bpf_type_cgroup_sock program can do:
if (sk->type == SOCK_RAW && bpf_sk_netns_id(sk) != 0x3f075)
  return 0;
where 0x3f075 comes from the combination of st_dev and st_ino
of /proc/pid/ns/net.
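
As a reference, here is a minimal userspace sketch (not part of the patch)
of how a loader could compute that constant at runtime instead of
hardcoding it, using the same (st_dev << 32) | st_ino encoding:

  #include <stdint.h>
  #include <sys/types.h>
  #include <sys/stat.h>

  /* hypothetical helper: id of the netns the calling process is in */
  static uint64_t netns_id_of_self(void)
  {
          struct stat st;

          if (stat("/proc/self/ns/net", &st))
                  return 0;
          return ((uint64_t)st.st_dev << 32) | st.st_ino;
  }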

Note that all bpf program types are global. The same socket filter
program can be attached to sockets in different netns,
just like cls_bpf can see ingress/egress packets of multiple
net_devices in different netns. The cgroup_bpf programs are
the most exposed to sockets and devices across netns,
but the need to identify a netns applies to all.
For example, if bpf_type_cgroup_skb didn't exist, a system-wide
monitoring daemon could have used the LD_PRELOAD mechanism and
attached the same program to see traffic from applications
across netns. Therefore make the bpf_sk_netns_id() helper available
to all network related bpf program types.
For socket, cls_bpf and cgroup_skb programs this helper
can be considered a new feature, whereas for cgroup_sock
programs that modify sk->bound_dev_if (like 'ip vrf' does)
it's a bug fix, since 'ip vrf' needs to be netns aware.

Signed-off-by: Alexei Starovoitov <a...@kernel.org>
---
Eric, I've added proc_get_ns_devid_inum() to nsfs.c
right next to __ns_get_path(), so when it is time in the future
to make nsfs more namespace aware, it will be easy to adjust
both the new_inode_pseudo(mnt->mnt_sb) line and proc_get_ns_devid_inum().
I thought about using ns->stashed, but it's obviously a transient
inode and not usable. If later we decide to store a dev_t in ns_common
that will be fine as well. We'll just change proc_get_ns_devid_inum()
without affecting user space.
---
 fs/nsfs.c |  7 +++
 include/linux/proc_ns.h   |  3 ++-
 include/uapi/linux/bpf.h  | 14 +-
 net/core/filter.c | 44 ++-
 samples/bpf/bpf_helpers.h |  2 ++
 samples/bpf/sock_flags_kern.c |  2 ++
 samples/bpf/sockex1_kern.c|  2 ++
 7 files changed, 71 insertions(+), 3 deletions(-)

diff --git a/fs/nsfs.c b/fs/nsfs.c
index 8c9fb29c6673..1a604bccef86 100644
--- a/fs/nsfs.c
+++ b/fs/nsfs.c
@@ -49,6 +49,13 @@ static void nsfs_evict(struct inode *inode)
ns->ops->put(ns);
 }
 
+u64 proc_get_ns_devid_inum(struct ns_common *ns)
+{
+   u64 dev = new_encode_dev(nsfs_mnt->mnt_sb->s_dev);
+
+   return (dev << 32) | ns->inum;
+}
+
 static void *__ns_get_path(struct path *path, struct ns_common *ns)
 {
struct vfsmount *mnt = nsfs_mnt;
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 12cb8bd81d2d..b567b021e652 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -48,7 +48,7 @@ extern int pid_ns_prepare_proc(struct pid_namespace *ns);
 extern void pid_ns_release_proc(struct pid_namespace *ns);
 extern int proc_alloc_inum(unsigned int *pino);
 extern void proc_free_inum(unsigned int inum);
-
+extern u64 proc_get_ns_devid_inum(struct ns_common *ns);
 #else /* CONFIG_PROC_FS */
 
 static inline int pid_ns_prepare_proc(struct pid_namespace *ns) { return 0; }
@@ -61,6 +61,7 @@ static inline int proc_alloc_inum(unsigned int *inum)
 }
 static inline void proc_free_inum(unsigned int inum) {}
 
+static u64 proc_get_ns_devid_inum(struct ns_common *ns) { return 0; }
 #endif /* CONFIG_PROC_FS */
 
 static inline int ns_alloc_inum(struct ns_common *ns)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0eb0e87dbe9f..e5b8cf16cbaf 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -430,6 +430,17 @@ union bpf_attr {
  * @xdp_md: pointer to xdp_md
  * @delta: An positive/negative integer to be added to xdp_md.data
  * Return: 0 on success or negative on error
+ *
+ * u64 bpf_sk_netns_id(sk)
+ * Returns unique value that identifies netns of given socket or skb.
+ * The upper 32-bits of the return value contain device id where namespace
+ * filesystem resides and lower 32-bits contain inode number within
+ * that filesystem. It's the same value as:
+ *  struct stat st;
+ *  stat("/proc/pid/ns/net", &st);
+ *  return (st.st_dev << 32) | st.st_ino;
+ * @sk: pointer to struct sock or struct __sk_buff
+ * Return: filesystem's device id | netns inode
  */
 #define __BPF_FUNC_MAPPER(FN)   

Re: [PATCH net] bpf: expose netns inode to bpf programs

2017-02-02 Thread Alexei Starovoitov
On Fri, Feb 03, 2017 at 05:33:45PM +1300, Eric W. Biederman wrote:
> 
> The point is that we can make the inode number stable across migration
> and the user space API for namespaces has been designed with that
> possibility in mind.
> 
> What you have proposed is the equivalent of reporting a file name, and
> instead of reporting /dir1/file1 /dir2/file1 just reporting file1 for
> both cases.
> 
> That is problematic.
> 
> It doesn't matter that eBPF and CRIU do not mix.  When we implement
> migration of the namespace file descriptors and can move them from
> one system to another preserving the device number and inode number
> so that criu of other parts of userspace can function better there will
> be a problem.  There is not one unique inode number per namespace and
> the proposed interface in your eBPF programs is broken.
> 
> I don't know when inode numbers are going to be the bottleneck we decide
> to make migratable to make CRIU work better but things have been
> designed and maintained very carefully so that we can do that.
> 
> Inode numbers are in the namespace of the filesystem they reside in.

I saw that iproute2 is doing:
  if ((st.st_dev == netst.st_dev) &&
  (st.st_ino == netst.st_ino)) {
but proc_alloc_inum() is using a global ida,
so I figured that the iproute2 extra st_dev check must have been obsolete.
So the long term plan is to make /proc namespace-aware?
That's fair. In that case exposing the inode only would
lead to wrong assumptions.

> >> But you told Eric that his nack doesn't matter, and maybe it would be
> >> nice to ask him to clarify instead.
> >
> > Fair enough. Eric, thoughts?
> 
> In very short terms exporting just the inode number would require
> implementing a namespace of namespaces, and that is NOT happening.
> We are not going to design our kernel interfaces so badly that we need
> to do that.
> 
> At a bare minimum you need to export the device number of the filesystem
> as well as the inode number.

Agree. Will do.

> My expectation would be that now you are starting to look at concepts
> that are namespaced the way you would proceed would be to associate a
> full set of namespaces with your ebpf program.  Those namespaces would
> come from the submitter of your ebpf program.  Namespaced values
> would be in the terms of your associated namespaces.
> 
> That keeps things working the way userspace would expect.
> 
> The easy way to build such an association is to not allow your
> contextless ebpf programs from being submitted to kernel in anything
> other than the initial set of namespaces.
> 
> But please assume all global identifiers are namespaced.  If they aren't
> that needs to be fixed because not having them namespaced will break
> process migration at some point.
> 
> In short the fix here is to export both the inode number the device
> number.  That is what it takes to uniquely identify a file.  It would be

Agree. Will respin.

> good if you went farther and limited your contextless ebpf programs to
> only being installed by programs in the initial set of namespaces.

you mean to limit to init_net only? This might break existing users.

> Does that make things clearer?

yep. thanks for the feedback.



[PATCH v3 net] bpf: add bpf_sk_netns_id() helper

2017-02-03 Thread Alexei Starovoitov
in cases where bpf programs are looking at sockets and packets
that belong to different netns, it could be useful to get an id
that uniquely identifies a netns within the whole system.

Therefore introduce a 'u64 bpf_sk_netns_id(sk);' helper. It returns
a unique value that identifies the netns of the given socket or
dev_net(skb->dev). The upper 32-bits of the return value contain
the device id where the namespace filesystem resides and the lower
32-bits contain the inode number within that filesystem.
It's the same as
 struct stat st;
 stat("/proc/pid/ns/net", &st);
 return (st.st_dev << 32) | st.st_ino;

For example, to disallow raw sockets in all non-init netns,
a bpf_type_cgroup_sock program can do:
if (sk->type == SOCK_RAW && bpf_sk_netns_id(sk) != 0x3f075)
  return 0;
where 0x3f075 comes from the combination of st_dev and st_ino
of /proc/pid/ns/net.

Note that all bpf program types are global. The same socket filter
program can be attached to sockets in different netns,
just like cls_bpf can see ingress/egress packets of multiple
net_devices in different netns. The cgroup_bpf programs are
the most exposed to sockets and devices across netns,
but the need to identify a netns applies to all.
For example, if bpf_type_cgroup_skb didn't exist, a system-wide
monitoring daemon could have used the LD_PRELOAD mechanism and
attached the same program to see traffic from applications
across netns. Therefore make the bpf_sk_netns_id() helper available
to all network related bpf program types.
For socket, cls_bpf and cgroup_skb programs this helper
can be considered a new feature, whereas for cgroup_sock
programs that modify sk->bound_dev_if (like 'ip vrf' does)
it's a bug fix, since 'ip vrf' needs to be netns aware.

Signed-off-by: Alexei Starovoitov <a...@kernel.org>
---
Eric, I've added proc_get_ns_devid_inum() to nsfs.c
right next to __ns_get_path(), so when it is time in the future
to make nsfs more namespace aware, it will be easy to adjust
both the new_inode_pseudo(mnt->mnt_sb) line and proc_get_ns_devid_inum().
I thought about using ns->stashed, but it's obviously a transient
inode and not usable. If later we decide to store a dev_t in ns_common
that will be fine as well. We'll just change proc_get_ns_devid_inum()
without affecting user space.

v2->v3: build bot complained. s/static/static inline/. no other changes.
---
 fs/nsfs.c |  7 +++
 include/linux/proc_ns.h   |  3 ++-
 include/uapi/linux/bpf.h  | 14 +-
 net/core/filter.c | 44 ++-
 samples/bpf/bpf_helpers.h |  2 ++
 samples/bpf/sock_flags_kern.c |  2 ++
 samples/bpf/sockex1_kern.c|  2 ++
 7 files changed, 71 insertions(+), 3 deletions(-)

diff --git a/fs/nsfs.c b/fs/nsfs.c
index 8c9fb29c6673..1a604bccef86 100644
--- a/fs/nsfs.c
+++ b/fs/nsfs.c
@@ -49,6 +49,13 @@ static void nsfs_evict(struct inode *inode)
ns->ops->put(ns);
 }
 
+u64 proc_get_ns_devid_inum(struct ns_common *ns)
+{
+   u64 dev = new_encode_dev(nsfs_mnt->mnt_sb->s_dev);
+
+   return (dev << 32) | ns->inum;
+}
+
 static void *__ns_get_path(struct path *path, struct ns_common *ns)
 {
struct vfsmount *mnt = nsfs_mnt;
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 12cb8bd81d2d..531c16105198 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -48,7 +48,7 @@ extern int pid_ns_prepare_proc(struct pid_namespace *ns);
 extern void pid_ns_release_proc(struct pid_namespace *ns);
 extern int proc_alloc_inum(unsigned int *pino);
 extern void proc_free_inum(unsigned int inum);
-
+extern u64 proc_get_ns_devid_inum(struct ns_common *ns);
 #else /* CONFIG_PROC_FS */
 
 static inline int pid_ns_prepare_proc(struct pid_namespace *ns) { return 0; }
@@ -61,6 +61,7 @@ static inline int proc_alloc_inum(unsigned int *inum)
 }
 static inline void proc_free_inum(unsigned int inum) {}
 
+static inline u64 proc_get_ns_devid_inum(struct ns_common *ns) { return 0; }
 #endif /* CONFIG_PROC_FS */
 
 static inline int ns_alloc_inum(struct ns_common *ns)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0eb0e87dbe9f..e5b8cf16cbaf 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -430,6 +430,17 @@ union bpf_attr {
  * @xdp_md: pointer to xdp_md
  * @delta: An positive/negative integer to be added to xdp_md.data
  * Return: 0 on success or negative on error
+ *
+ * u64 bpf_sk_netns_id(sk)
+ * Returns unique value that identifies netns of given socket or skb.
+ * The upper 32-bits of the return value contain device id where namespace
+ * filesystem resides and lower 32-bits contain inode number within
+ * that filesystem. It's the same value as:
+ *  struct stat st;
+ *  stat("/proc/pid/ns/net", &st);
+ *  return (st.st_dev << 32) | st.st_ino;
+ * @sk: pointer to struct sock or struct __sk_buff
+ * Return: file

Re: [PATCH net-next 00/10] bnxt_en: Add XDP support.

2017-01-31 Thread Alexei Starovoitov
On Tue, Jan 31, 2017 at 9:33 PM, Andy Gospodarek <a...@greyhouse.net> wrote:
> On Tue, Jan 31, 2017 at 10:36 AM, Andy Gospodarek <a...@greyhouse.net> wrote:
>> On Mon, Jan 30, 2017 at 08:47:47PM -0800, Alexei Starovoitov wrote:
>>> On Mon, Jan 30, 2017 at 08:49:25PM -0500, Michael Chan wrote:
>>> > The first 8 patches refactor the code (rx/tx code paths and ring logic)
>>> > and add the basic infrastructure to support XDP.  The 9th patch adds
>>> > basic ndo_xdp to support XDP_DROP and XDP_PASS only.  The 10th patch
>>> > completes the series with XDP_TX.
>>>
>>> Looks great.
>>> Could you please share performance numbers ?
>>
>> I'll post some later today.
>
> I finally got my system moved around to what I'd hoped would be the
> right spot in my lab, but the system used for generating the traffic
> was only able to send 6Mpps with pktgen, so it was not a great test.
>
> My receiving system with i7-6700 CPU @ 3.40GHz seemed to have no issue
> handling this 6Mpps load -- mpstat showed only one core was ~25%
> utilitzed with all of that servicing softirqs.  The rest of the cores
> were 100% idle.
>
> I'm going to search for other traffic generation tools/systems to make
> sure I can get at least line-rate for the 10GbE cards I was using.

hmm. last time I tried pktgen on bnx2x it was easily doing 14Mpps with burst on.
Have you been using samples/pktgen/pktgen_sample03_burst_single_flow.sh ?
Waiting for this set to land to start benchmarking on bnxt.
So having a baseline will certainly help :)

Thanks!


Re: [PATCH net-next 00/10] bnxt_en: Add XDP support.

2017-01-31 Thread Alexei Starovoitov
On Mon, Jan 30, 2017 at 11:38 PM, Michael Chan wrote:
>
> I need to first figure out what xdp_adjust_head means.  If it is ok,
> I'd like to defer it.

I'd prefer if it's done asap.

mlx4 support added in
commit ea3349a03519 ("mlx4: xdp: Reserve headroom for receiving packet
when XDP prog is active")

mlx5 support added in
commit d8bec2b29a82 ("net/mlx5e: Support bpf_xdp_adjust_head()")

both were relatively small changes to the driver.


[PATCH v4 net] bpf: add bpf_sk_netns_id() helper

2017-02-06 Thread Alexei Starovoitov
in cases where bpf programs are looking at sockets and packets
that belong to different netns, it could be useful to get an id
that uniquely identifies a netns within the whole system.

Therefore introduce a 'u64 bpf_sk_netns_id(sk);' helper. It returns
a unique value that identifies the netns of the given socket or
dev_net(skb->dev). The upper 32-bits of the return value contain
the device id where the namespace filesystem resides and the lower
32-bits contain the inode number within that filesystem.
It's the same as
 struct stat st;
 stat("/proc/pid/ns/net", &st);
 return (st.st_dev << 32) | st.st_ino;

For example, to disallow raw sockets in all non-init netns,
a bpf_type_cgroup_sock program can do:
if (sk->type == SOCK_RAW && bpf_sk_netns_id(sk) != 0x3f075)
  return 0;
where 0x3f075 comes from the combination of st_dev and st_ino
of /proc/pid/ns/net.
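
For illustration only, a samples/bpf-style sketch of such a cgroup_sock
program. It assumes the bpf_sk_netns_id() declaration this series adds to
samples/bpf/bpf_helpers.h, and INIT_NETNS_ID is a placeholder the loader
would fill in (e.g. from a stat() of /proc/pid/ns/net as described above):

  #include <uapi/linux/bpf.h>
  #include "bpf_helpers.h"

  #ifndef SOCK_RAW
  #define SOCK_RAW 3
  #endif

  /* placeholder; a real loader would patch in the value computed from
   * stat("/proc/pid/ns/net")
   */
  #define INIT_NETNS_ID 0x3f075ULL

  SEC("cgroup/sock")
  int deny_raw_outside_init_netns(struct bpf_sock *sk)
  {
          if (sk->type == SOCK_RAW && bpf_sk_netns_id(sk) != INIT_NETNS_ID)
                  return 0;       /* deny socket creation */
          return 1;               /* allow */
  }

  char _license[] SEC("license") = "GPL";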

Note that all bpf program types are global. The same socket filter
program can be attached to sockets in different netns,
just like cls_bpf can see ingress/egress packets of multiple
net_devices in different netns. The cgroup_bpf programs are
the most exposed to sockets and devices across netns,
but the need to identify a netns applies to all.
For example, if bpf_type_cgroup_skb didn't exist, a system-wide
monitoring daemon could have used the LD_PRELOAD mechanism and
attached the same program to see traffic from applications
across netns. Therefore make the bpf_sk_netns_id() helper available
to all network related bpf program types.
For socket, cls_bpf and cgroup_skb programs this helper
can be considered a new feature, whereas for cgroup_sock
programs that modify sk->bound_dev_if (like 'ip vrf' does)
it's a bug fix, since 'ip vrf' needs to be netns aware.

Signed-off-by: Alexei Starovoitov <a...@kernel.org>
Reviewed-by: David Ahern <d...@cumulusnetworks.com>
Tested-by: David Ahern <d...@cumulusnetworks.com>
---
Eric, I've added proc_get_ns_devid_inum() to nsfs.c
right next to __ns_get_path(), so when it is time in the future
to make nsfs more namespace aware, it will be easy to adjust
both the new_inode_pseudo(mnt->mnt_sb) line and proc_get_ns_devid_inum().
I thought about using ns->stashed, but it's obviously a transient
inode and not usable. If later we decide to store a dev_t in ns_common
that will be fine as well. We'll just change proc_get_ns_devid_inum()
without affecting user space.

Note that eBPF is 64-bit and sizeof(long)==8 in bpf programs,
so there is no need to mess with the old_encode_dev and old_valid_dev
checks. new_encode_dev() is enough.
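
For reference, new_encode_dev() as defined in include/linux/kdev_t.h packs
the whole dev_t into 32 bits, which is why the upper half of the u64 is
sufficient:

  static inline u32 new_encode_dev(dev_t dev)
  {
          unsigned major = MAJOR(dev);
          unsigned minor = MINOR(dev);

          return (minor & 0xff) | (major << 8) | ((minor & ~0xff) << 12);
  }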

v2->v3: build bot complained. s/static/static inline/. no other changes.
v3->v4: fixed fallthrough case. Thanks Daniel.
---
 fs/nsfs.c |  7 +++
 include/linux/proc_ns.h   |  3 ++-
 include/uapi/linux/bpf.h  | 14 +-
 net/core/filter.c | 45 ++-
 samples/bpf/bpf_helpers.h |  2 ++
 samples/bpf/sock_flags_kern.c |  2 ++
 samples/bpf/sockex1_kern.c|  2 ++
 7 files changed, 72 insertions(+), 3 deletions(-)

diff --git a/fs/nsfs.c b/fs/nsfs.c
index 8c9fb29c6673..1a604bccef86 100644
--- a/fs/nsfs.c
+++ b/fs/nsfs.c
@@ -49,6 +49,13 @@ static void nsfs_evict(struct inode *inode)
ns->ops->put(ns);
 }
 
+u64 proc_get_ns_devid_inum(struct ns_common *ns)
+{
+   u64 dev = new_encode_dev(nsfs_mnt->mnt_sb->s_dev);
+
+   return (dev << 32) | ns->inum;
+}
+
 static void *__ns_get_path(struct path *path, struct ns_common *ns)
 {
struct vfsmount *mnt = nsfs_mnt;
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 12cb8bd81d2d..531c16105198 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -48,7 +48,7 @@ extern int pid_ns_prepare_proc(struct pid_namespace *ns);
 extern void pid_ns_release_proc(struct pid_namespace *ns);
 extern int proc_alloc_inum(unsigned int *pino);
 extern void proc_free_inum(unsigned int inum);
-
+extern u64 proc_get_ns_devid_inum(struct ns_common *ns);
 #else /* CONFIG_PROC_FS */
 
 static inline int pid_ns_prepare_proc(struct pid_namespace *ns) { return 0; }
@@ -61,6 +61,7 @@ static inline int proc_alloc_inum(unsigned int *inum)
 }
 static inline void proc_free_inum(unsigned int inum) {}
 
+static inline u64 proc_get_ns_devid_inum(struct ns_common *ns) { return 0; }
 #endif /* CONFIG_PROC_FS */
 
 static inline int ns_alloc_inum(struct ns_common *ns)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0eb0e87dbe9f..e5b8cf16cbaf 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -430,6 +430,17 @@ union bpf_attr {
  * @xdp_md: pointer to xdp_md
  * @delta: An positive/negative integer to be added to xdp_md.data
  * Return: 0 on success or negative on error
+ *
+ * u64 bpf_sk_netns_id(sk)
+ * Returns unique value that identifies netns of given socket or skb.
+ * The upper 32-bits of the return value contain device id where namespace
+ * file

Re: [PATCH v2 net] bpf: add bpf_sk_netns_id() helper

2017-02-06 Thread Alexei Starovoitov
On Sat, Feb 04, 2017 at 08:17:57PM -0800, Andy Lutomirski wrote:
> On Sat, Feb 4, 2017 at 8:05 PM, Alexei Starovoitov
> <alexei.starovoi...@gmail.com> wrote:
> > On Sat, Feb 04, 2017 at 07:33:14PM -0800, Andy Lutomirski wrote:
> >> On Sat, Feb 4, 2017 at 7:25 PM, Alexei Starovoitov
> >> <alexei.starovoi...@gmail.com> wrote:
> >> > On Sat, Feb 04, 2017 at 09:15:10AM -0800, Andy Lutomirski wrote:
> >> >> On Fri, Feb 3, 2017 at 5:22 PM, Alexei Starovoitov <a...@fb.com> wrote:
> >> >> > Note that all bpf programs types are global.
> >> >>
> >> >> I don't think this has a clear enough meaning to work with.  In
> >> >
> >> > Please clarify what you mean. The quoted part says
> >> > "bpf programs are global". What is not "clear enough" there?
> >>
> >> What does "bpf programs are global" mean?  I am genuinely unable to
> >> figure out what you mean.  Here are some example guesses of what you
> >> might mean:
> >>
> >>  - BPF programs are compiled independently of a namespace.  This is
> >> certainly true, but I don't think it matters.
> >>
> >>  - You want BPF programs to affect everything on the system.  But this
> >> doesn't seem right to be -- they only affect things in the relevant
> >> cgroup, so they're not global in that sense.
> >
> > All bpf program types are global in the sense that you can
> > make all of them to operate across all possible scopes and namespaces.
> 
> I still don't understand what you mean here.  A seccomp program runs
> in the process that installs it and children -- it does not run in
> "all possible scopes". 

seccomp and classic bpf are different, since there is no concept
of the program there. cbpf is a stateless set of instructions
that belong to some entity like a seccomp filter or a socket. ebpf is
stateful and starts with the program, then the hook and then the scope.

> A socket filter runs on a single socket and
> therefore runs in a single netns.  So presumably I'm still
> misunderstanding you

in classic - yes. ebpf can have the same program attached to
multiple sockets in different netns.
For classic - the object is the socket and the user can only
manipulate that socket. For extended - the object is the program
and it can exist on its own. For example, the program can be pinned
in bpffs while it's not attached to any hook.
For classic bpf the ideas of CRIU naturally apply, since
it checkpoints the socket and it happens that the socket has
a set of stateless cbpf instructions within. So it's
expected to save/restore cbpf as part of socket save/restore.
In case of ebpf the program exists independently of the socket.
Doing save/restore of an ebpf program attached to a socket
is meaningless, since it could be pinned in bpffs, attached
to other sockets, have state in bpf maps, and some other process
might be actively talking to that program, and so on.
ebpf is a graph of interconnected pieces. To criu such a thing
one really needs to freeze the whole system, all of its processes,
and stop the kernel. I don't see criu ever being applied to ebpf.

> > cgroup only gives a scope for the program to run, but it's
> > not limited by it. The user can have the same program
> > attached to two or more different cgroups, so one program
> > will run across multiple cgroups.
> 
> Does this mean "BPF programs are compiled independently of a
> namespace?"  If so, I don't see why it's relevant at all.  Sure, you
> could compile a BPF program once and install it in two different
> scopes, but that doesn't mean that the kernel should *run* it globally
> in any sense.  Can you clarify?

if a single program is attached to different sockets, it exactly
means that the kernel should run it for different sockets :)

> >>  - The set of BPF program types and the verification rules are
> >> independent of cgroup and namespace.  This is true, but I don't think
> >> it matters.
> >
> > It matters. That's actually the key to understand. The loading part
> > verifies correctness for particular program type.
> > Afterwards the same program can be attached to any place.
> > Including different cgroups and different namespaces.
> > The 'attach' part is like 'switch on' that enables program
> > on particular hook. The scope (whether it's socket or netdev or cgroup)
> > is a scope that program author uses to narrow down the hook,
> > but it's not an ultimate restriction.
> > For example the socket program can be attached to sockets and
> > share information with cls_bpf program attached to netdev.
> > The kprobe tracing program can peek into kernel internal data
> > and sh

Re: [PATCH net] bpf: expose netns inode to bpf programs

2017-02-06 Thread Alexei Starovoitov
On Sat, Feb 04, 2017 at 09:05:29PM -0800, Andy Lutomirski wrote:
> 
> I'm not saying that at all.  I'm saying that this use case sounds
> valid, but maybe it could be solved differently.  Here are some ideas:

Great. Combining multiple threads. Replied in bpf_sk_netns_id thread.



Re: [PATCH v2 net] bpf: add bpf_sk_netns_id() helper

2017-02-06 Thread Alexei Starovoitov
On Mon, Feb 06, 2017 at 06:57:45PM -0800, Andy Lutomirski wrote:
> On Mon, Feb 6, 2017 at 5:42 PM, Alexei Starovoitov
> <alexei.starovoi...@gmail.com> wrote:
> > On Sat, Feb 04, 2017 at 08:17:57PM -0800, Andy Lutomirski wrote:
> >> On Sat, Feb 4, 2017 at 8:05 PM, Alexei Starovoitov
> >> <alexei.starovoi...@gmail.com> wrote:
> >> > On Sat, Feb 04, 2017 at 07:33:14PM -0800, Andy Lutomirski wrote:
> >> >> What does "bpf programs are global" mean?  I am genuinely unable to
> >> >> figure out what you mean.  Here are some example guesses of what you
> >> >> might mean:
> >> >>
> >> >>  - BPF programs are compiled independently of a namespace.  This is
> >> >> certainly true, but I don't think it matters.
> >> >>
> >> >>  - You want BPF programs to affect everything on the system.  But this
> >> >> doesn't seem right to be -- they only affect things in the relevant
> >> >> cgroup, so they're not global in that sense.
> >> >
> >> > All bpf program types are global in the sense that you can
> >> > make all of them to operate across all possible scopes and namespaces.
> >>
> >> I still don't understand what you mean here.  A seccomp program runs
> >> in the process that installs it and children -- it does not run in
> >> "all possible scopes".
> >
> > seccomp and classic bpf is different, since there is no concept
> > of the program there. cbpf is a stateless set of instructions
> > that belong to some entity like seccomp or socket. ebpf is stateful
> > and starts with the program, then hook and then scope.
> 
> So... are you saying that a loaded eBPF object is global in the sense
> that if you attach the same object to more than one thing (two
> sockets, say), the *same* program runs and shares its state?  If so, I
> agree, but that's still not an argument that the *same* attachment of
> an eBPF program to a cgroup should run in multiple network namespaces.
> You could also attach the (same) program once per netns and its state
> would be shared.
> 
> I'm pretty sure I've never suggested that an eBPF program be bound to
> a namespace.  I just think that a lot of things get more
> straightforward if an *attachment* of an eBPF program to a cgroup is
> bound to a single namespace.

Thank you for this whole discussion over the last few months.
Frankly, in the beginning I wasn't 100% sure about the api we picked;
now I'm completely convinced that we absolutely made the right choice.
It's clean and keeps scoping constructs clear and explicit for the users.
So I am proposing that in the future we should add the ability to
scope bpf programs by netns, just like the current api scopes them
by cgroup. The attachment must be explicit.
The current api attaches a type_cgroup_* program to a hook and scopes it
by a given cgroup. At that time some apps within that cgroup may
already run in different netns and the attaching process may be
in yet another netns. There is no way to have sane semantics
without explicitly specifying the scope. Currently we do it
by explicitly specifying the cgroup. In the future we need
to extend it by specifying a netns (without a cgroup). Then
for container technologies that are based on netns we'll
have an efficient way of scoping programs to a given netns.
And when afnetns is finally ready, the same scoping approach
will work for afnetns as well. For cases that need to
have two scopes at the same time (like cgroup and netns)
the bpf_sk_netns_id() helper will work one way and
some future bpf_sk_cgroup_id() helper will work from
the other side.
So far in multi-scope cases one dimension is dominating.
The number of cgroups is typically way larger than the number of netns,
so an explicit bpf_sk_netns_id() from inside the programs
is faster than doing the same in the kernel.
And if in the future there is a case with a lot of
cgroups and a lot of netns at the same time, we'll extend
the api further to specify two scopes in the bpf_prog_attach command.
The kernel side will probably need a hashtable to look up
a bpf prog based on a (cgroup, netns) pair of pointers.

> >> A socket filter runs on a single socket and
> >> therefore runs in a single netns.  So presumably I'm still
> >> misunderstanding you
> >
> > in classic - yes. ebpf can have the same program attached to
> > multiple sockets in different netns.
> > For classic - the object is the socket and the user can only
> > manipulate that socket. For extended - the object is the program
> > and it can exist on its own. Like the program can be pinned in bpffs
> > while it's not attached to any hook.
> > For classic bpf the ideas of CRIU naturally apply, si

Re: [PATCH net] bpf: expose netns inode to bpf programs

2017-02-04 Thread Alexei Starovoitov
On Sat, Feb 04, 2017 at 07:27:01PM -0800, Andy Lutomirski wrote:
> On Sat, Feb 4, 2017 at 7:10 PM, Alexei Starovoitov
> <alexei.starovoi...@gmail.com> wrote:
> > On Sat, Feb 04, 2017 at 09:07:19AM -0800, Andy Lutomirski wrote:
> >> >> can see a namespaced view of the world.  For this to work, presumably
> >> >> we need to make sure that eBPF programs that are installed by programs
> >> >> that are in a container don't see traffic that isn't in that
> >> >> container.
> >> >
> >> > such approach will break existing users.
> >>
> >> What existing users?  The whole feature has never been in a released 
> >> kernel.
> >
> > the v3 patch commit log explains the use cases that work across netns.
> > And they would break if netns boundary is suddenly started to be
> > enforced for networking program types.
> 
> This is why these issues should be addressed carefully before this
> feature is stabilized in 4.10, and I'm increasingly unconvinced that
> this will happen.  There may be a single week left, and the cgroup
> override issue is still entirely unaddressed, for example.

imo the cgroup override issue is not an issue and the api can be extended
at any time. I offered to change the default to disable override,
but I need override to be there, since it's the fastest and most
flexible way. So the sooner we can land the bpf_sk_netns_id() patch
the faster I can post the override disable.

> Also, the v3 commit log's example seems a bit weak, since seccomp can
> handle that case (block non-init-netns raw sockets) and it's not clear
> to me that a filter like that serves a purpose regardless besides
> kernel attack surface reduction.

seccomp doesn't even support extended bpf yet and that makes it
quite inflexible.

> > If you're suggesting to limit root netns for cgroup_sock program type only
> > then David explained weeks ago why it's not acceptable for vrf use
> > case which was the main reason to add that program type in the first place.
> > Also it would make one specific bpf program type to be different
> > from all others which will be confusing.
> >
> 
> I'm suggesting limiting it to the init netns for *all* cases.
> (Specifically, limiting installing programs to the init netns.  But
> perhaps limiting running of programs to the netns in which they're
> installed would be the best solution after all.)  If there isn't a
> well-understood solution that plays well with netns and that satisfies
> ip vrf, the it's worth seriously thinking about whether ip vrf is
> really ready for 4.10.

The proposed v3 imo solves the ip vrf 'malfunction' that you found.
v1 was also solving it and the iproute2 patch was ready, but as Eric pointed
out, just ns.inum is not enough.
The 'malfunction' can also be addressed _without_ kernel changes,
but the kernel can make it easier, hence the v3 patch.

Andy, can you please reply in one thread? In particular in the v3 patch thread?
I'm not sure why you are making more or less the same points in
different threads. I'm sure it's hard for everyone to follow.



Re: [PATCH net] bpf: expose netns inode to bpf programs

2017-02-04 Thread Alexei Starovoitov
On Sat, Feb 04, 2017 at 07:22:03PM -0800, Andy Lutomirski wrote:
> On Sat, Feb 4, 2017 at 7:18 PM, Alexei Starovoitov
> <alexei.starovoi...@gmail.com> wrote:
> > On Sat, Feb 04, 2017 at 09:08:38AM -0800, Andy Lutomirski wrote:
> >> > So use-case would be that someone wants to attach the very same
> >> > prog via tc to various netdevs sitting in different netns, and
> >> > that prog looks up a map, controlled by initns, with skb->netns_inum
> >> > as key and the resulting value could contain allowed feature bits
> >> > for that specific netns prog the skbs goes through? That would be
> >> > a feature, not "concern", no? At the same time, it's up to the
> >> > user or mgmt app what gets loaded so f.e. it might just as well
> >> > tailor/optimize the progs individually for the devs sitting in
> >> > netns-es to avoid such map lookup.
> >>
> >> Agreed.  I don't see why you would install the exact same program on
> >> two sockets in different netnses if the program contains, say, an
> >> ifindex.  Why not just install a variant with the right ifindex into
> >> each socket?
> >
> > In other cases people prefer to have one program compiled once
> > and thoroughly tested offline, since some folks are worried
> > that on-the-fly compilation may cause generated code to
> > be rejected by the verifier.
> 
> I would be rather surprised if someone wrote a program that did had an
> expression like (ifindex == 17), tested it offline, and refused to
> update the 17 based on the actual ifindex in use.

All programs have bugs. What the bpf subsystem is trying to do is
to be flexible and satisfy as many use cases as possible.
Boxing users into "one way to do things" isn't a successful
strategy in my opinion. perl vs python, if you like :)
bpf is perl-like. We don't enforce spaces vs tabs :)



Re: [PATCH net] bpf: expose netns inode to bpf programs

2017-02-04 Thread Alexei Starovoitov
On Sat, Feb 04, 2017 at 07:54:20PM -0800, Andy Lutomirski wrote:
>
> I've repeatedly asked how you plan to make a "don't override" flag
> have sensible semantics when someone tries to add a new flag or change
> the behavior to "don't override but, rather then rejecting programs
> down the hierarchy, just run them all".  You haven't answered that
> question.

I explained already that I need to do the combining on the bpf side instead
of running a list, since running several programs where 90% of
the logic is the same is overhead that is not acceptable
for a production server. It may be fine for a desktop, but not
when every cycle matters. With one program per cgroup I can
combine multiple of them on the bpf side. In networking most of
the prep work, like packet parsing and lookups, is common,
but the actions are different, so for one part of the hierarchy
I can have program A monitoring tcp and in another
part I can have program B monitoring tcp and udp.
What you're saying is that for tcp and udp monitoring I should
run two programs: one for udp and one for tcp.
That is not efficient. Hence the need to combine
the programs on the bpf side and attach only one with override.
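
As a sketch of the combining idea (names and the WANT_UDP knob are made up
for illustration; a map lookup or on-the-fly generation would serve the
same purpose):

  #include <uapi/linux/bpf.h>
  #include "bpf_helpers.h"

  SEC("cgroup/skb")
  int monitor(struct __sk_buff *skb)
  {
          /* common prep work shared by both parts of the hierarchy:
           * packet parsing, map lookups, tcp accounting (program A)
           */
  #ifdef WANT_UDP
          /* extra udp accounting compiled in only for the subtree
           * that wants it (what used to be program B)
           */
  #endif
          return 1;       /* let the packet through */
  }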

The "dont override flag" was also explained before. Here it is again:
Without breaking above "override" scenario we can add a flag
that when the program is attached with that flag to one part of
the hierarchy the override of it will not be allowed in the descendent.
This extension can be done at any time in the future.
The only question is what is the default when such flag
is not present. The current default is override allowed.
You insist on changing the default right now.
I don't mind, therefore I'm working on such "dont override flag",
since if we need to change the default it needs to be done now,
but if it doesn't happen for 4.10, it's absolutely fine too.
For security use cases the flag will be added later
and sandboxing use cases will use that flag too.
There are no expections that if cgroup is there the program
attach command must always succeed. That's why we have error codes
and we can dissallow attach based on this flag or any future
restrictions. All of it is root now anyway and sandboxing/security
use cases need to wait until bpf side can be made unprivileged.
I see current api to have a ton of room for future extensions.

> Given that we're doing API design and it's way past the merge
> window, I just don't see how any of this is ready for 4.10.

We are NOT doing design now. Design and implementation were
done back last summer. It took many iterations and lots
of discussions on public lists with the netdev and cgroup lists cc-ed.



Re: [PATHv3 net-next] bpf: enable verifier to add 0 to packet ptr

2017-02-04 Thread Alexei Starovoitov
On Sat, Feb 04, 2017 at 08:37:29AM -0800, William Tu wrote:
> The patch fixes the case when adding a zero value to the packet
> pointer.  The zero value could come from src_reg equals type
> BPF_K or CONST_IMM.  The patch fixes both, otherwise the verifer
> reports the following error:
>   [...]
> R0=imm0,min_value=0,max_value=0
> R1=pkt(id=0,off=0,r=4)
> R2=pkt_end R3=fp-12
> R4=imm4,min_value=4,max_value=4
> R5=pkt(id=0,off=4,r=4)
>   269: (bf) r2 = r0 // r2 becomes imm0
>   270: (77) r2 >>= 3
>   271: (bf) r4 = r1 // r4 becomes pkt ptr
>   272: (0f) r4 += r2// r4 += 0
>   addition of negative constant to packet pointer is not allowed
> 
> Signed-off-by: William Tu <u9012...@gmail.com>
> Signed-off-by: Mihai Budiu <mbu...@vmware.com>
> Cc: Daniel Borkmann <dan...@iogearbox.net>
> Cc: Alexei Starovoitov <a...@kernel.org>
> Acked-by: Daniel Borkmann <dan...@iogearbox.net>
...
> + "direct packet access: test14 (pkt_ptr += 0, CONST_IMM, good 
> access)",
> + .insns = {
> + BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1,
> + offsetof(struct __sk_buff, data)),
> + BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_1,
> + offsetof(struct __sk_buff, data_end)),
> + BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
> + BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 22),
> + BPF_JMP_REG(BPF_JGT, BPF_REG_0, BPF_REG_3, 7),
> + BPF_MOV64_IMM(BPF_REG_5, 12),
> + BPF_ALU64_IMM(BPF_RSH, BPF_REG_5, 4),
> + BPF_MOV64_REG(BPF_REG_6, BPF_REG_2),
> + BPF_ALU64_REG(BPF_ADD, BPF_REG_6, BPF_REG_5),
> + BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_6, 0),

Now looks great. Thanks!
Acked-by: Alexei Starovoitov <a...@kernel.org>



Re: [PATCH net] bpf: expose netns inode to bpf programs

2017-02-04 Thread Alexei Starovoitov
On Sat, Feb 04, 2017 at 09:07:19AM -0800, Andy Lutomirski wrote:
> >> can see a namespaced view of the world.  For this to work, presumably
> >> we need to make sure that eBPF programs that are installed by programs
> >> that are in a container don't see traffic that isn't in that
> >> container.
> >
> > such approach will break existing users.
> 
> What existing users?  The whole feature has never been in a released kernel.

the v3 patch commit log explains the use cases that work across netns.
And they would break if the netns boundary suddenly started to be
enforced for networking program types.
If you're suggesting to limit to the root netns for the cgroup_sock program
type only, then David explained weeks ago why it's not acceptable for the vrf
use case, which was the main reason to add that program type in the first place.
Also it would make one specific bpf program type behave differently
from all others, which would be confusing.



Re: [PATCH v3 net] bpf: add bpf_sk_netns_id() helper

2017-02-06 Thread Alexei Starovoitov

On 2/6/17 3:39 PM, Daniel Borkmann wrote:

On 02/04/2017 04:34 AM, Alexei Starovoitov wrote:
[...]

+BPF_CALL_1(bpf_skb_netns_id, struct sk_buff *, skb)
+{
+	struct net_device *dev = skb->dev;
+
+	if (!dev)
+		return 0;
+	return proc_get_ns_devid_inum(&dev_net(dev)->ns);
+}
+
+static const struct bpf_func_proto bpf_skb_netns_id_proto = {
+	.func		= bpf_skb_netns_id,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+};
+
  static const struct bpf_func_proto *
  sk_filter_func_proto(enum bpf_func_id func_id)
  {
@@ -2620,6 +2649,8 @@ sk_filter_func_proto(enum bpf_func_id func_id)
  case BPF_FUNC_trace_printk:
  if (capable(CAP_SYS_ADMIN))
  return bpf_get_trace_printk_proto();
+	case BPF_FUNC_sk_netns_id:
+		return &bpf_skb_netns_id_proto;
  default:
  return NULL;
  }


Btw, I think here's an oversight that would still need to be
fixed. Above would mean that trace printk from unprivileged would
fall through and use &bpf_skb_netns_id_proto as proto now instead
of NULL. So BPF_FUNC_sk_netns_id needs to be placed above the
BPF_FUNC_trace_printk case, not in its fall-through path. Looks
like Chenbo in his get_socket_cookie missed this, too. Other than
that BPF bits seem good to me.


Ahh, right. Good catch.
I'll add 'else return NULL;' otherwise somebody might step on it again.
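
i.e. roughly this arrangement (a sketch of the intended fix, not the
actual hunk):

  case BPF_FUNC_sk_netns_id:
          return &bpf_skb_netns_id_proto;
  case BPF_FUNC_trace_printk:
          if (capable(CAP_SYS_ADMIN))
                  return bpf_get_trace_printk_proto();
          else
                  return NULL;
  default:
          return NULL;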
Thanks Daniel!

Eric,
still waiting for your review of nsfs.c bits.



Re: [PATCH net-next v3 04/11] bpf: Use bpf_load_program() from the library

2017-02-07 Thread Alexei Starovoitov

On 2/7/17 1:44 PM, Mickaël Salaün wrote:

-   union bpf_attr attr;
+   union bpf_attr attr = {};

-   bzero(&attr, sizeof(attr));


I think somebody mentioned that there are compilers out there
that don't do it correctly, hence it was done with explicit bzero.
Arnaldo, Wang, do you remember the details?
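
For context, the difference being discussed is roughly this (a sketch;
whether every compiler zeroes the entire union for the brace form is
exactly the open question above):

  #include <strings.h>
  #include <linux/bpf.h>

  static void zero_attr_two_ways(void)
  {
          union bpf_attr a = {};        /* brace init: relies on the compiler
                                         * to zero the whole union, not only
                                         * the first member */
          union bpf_attr b;

          bzero(&b, sizeof(b));         /* explicit: every byte is zeroed,
                                         * which matters because the kernel
                                         * rejects non-zero bytes in unused
                                         * attr fields */
  }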


Re: [net-next PATCH 5/5] virtio_net: XDP support for adjust_head

2017-02-02 Thread Alexei Starovoitov
On Fri, Feb 03, 2017 at 05:42:54AM +0200, Michael S. Tsirkin wrote:
> On Thu, Feb 02, 2017 at 03:21:57PM -0800, John Fastabend wrote:
> > Add support for XDP adjust head by allocating a 256B header region
> > that XDP programs can grow into. This is only enabled when a XDP
> > program is loaded.
> > 
> > In order to ensure that we do not have to unwind queue headroom push
> > queue setup below bpf_prog_add. It reads better to do a prog ref
> > unwind vs another queue setup call.
> > 
> > At the moment this code must do a full reset to ensure old buffers
> > without headroom on program add or with headroom on program removal
> > are not used incorrectly in the datapath. Ideally we would only
> > have to disable/enable the RX queues being updated but there is no
> > API to do this at the moment in virtio so use the big hammer. In
> > practice it is likely not that big of a problem as this will only
> > happen when XDP is enabled/disabled changing programs does not
> > require the reset. There is some risk that the driver may either
> > have an allocation failure or for some reason fail to correctly
> > negotiate with the underlying backend in this case the driver will
> > be left uninitialized. I have not seen this ever happen on my test
> > systems and for what its worth this same failure case can occur
> > from probe and other contexts in virtio framework.
> > 
> > Signed-off-by: John Fastabend 
> 
> So I am currently testing what I consider a cleaner way to do this:
> 
> MERGEABLE_BUFFER_MIN_ALIGN_SHIFT by 1:
> -#define MERGEABLE_BUFFER_MIN_ALIGN_SHIFT ((PAGE_SHIFT + 1) / 2)
> +#define MERGEABLE_BUFFER_MIN_ALIGN_SHIFT (PAGE_SHIFT / 2 + 1)
> 
> and then we do not need a reset as we get a free bit to store presence
> of XDP buffer.

That does indeed sound cleaner. Do you have an ETA for this?
Do you mind if John's set gets applied for 4.11 while you're working
on a better version?
adjust_head support is a blocker for virtio+xdp adoption.
I think we need a compromise for 4.11.



Re: [net-next PATCH v2 0/5] XDP adjust head support for virtio

2017-02-02 Thread Alexei Starovoitov

On 2/2/17 7:14 PM, John Fastabend wrote:

This series adds adjust head support for virtio. The following is my
test setup. I use qemu + virtio as follows,

./x86_64-softmmu/qemu-system-x86_64 \
   -hda /var/lib/libvirt/images/Fedora-test0.img \
   -m 4096  -enable-kvm -smp 2 -netdev tap,id=hn0,queues=4,vhost=on \
   -device 
virtio-net-pci,netdev=hn0,mq=on,guest_tso4=off,guest_tso6=off,guest_ecn=off,guest_ufo=off,vectors=9

In order to use XDP with virtio until LRO is supported TSO must be
turned off in the host. The important fields in the above command line
are the following,

   guest_tso4=off,guest_tso6=off,guest_ecn=off,guest_ufo=off


thank you for sharing the command line!
Pretty hard to figure this out otherwise.
Too bad you dropped the patch that allows dynamically switching off LRO.
Long term I don't see why the guest shouldn't be allowed to turn that knob.


Also note it is possible to consume more queues than can be supported
because when XDP is enabled for retransmit XDP attempts to use a queue
per cpu. My standard queue count is 'queues=4'.

After loading the VM I run the relevant XDP test programs in,

   ./samples/bpf

For this series I tested xdp1, xdp2, and xdp_tx_iptunnel. I usually test
with iperf (-d option to get bidirectional traffic), ping, and pktgen.
I also have a modified xdp1 that returns XDP_PASS on any packet to ensure
the normal traffic path to the stack continues to work with XDP loaded.


same here.
xdp testing requires two physical machines with specific nics,
so it is hard to automate.
At least virtio+xdp gives us the ability to test the programs
automatically. So virtio+xdp will get the most test coverage and
all hw nics will be using it as a yardstick. Very important to
make it easy to use.

For bpf and generic xdp bits:
Acked-by: Alexei Starovoitov <a...@kernel.org>



Re: [PATCH net-next] bpf: fix verifier issue at check_packet_ptr_add

2017-02-02 Thread Alexei Starovoitov
On Thu, Feb 02, 2017 at 07:26:44PM -0800, William Tu wrote:
> Thanks. below is my program. The verifier fails at line 272, when
> writing to ICMP header.
> -
> ; ebpf_packetEnd = ((void*)(long)skb->data_end);
>  206:   r2 = *(u32 *)(r6 + 4)
> ; ebpf_packetStart = ((void*)(long)skb->data);
>  207:   r1 = *(u32 *)(r6 + 0)
> ...
> r10 is "struct hd" at local stack
> r1 is skb->data
> ...
> ; if (hd.icmp.ebpf_valid) {
>  261:   r4 = *(u64 *)(r10 - 40)
>  262:   if r4 == 0 goto 29
> ; if (ebpf_packetEnd < ebpf_packetStart + BYTES(ebpf_packetOffsetInBits + 
> 32)) {
>  263:   r4 = r0
>  264:   r4 += 32
>  265:   r4 >>= 3
>  266:   r5 = r1
>  267:   r5 += r4
>  268:   if r5 > r2 goto 23
> ; write_byte(ebpf_packetStart, BYTES(ebpf_packetOffsetInBits) + 0,
> (ebpf_byte) << 0);
>  269:   r2 = r0
>  270:   r2 >>= 3
>  271:   r4 = r1
>  272:   r4 += r2
>  273:   r2 = *(u64 *)(r10 - 80)
>  274:   *(u8 *)(r4 + 0) = r2
> 
> verifier log
> 
> from 208 to 260: R0=inv,min_value=0,max_value=0 R1=pkt(id=0,off=0,r=0)
> R2=pkt_end R3=fp-12 R6=ctx R7=imm0,min_value=0,max_value=0
> R8=inv,min_value=0,max_value=0 R9=inv R10=fp
> 260: (bf) r0 = r7
> 261: (79) r4 = *(u64 *)(r10 -40)
> 262: (15) if r4 == 0x0 goto pc+29
>  R0=imm0,min_value=0,max_value=0 R1=pkt(id=0,off=0,r=0) R2=pkt_end
> R3=fp-12 R4=inv R6=ctx R7=imm0,min_value=0,max_value=0
> R8=inv,min_value=0,max_value=0 R9=inv R10=fp
> 263: (bf) r4 = r0
> 264: (07) r4 += 32
> 265: (77) r4 >>= 3
> 266: (bf) r5 = r1
> 267: (0f) r5 += r4
> 268: (2d) if r5 > r2 goto pc+23
>  R0=imm0,min_value=0,max_value=0 R1=pkt(id=0,off=0,r=4) R2=pkt_end
> R3=fp-12 R4=imm4,min_value=4,max_value=4 R5=pkt(id=0,off=4,r=4) R6=ctx
> R7=imm0,min_value=0,max_value=0 R8=inv,min_value=0,max_value=0 R9=inv
> R10=fp
> 269: (bf) r2 = r0
> 270: (77) r2 >>= 3
> 271: (bf) r4 = r1
> 272: (0f) r4 += r2
> 273: (79) r2 = *(u64 *)(r10 -80)
> 274: (73) *(u8 *)(r4 +0) = r2
> 
> ---
> The full C source code, llvm-objdump, and verifier log.
> https://gist.github.com/williamtu/abaeb11563872508d47cfabed23ac9ea
> https://gist.github.com/williamtu/3e308a14c5795c82d516f934be0560cc
> https://gist.github.com/williamtu/1ae888cea019d065eab3057f74d38905#file-gistfile1-txt

thanks for sharing.
the C program looks auto-generated? Just curious what did you
use to do it?

The line 272 is r4 += r2
where R4=imm4 and R2=pkt_end
Right now the verifier doesn't accept any arithmetic with pkt_end,
since it expects the programs to be in the canonical form of
if (ptr > pkt_end)
  goto fail;

Even if we add it, I'm not sure what 'pkt_end + 4' is supposed to do.
It's a pointer after the valid packet range.
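
For example, a minimal cls_bpf-style sketch of that canonical pattern
(illustrative only):

  #include <uapi/linux/bpf.h>
  #include "bpf_helpers.h"

  SEC("classifier")
  int parse(struct __sk_buff *skb)
  {
          void *data     = (void *)(long)skb->data;
          void *data_end = (void *)(long)skb->data_end;
          unsigned char *p = data;

          /* bound the data-derived pointer against data_end ... */
          if (p + 4 > data_end)
                  return 0;
          /* ... then the access below is provably inside the packet */
          return p[3];
  }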

I'm most puzzled by the diff:
-   if (imm <= 0) {
+   if (imm < 0) {
how is it making the program pass the verifier?

PS
gentle reminder to avoid top posting.



Re: [PATCH net-next v1 7/7] bpf: Always test unprivileged programs

2017-02-06 Thread Alexei Starovoitov

On 2/5/17 3:14 PM, Mickaël Salaün wrote:

-   if (unpriv && test->prog_type)
-   continue;
+   if (!test->prog_type) {
+   if (!unpriv)
+   set_admin(false);
+   printf("#%d/u %s ", i, test->descr);
+   do_test_single(test, true, , );
+   if (!unpriv)
+   set_admin(true);
+   }

-   printf("#%d %s ", i, test->descr);
-   do_test_single(test, unpriv, , );
+   if (!unpriv) {
+   printf("#%d/p %s ", i, test->descr);
+   do_test_single(test, false, , );
+   }


great idea.
Acked-by: Alexei Starovoitov <a...@kernel.org>

As for the other patches, we need to figure out how to avoid conflicts
between net-next and Arnaldo's tree where Joe's patches went.

Mickael,
can you see some way of splitting the patch set between trees?
Like the above test_verifier.c improvement needs to go into net-next.
The rest can go via perf.



Re: [RFC PATCH net-next 1/2] bpf: Save original ebpf instructions

2017-02-06 Thread Alexei Starovoitov
On Mon, Feb 06, 2017 at 03:13:15PM +0100, Daniel Borkmann wrote:
> On 02/06/2017 11:56 AM, Quentin Monnet wrote:
> >2017-02-03 (15:28 -0700) ~ David Ahern 
> >>On 2/3/17 2:09 PM, Daniel Borkmann wrote:
> >>>On 02/03/2017 09:38 PM, David Ahern wrote:
> Similar to classic bpf, support saving original ebpf instructions
> 
> Signed-off-by: David Ahern 
> >>>
> >>>Not convinced that this is in the right direction, this not only 
> >>>*significantly*
> >>>increases mem footprint for each and every program, but also when you dump 
> >>>this,
> >>>then map references from relocs inside the insns are meaningless (f.e. 
> >>>what about
> >>>prog arrays used in tail calls?), so things like criu also won't be able 
> >>>to use
> >>>this kind of interface for dump and restore. If it's just for debugging, 
> >>>then
> >>>why not extend the existing tracing infrastructure around bpf that was 
> >>>started
> >>>with intention to gain more visibility.
> >>
> >>Yes, saving the original bpf increases the memory footprint. If you 
> >>noticed, a kmemdup is used for the exact instruction size (no page round 
> >>up). Right now programs are limited to a single page, so worst case  is an 
> >>extra page per program. I am open to other suggestions. For example, 
> >>bpf_prog is rounded up to a page which means there could be room at the end 
> >>of the page for the original instructions. This is definitely true for the 
> >>ip vrf programs which will be < 32 instructions even with the namespace 
> >>checking and the conversions done kernel side.
> >>
> >>Tracepoints will not solve the problem for me for a number of reasons. 
> >>Tracepoints have to be hit to return data, and there is no way the 
> >>tracepoint can return relevant information for me to verify that the 
> >>correct filter was downloaded. I want the original code. I want to audit 
> >>what was installed. In my case there could be N VRFs, and I want 'ip vrf' 
> >>or ifupdown2 or any other command to be able to verify that each cgroup has 
> >>the correct program, and to verify that the default VRF does *not* have a 
> >>program installed.
> >>
> >>Generically, the bpf code might contain relative data but that's for the 
> >>user or decoder program to deal with. Surely there is no harm in returning 
> >>the original, downloaded bpf code to a properly privileged process. If I am 
> >>debugging some weird network behavior, I want to be able to determine what 
> >>bpf code is running where and to see what it is doing to whatever degree 
> >>possible. Saving the original code is the first part of this.
> 
> Well, worst case cost would be ~8 additional pages per program that
> are very rarely used; assume you want to attach a prog per netdevice,
> or worse, one for ingress, one for egress ... and yet we already
> complain that netdevice itself is way too big and needs a diet ...
> That makes it much much worse, though. Using the remainder of the
> used page is thus not enough, I'm afraid.
> 
> But also then, what do you do with 4k insns (or multiple of these in
> case of tail calls), is there a decompiler in the works? Understandably
> that for vrf it might be trivial to just read disasm, but that's just
> one very specific use-case. I can see the point of dumping for the
> sake of CRIU use-case given that cBPF is supported there in the past,
> and CRIU strives to catch up with all kind of kernel APIs. It's
> restricted wrt restore to either the same kernel version or newer
> kernels, but that's generic issue for CRIU. So far I'm not aware of
> user requests for CRIU support, but it could come in future, though.
> Thus, should we have some dump interface it definitely needs to be
> i) generic and ii) usable for CRIU, so we don't end up with multiple
> dump mechanisms due to one API not being designed good enough to
> already cover it.
> 
> Dump result would need to look a bit similar to what we have in ELF
> file, that is, you'd need to construct map descriptions and reference
> them in the insn sequence similar to relocations in the loader. I
> haven't looked into whether it's feasible, but we should definitely
> evaluate possibilities to transform the post-verifier insn sequence
> back into a pre-verifier one for dumping, so we don't need to pay the
> huge additional cost (also a major fraction of the insns might be
> just duplicated here as they weren't rewritten). Perhaps it could
> be done with the help of small annotations, or some sort of diff
> that we only need to store. Not saying it's trivial, but we should
> definitely try hard first and design it carefully if we want to
> support such API.

+1

such an instruction dump is meaningful only for stateless programs.
Off the top of my head the minimum requirement is that they
shouldn't use maps or tail calls.
Also the extra memory is a concern, so I think we need a flag
for the BPF_PROG_LOAD command like 'store stateless original prog'.
Then at load 

Re: [PATCH net] bpf: expose netns inode to bpf programs

2017-02-04 Thread Alexei Starovoitov
On Sat, Feb 04, 2017 at 09:08:38AM -0800, Andy Lutomirski wrote:
> > So use-case would be that someone wants to attach the very same
> > prog via tc to various netdevs sitting in different netns, and
> > that prog looks up a map, controlled by initns, with skb->netns_inum
> > as key and the resulting value could contain allowed feature bits
> > for that specific netns prog the skbs goes through? That would be
> > a feature, not "concern", no? At the same time, it's up to the
> > user or mgmt app what gets loaded so f.e. it might just as well
> > tailor/optimize the progs individually for the devs sitting in
> > netns-es to avoid such map lookup.
> 
> Agreed.  I don't see why you would install the exact same program on
> two sockets in different netnses if the program contains, say, an
> ifindex.  Why not just install a variant with the right ifindex into
> each socket?

for some use cases it may be feasible to generate a different
program on the fly for every netns that is created. Some people
generate them already for things like IP addresses. For example, to stop
an attack on a particular vip, the fastest xdp program has
that vip built in, and is then compiled on the fly and loaded.
In other cases people prefer to have one program compiled once
and thoroughly tested offline, since some folks are worried
that on-the-fly compilation may cause the generated code to
be rejected by the verifier.
Therefore a new program for every netns may be acceptable
to some users, but not all.



Re: [PATCH v2 net] bpf: add bpf_sk_netns_id() helper

2017-02-04 Thread Alexei Starovoitov
On Sat, Feb 04, 2017 at 09:15:10AM -0800, Andy Lutomirski wrote:
> On Fri, Feb 3, 2017 at 5:22 PM, Alexei Starovoitov <a...@fb.com> wrote:
> > Note that all bpf programs types are global.
> 
> I don't think this has a clear enough meaning to work with.  In

Please clarify what you mean. The quoted part says
"bpf programs are global". What is not "clear enough" there?

> I think that this patch plus a minor change to prevent installing
> cgroup+bpf programs if the installer isn't in the init netns + fs ns
> would work because it would allow new, migratable semantics to be
> added down the road to relax the restriction.

Forcing installer to be in init netns is not acceptable to David
who added cgroup_sock in the first place. I'm not sure why
we have to discuss that bit in circles.



Re: [PATCH v2 net] bpf: add bpf_sk_netns_id() helper

2017-02-04 Thread Alexei Starovoitov
On Sat, Feb 04, 2017 at 07:33:14PM -0800, Andy Lutomirski wrote:
> On Sat, Feb 4, 2017 at 7:25 PM, Alexei Starovoitov
> <alexei.starovoi...@gmail.com> wrote:
> > On Sat, Feb 04, 2017 at 09:15:10AM -0800, Andy Lutomirski wrote:
> >> On Fri, Feb 3, 2017 at 5:22 PM, Alexei Starovoitov <a...@fb.com> wrote:
> >> > Note that all bpf programs types are global.
> >>
> >> I don't think this has a clear enough meaning to work with.  In
> >
> > Please clarify what you mean. The quoted part says
> > "bpf programs are global". What is not "clear enough" there?
> 
> What does "bpf programs are global" mean?  I am genuinely unable to
> figure out what you mean.  Here are some example guesses of what you
> might mean:
> 
>  - BPF programs are compiled independently of a namespace.  This is
> certainly true, but I don't think it matters.
> 
>  - You want BPF programs to affect everything on the system.  But this
> doesn't seem right to me -- they only affect things in the relevant
> cgroup, so they're not global in that sense.

All bpf program types are global in the sense that you can
make all of them operate across all possible scopes and namespaces.
cgroup only gives a scope for the program to run, but it's
not limited by it. The user can have the same program
attached to two or more different cgroups, so one program
will run across multiple cgroups.
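In syscall terms, "the same program attached to two cgroups" is just two
BPF_PROG_ATTACH calls with the same prog fd. A sketch only; the cgroup
paths and error handling are illustrative:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <linux/bpf.h>
#include <sys/syscall.h>

static int cg_attach(int prog_fd, int cg_fd)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.target_fd = cg_fd;
	attr.attach_bpf_fd = prog_fd;
	attr.attach_type = BPF_CGROUP_INET_SOCK_CREATE;
	return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
}

int attach_to_two_cgroups(int prog_fd)
{
	int cg1 = open("/sys/fs/cgroup/foo", O_RDONLY | O_DIRECTORY);
	int cg2 = open("/sys/fs/cgroup/bar", O_RDONLY | O_DIRECTORY);

	/* one loaded and verified program, switched on in two cgroup scopes */
	if (cg_attach(prog_fd, cg1) || cg_attach(prog_fd, cg2))
		return -1;
	return 0;
}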

>  - The set of BPF program types and the verification rules are
> independent of cgroup and namespace.  This is true, but I don't think
> it matters.

It matters. That's actually the key to understanding this. The loading part
verifies correctness for a particular program type.
Afterwards the same program can be attached anywhere,
including different cgroups and different namespaces.
The 'attach' part is like a 'switch on' that enables the program
on a particular hook. The scope (whether it's a socket or netdev or cgroup)
is something the program author uses to narrow down the hook,
but it's not an ultimate restriction.
For example a socket program can be attached to sockets and
share information with a cls_bpf program attached to a netdev.
A kprobe tracing program can peek into kernel internal data
and share it with cls_bpf or any other type, as long as
everything is root. The information flow is global to the whole system.
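A sketch of such cross-hook information flow: a kprobe program and a
cls_bpf program sharing one map (samples/bpf style; the kprobe target and
the meaning of the shared counter are only illustrative):

#include <uapi/linux/bpf.h>
#include <uapi/linux/ptrace.h>
#include <uapi/linux/pkt_cls.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") shared_cnt = {
	.type = BPF_MAP_TYPE_ARRAY,
	.key_size = sizeof(__u32),
	.value_size = sizeof(__u64),
	.max_entries = 1,
};

SEC("kprobe/tcp_retransmit_skb")
int count_retransmits(struct pt_regs *ctx)
{
	__u32 key = 0;
	__u64 *val = bpf_map_lookup_elem(&shared_cnt, &key);

	if (val)
		__sync_fetch_and_add(val, 1);
	return 0;
}

SEC("classifier")
int mark_on_retransmits(struct __sk_buff *skb)
{
	__u32 key = 0;
	__u64 *val = bpf_map_lookup_elem(&shared_cnt, &key);

	/* let tc act on state gathered by the tracing side */
	if (val && *val > 1000)
		skb->mark = 1;
	return TC_ACT_OK;
}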

> Because we're one week or so from 4.10 final, the 4.10-rc code is
> problematic even for ip vrf, and there isn't a clear solution yet.
> There are a bunch of requirements that seem to conflict, and something
> has to give.

let's go back to the beginning:
- you've identified a 'malfunction' in ip vrf. It's a valid one. Thank you.
- can it be fixed without kernel changes? Yes. David offered to do so.
- can the kernel make it easier to address? Yes. I posted v1.
- does the proposed sk->netns_inum v1 patch together with the posted iproute2
change address the 'malfunction'? Yes, but Eric didn't like inode-only. All fair.
- will the proposed bpf_sk_netns_id() v3 patch together with the upcoming iproute2
change address it? Yes.

What's not to like? There were several clear solutions.
The last one seems to be the best.
And the sooner it lands, the faster I can add the override-disable flag.



Re: [PATCH net-next v2 3/3] bpf: Always test unprivileged programs

2017-02-06 Thread Alexei Starovoitov

On 2/6/17 12:52 PM, Mickaël Salaün wrote:

If selftests are run as root, then execute the unprivileged checks as
well. This switch from 240 to 364 tests.

The test numbers are suffixed with "/u" when executed as unprivileged or
with "/p" when executed as privileged.

The geteuid() check is replaced with a capability check.

Handling capabilities requires the libcap dependency.

Signed-off-by: Mickaël Salaün <m...@digikod.net>
Cc: Alexei Starovoitov <a...@fb.com>
Cc: Daniel Borkmann <dan...@iogearbox.net>
Cc: Shuah Khan <sh...@kernel.org>


Acked-by: Alexei Starovoitov <a...@kernel.org>

you can keep acks when there are no changes to the patch.



Re: [PATCH net-next 0/9] mlx4: order-0 allocations and page recycling

2017-02-07 Thread Alexei Starovoitov
On Tue, Feb 07, 2017 at 08:26:23AM -0800, Eric Dumazet wrote:
> On Tue, 2017-02-07 at 08:06 -0800, Eric Dumazet wrote:
>

Awesome that you've started working on this. I think it's the correct approach
and mlx5 should be cleaned up in a similar way.
Long term we should be able to move all page alloc/free out of the drivers
completely.

> > /*
> >  * make sure we read the CQE after we read the ownership bit
> >  */
> > dma_rmb();
> > +   prefetch(frags[0].page);
> 
> Note that I would like to instead do a prefetch(frags[1].page)

yeah, these two look weird:
+   prefetch(frags[0].page);
+   va = page_address(frags[0].page) + frags[0].page_offset;

on most archs page_address() is just math (not a load from memory),
but the result != frags[0].page, so I'm missing what you are trying to prefetch?

prefetch(frags[1].page)
is even more confusing. what will it prefetch?

btw we had a patch that was doing prefetch of 'va' of next packet
and it was very helpful. Like this:
   pref_index = (index + 1) & ring->size_mask;
   pref = ring->rx_info + (pref_index << priv->log_rx_info);
   prefetch(page_address(pref->page) + pref->page_offset);

but since you're redesigning ring->rx_info... not sure how it will fit.

> So I will probably change how ring->rx_info is allocated
> 
> wasting all that space and forcing vmalloc() is silly :
> 
> tmp = size * roundup_pow_of_two(MLX4_EN_MAX_RX_FRAGS *
> sizeof(struct mlx4_en_rx_alloc));

I think you'd still need roundup_pow_of_two(), otherwise the priv->log_rx_info
optimization won't work.
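Roughly why (a sketch with assumed mlx4 names, not the actual driver code):
rx_info entries are addressed with a shift instead of a multiply, so the
per-entry stride must be a power of two:

#include <linux/log2.h>

/* stride per ring entry, rounded up so a shift can replace a multiply */
static void setup_log_rx_info(struct mlx4_en_priv *priv)
{
	size_t sz = MLX4_EN_MAX_RX_FRAGS * sizeof(struct mlx4_en_rx_alloc);

	priv->log_rx_info = ilog2(roundup_pow_of_two(sz));
}

static void *rx_info_entry(struct mlx4_en_rx_ring *ring,
			   struct mlx4_en_priv *priv, u32 index)
{
	/* per-entry access stays a cheap shift */
	return ring->rx_info + (index << priv->log_rx_info);
}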



Re: [PATCH net-next] bpf: fix verifier issue at check_packet_ptr_add

2017-02-02 Thread Alexei Starovoitov
On Thu, Feb 02, 2017 at 09:31:06PM -0800, William Tu wrote:
> 
> Yes, this is auto-generated. We want to use P4 2016 as front end to
> generate ebpf for XDP.

P4 2016 front-end? Is it public? Is there a 2017 version? ;)
just curious.

> >
> > The line 272 is r4 += r2
> > where R4=imm4 and R2=pkt_end
> 
> R2 is no longer pkt_end, it's R2 == R0 == 0
> 269: (bf) r2 = r0
> 270: (77) r2 >>= 3
> 271: (bf) r4 = r1
> 272: (0f) r4 += r2
> 
> So at line 272, it's pkt_ptr = pkt_ptr + 0
> thus the following fix works for us.
> -   if (imm <= 0) {
> +   if (imm < 0) {

got it. I forgot that we have:
  if (src_reg->type == CONST_IMM) {
  /* pkt_ptr += reg where reg is known constant */
  imm = src_reg->imm;
  goto add_imm;
  }
and got confused by the if (BPF_SRC(insn->code) == BPF_K) bit.
Thanks for explaining!
Could you respin with an extra test for it in test_verifier.c?
Since it's a rare case, it would be good to keep it working.
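For illustration, such a test could look something along these lines
(a sketch mirroring the existing direct-packet-access entries, not the
final respin):

	{
		"direct packet access: pkt_ptr += known constant 0",
		.insns = {
			BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1,
				    offsetof(struct __sk_buff, data)),
			BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_1,
				    offsetof(struct __sk_buff, data_end)),
			BPF_MOV64_IMM(BPF_REG_4, 0),
			/* pkt_ptr += reg where reg is known constant 0 */
			BPF_ALU64_REG(BPF_ADD, BPF_REG_2, BPF_REG_4),
			BPF_MOV64_REG(BPF_REG_5, BPF_REG_2),
			BPF_ALU64_IMM(BPF_ADD, BPF_REG_5, 8),
			BPF_JMP_REG(BPF_JGT, BPF_REG_5, BPF_REG_3, 1),
			BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_2, 0),
			BPF_MOV64_IMM(BPF_REG_0, 0),
			BPF_EXIT_INSN(),
		},
		.result = ACCEPT,
		.prog_type = BPF_PROG_TYPE_SCHED_CLS,
	},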



Re: [PATCH net-next] bpf: test for AND edge cases

2017-02-02 Thread Alexei Starovoitov

On 2/2/17 9:00 AM, Josef Bacik wrote:

These two tests are based on the work done for f23cc643f9ba.  The first test is
just a basic one to make sure we don't allow AND'ing negative values, even if it
would result in a valid index for the array.  The second is a cleaned up version
of the original testcase provided by Jann Horn that resulted in the commit.

Signed-off-by: Josef Bacik <jba...@fb.com>


Thanks for the tests! Much appreciated.


diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index 853d7e4..44404f1 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -2905,6 +2905,61 @@ static struct bpf_test tests[] = {
.result = REJECT,
.errstr = "invalid bpf_context access",
},
+   {
+   "invalid and of negative number",
+   .insns = {
+   BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+   BPF_LD_MAP_FD(BPF_REG_1, 0),
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+BPF_FUNC_map_lookup_elem),
+   BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 4),
+   BPF_MOV64_IMM(BPF_REG_1, 6),
+   BPF_ALU64_IMM(BPF_AND, BPF_REG_1, -4),
+   BPF_ALU64_IMM(BPF_LSH, BPF_REG_1, 2),
+   BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_1),
+   BPF_ST_MEM(BPF_DW, BPF_REG_0, 0,
+  offsetof(struct test_val, foo)),
+   BPF_EXIT_INSN(),
+   },
+   .fixup_map2 = { 3 },
+   .errstr_unpriv = "R0 pointer arithmetic prohibited",
+   .errstr = "R0 min value is negative, either use unsigned index or do a 
if (index >=0) check.",


the errstr doesn't have to compare the whole string. In case we find
typos or adjust the hint message, we'd need to change the test as well,
but I see it's being used as-is in other tests already, so we'll
fix all of them at once when the time comes.

Acked-by: Alexei Starovoitov <a...@kernel.org>




Re: [PATCH v4 3/3] samples/bpf: add lpm-trie benchmark

2017-01-22 Thread Alexei Starovoitov
On Sat, Jan 21, 2017 at 05:26:13PM +0100, Daniel Mack wrote:
> From: David Herrmann <dh.herrm...@gmail.com>
> 
> Extend the map_perf_test_{user,kern}.c infrastructure to stress test
> lpm-trie lookups. We hook into the kprobe on sys_gettid() and measure
> the latency depending on trie size and lookup count.
> 
> On my Intel Haswell i7-6400U, a single gettid() syscall with an empty
> bpf program takes roughly 6.5us on my system. Lookups in empty tries
> take ~1.8us on first try, ~0.9us on retries. Lookups in tries with 8192
> entries take ~7.1us (on the first _and_ any subsequent try).
> 
> Signed-off-by: David Herrmann <dh.herrm...@gmail.com>
> Reviewed-by: Daniel Mack <dan...@zonque.org>

Acked-by: Alexei Starovoitov <a...@kernel.org>

Thank you for all the hard work you've put into these patches.
All looks great to me.



Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2017-01-22 Thread Alexei Starovoitov
On Thu, Jan 19, 2017 at 08:04:59PM -0800, Andy Lutomirski wrote:
> On Thu, Jan 19, 2017 at 6:39 PM, Alexei Starovoitov
> <alexei.starovoi...@gmail.com> wrote:
> > On Wed, Jan 18, 2017 at 06:29:22PM -0800, Andy Lutomirski wrote:
> >> I think it could work by making a single socket cgroup controller that
> >> handles all cgroup things that are bound to a socket.  Using
> >
> > Such 'socket cgroup controller' would limit usability of the feature
> > to sockets and force all other use cases like landlock to invent
> > their own wheel, which is undesirable. Everyone will be
> > inventing new 'foo cgroup controller', while all of them
> > are really bpf features. They are different bpf program
> > types that attach to different hooks and use cgroup for scoping.
> 
> Can you elaborate on why that would be a problem?  In a cgroup v1
> world, users who want different hierarchies for different types of
> control could easily want one hierarchy for socket hooks and a
> different hierarchy for lsm hooks.  In a cgroup v2 delegation world, I
> could easily imagine the decision to delegate socket hooks being
> different from the decision to delegate lsm hooks.  Almost all of the
> code would be shared between different bpf-using cgroup controllers.

how do you think it can be enforced when the directory is chowned?

> >> Having thought about this some more, I think that making it would
> >> alleviate a bunch of my concerns, as it would make the semantics if
> >> the capable() check were relaxed to ns_capable() be sane.  Here's what
> >
> > here we're on the same page. For any meaningful discussion about
> > 'bpf cgroup controller' to happen bpf itself needs to become
> > delegatable in cgroup sense. In other words BPF_PROG_TYPE_CGROUP*
> > program types need to become available for unprivileged users.
> > The only unprivileged prog type today is BPF_PROG_TYPE_SOCKET_FILTER.
> > To make it secure we severely limited its functionality.
> > All bpf advances since then (like new map types and verifier extensions)
> > were done for root only. If early on the priv vs unpriv bpf features
> > were 80/20. Now it's close to 95/5. No work has been done to
> > make socket filter type more powerful. It still has to use
> > slow-ish ld_abs skb access while tc/xdp have direct packet access.
> > Things like register value tracking is root only as well and so on
> > and so forth.
> > We cannot just flip the switch and allow type_cgroup* to unpriv
> > and I don't see any volunteers willing to do this work.
> > Until that happens there is no point coming up with designs
> > for 'cgroup bpf controller'... whatever that means.
> 
> Sure there is.  If delegation can be turned on without changing the
> API, then the result will be easier to work with and have fewer
> compatibility issues.

... and open() of the directory done by the current api will preserve
cgroup delegation when and only when bpf_prog_type_cgroup_*
becomes unprivileged.
I'm not proposing creating a new api here.

> >
> >> I currently should happen before bpf+cgroup is enabled in a release:
> >>
> >> 1. Make it netns-aware.  This could be as simple as making it only
> >> work in the root netns because then real netns awareness can be added
> >> later without breaking anything.  The current situation is bad in that
> >> network namespaces are just ignored and it's plausible that people
> >> will start writing user code that depends on having network namespaces
> >> be ignored.
> >
> > nothing in bpf today is netns-aware and frankly I don't see
> > how cgroup+bpf has anything to do with netns.
> > For regular sockets+bpf we don't check netns.
> > When tcpdump opens raw socket and attaches bpf there are no netns
> > checks, since socket itself gives a scope for the program to run.
> > Same thing applies to cgroup+bpf. cgroup gives a scope for the program.
> > But, say, we indeed add 'if !root ns' check to BPF_CGROUP_INET_*
> > hooks.
> 
> 
> Here I completely disagree with you.  tcpdump sees packets in its
> network namespace.  Regular sockets apply bpf filters to the packets
> seen by that socket, and the socket itself is scoped to a netns.
> 
> Meanwhile, cgroup+bpf actually appears to be buggy in this regard even
> regardless of what semantics you think are better.  sk_bound_dev_if is
> exposed as a u32 value, but sk_bound_dev_if only has meaning within a
> given netns.  The "ip vrf" stuff will straight-up malfunction if a
> process affected by its hook runs in a different netns from the netns
> that "ip vrf" was run in.

how is that any differ

Re: [PATCH v4 1/3] bpf: add a longest prefix match trie map implementation

2017-01-22 Thread Alexei Starovoitov
On Sat, Jan 21, 2017 at 05:26:11PM +0100, Daniel Mack wrote:
> This trie implements a longest prefix match algorithm that can be used
> to match IP addresses to a stored set of ranges.
> 
> Internally, data is stored in an unbalanced trie of nodes that has a
> maximum height of n, where n is the prefixlen the trie was created
> with.
> 
> Tries may be created with prefix lengths that are multiples of 8, in
> the range from 8 to 2048. The key used for lookup and update operations
> is a struct bpf_lpm_trie_key, and the value is a uint64_t.
> 
> The code carries more information about the internal implementation.
> 
> Signed-off-by: Daniel Mack <dan...@zonque.org>
> Reviewed-by: David Herrmann <dh.herrm...@gmail.com>

Looks great to me.
Acked-by: Alexei Starovoitov <a...@kernel.org>



Re: [patch] samples/bpf: silence shift wrapping warning

2017-01-22 Thread Alexei Starovoitov
On Sat, Jan 21, 2017 at 07:51:43AM +0300, Dan Carpenter wrote:
> max_key is a value in the 0-63 range, so on 32 bit systems the shift
> could wrap.
> 
> Signed-off-by: Dan Carpenter <dan.carpen...@oracle.com>

Looks fine. I think 'net-next' is ok.

Acked-by: Alexei Starovoitov <a...@kernel.org>

> diff --git a/samples/bpf/lwt_len_hist_user.c b/samples/bpf/lwt_len_hist_user.c
> index ec8f3bb..bd06eef 100644
> --- a/samples/bpf/lwt_len_hist_user.c
> +++ b/samples/bpf/lwt_len_hist_user.c
> @@ -68,7 +68,7 @@ int main(int argc, char **argv)
>   for (i = 1; i <= max_key + 1; i++) {
>   stars(starstr, data[i - 1], max_value, MAX_STARS);
>   printf("%8ld -> %-8ld : %-8ld |%-*s|\n",
> -(1l << i) >> 1, (1l << i) - 1, data[i - 1],
> +(1ULL << i) >> 1, (1ULL << i) - 1, data[i - 1],
>  MAX_STARS, starstr);
>   }
>  


Re: [PATCH v4 2/3] bpf: Add tests for the lpm trie map

2017-01-22 Thread Alexei Starovoitov
On Sat, Jan 21, 2017 at 05:26:12PM +0100, Daniel Mack wrote:
> From: David Herrmann <dh.herrm...@gmail.com>
> 
> The first part of this program runs randomized tests against the
> lpm-bpf-map. It implements a "Trivial Longest Prefix Match" (tlpm)
> based on simple, linear, single linked lists. The implementation
> should be pretty straightforward.
> 
> Based on tlpm, this inserts randomized data into bpf-lpm-maps and
> verifies the trie-based bpf-map implementation behaves the same way
> as tlpm.
> 
> The second part uses 'real world' IPv4 and IPv6 addresses and tests
> the trie with those.
> 
> Signed-off-by: David Herrmann <dh.herrm...@gmail.com>
> Signed-off-by: Daniel Mack <dan...@zonque.org>

Acked-by: Alexei Starovoitov <a...@kernel.org>



Re: [PATCH] bpf: don't kfree an uninitialized im_node

2017-01-24 Thread Alexei Starovoitov
On Tue, Jan 24, 2017 at 6:16 AM, Colin King  wrote:
> From: Colin Ian King 
>
> There are some error exit paths to the label 'out' that end up
> kfree'ing an uninitialized im_node.  Fix this by initializing
> im_node to NULL to avoid kfree'ing a garbage address.

this fix already landed. See:
commit d140199af510 ("bpf, lpm: fix kfree of im_node in trie_update_elem")

> Issue found by CoverityScan, CID#1398022 ("Uninitialized pointer read")

Nice. Good to know that static analysis can do such checks.


Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns

2017-01-24 Thread Alexei Starovoitov
On Mon, Jan 23, 2017 at 08:32:02PM -0800, Andy Lutomirski wrote:
> On Mon, Jan 23, 2017 at 8:05 PM, David Ahern  wrote:
> > On 1/23/17 8:37 PM, Andy Lutomirski wrote:
> >> Yes, it is a bug because cgroup+bpf causes unwitting programs to be
> >> subject to BPF code installed by a different, potentially unrelated
> >> process.  That's a new situation.  The failure can happen when a
> >> privileged supervisor (whoever runs ip vrf) runs a clueless or
> >> unprivileged program (the thing calling unshare()).
> >
> > There are many, many ways to misconfigure networking and to run programs in 
> > a context or with an input argument that causes the program to not work at 
> > all, not work as expected or stop working. This situation is no different.
> >
> > For example, the only aspect of BPF_PROG_TYPE_CGROUP_SOCK filters that are 
> > namespace based is the ifindex. You brought up the example of changing 
> > namespaces where the ifindex is not defined. Alexei mentioned an example 
> > where interfaces can be moved to another namespace breaking any ifindex 
> > based programs. Another example is the interface can be deleted. Deleting 
> > an interface with sockets bound to it does not impact the program in any 
> > way - no notification, no wakeup, nothing. The sockets just don't work.
> 
> And if you use 'ip vrf' to bind to a vrf with ifindex 4 and a program
> unshares netns and creates an interface with ifindex 4, then that
> program will end up with its sockets magically bound to ifindex 4 and
> will silently malfunction.
> 
> I can think of multiple ways to address this problem.  You could scope
> the hooks to a netns (which is sort of what my patch does).  You could
> find a way to force programs in a given cgroup to only execute in a
> single netns, although that would probably cause other breakage.  You
> could improve the BPF hook API to be netns-aware, which could plausbly
> address issues related to unshare() but might get very tricky when
> setns() is involved.

scoping cgroup to netns would create a weird combination of cgroup and netns
that was never done before. cgroup and netns scopes have to be able
to overlap in arbitrary ways. An application shouldn't be able to escape
its cgroup scope by changing netns. They are orthogonal scoping constructs.
What we can do instead is add a bpf helper function to retrieve
the netns id, so that a program created for a specific netns and specific
ifindex will be more resilient to netns changes.
Note that this ifindex behavior has been present since classic bpf
and its shortcomings are understood. We need to improve it for all bpf
users. It really has nothing to do with cgroups.
cls_bpf filters can do packet redirect into an ifindex and they can break
if the user's setup changes.
The 'malfunction' scenario described above is no different from
cls_bpf using bpf_skb_under_cgroup() to scope a particular
skb->sk->cgroup and doing a redirect into an ifindex.
It will 'malfunction' exactly the same way if the application stays
in the same cgroup but moves from one netns into another.

> My point is that, in 4.10-rc, it doesn't work right, and I doubt this
> problem is restricted to just 'ip vrf'.  Without some kind of change
> to the way that netns and cgroup+bpf interact, anything that uses
> sk_bound_dev_if or reads the ifindex on an skb will be subject to a
> huge footgun that unprivileged programs can trigger and any future
> attempt to make the cgroup+bpf work for unprivileged users is going to
> be more complicated than it deserves to be.

For the n-th time, the current BPF_PROG_TYPE_CGROUP* is root only and
speculation about unprivileged usage is not helping the discussion.

> things up so that unshare will malfunction.  It should avoid
> malfunctioning when running Linux programs that are unaware of it.

I agree that for the VRF use case it will help to make programs netns
aware by adding a new bpf_get_current_netns_id() or similar helper,
but it's up to the program to function properly or be broken.
Adding weird netns restrictions is not going to make bpf programs
any more sane. The program author can always shoot themselves in the foot.

I will work on a patch to add that.
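For illustration, an 'ip vrf'-style cgroup_sock program could then guard its
hard-coded ifindex roughly like this. bpf_get_current_netns_id() is the
hypothetical helper mentioned above; its helper id and the constants are
placeholders that the generating tool would fill in:

#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

/* hypothetical helper, placeholder id */
static __u64 (*bpf_get_current_netns_id)(void) = (void *) 64;

#define VRF_IFINDEX	4	/* looked up by the generator in its netns */
#define GEN_NETNS_ID	0	/* id of the netns the program was built for */

SEC("cgroup/sock")
int vrf_bind(struct bpf_sock *sk)
{
	/* only touch bound_dev_if when the ifindex is meaningful,
	 * i.e. the socket is created in the netns the program was
	 * generated for
	 */
	if (bpf_get_current_netns_id() == GEN_NETNS_ID)
		sk->bound_dev_if = VRF_IFINDEX;
	return 1;
}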



Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns

2017-01-23 Thread Alexei Starovoitov
On Mon, Jan 23, 2017 at 12:36:08PM -0800, Andy Lutomirski wrote:
> To see how cgroup+bpf interacts with network namespaces, I wrote a
> little program called show_bind that calls getsockopt(...,
> SO_BINDTODEVICE, ...) and prints the result.  It did this:
> 
>  # ./ip link add dev vrf0 type vrf table 10
>  # ./ip vrf exec vrf0 ./show_bind
>  Default binding is "vrf0"
>  # ./ip vrf exec vrf0 unshare -n ./show_bind
>  show_bind: getsockopt: No such device
> 
> What's happening here is that "ip vrf" looks up vrf0's ifindex in
> the init netns and installs a hook that binds sockets to that
> ifindex.  When the hook runs in a different netns, it sets
> sk_bound_dev_if to an ifindex from the wrong netns, resulting in
> incorrect behavior.  In this particular example, the ifindex was 4
> and there was no ifindex 4 in the new netns.  If there had been,
> this test would have malfunctioned differently
> 
> Since it's rather late in the release cycle, let's punt.  This patch
> makes it impossible to install cgroup+bpf hooks outside the init
> netns and makes them not run on sockets that aren't in the init
> netns.
> 
> In a future release, it should be relatively straightforward to make
> these hooks be local to a netns and, if needed, to add a flag so
> that hooks can be made global if necessary.  Global hooks should
> presumably be constrained so that they can't write to any ifindex
> fields.
> 
> Cc: Daniel Borkmann <dan...@iogearbox.net>
> Cc: Alexei Starovoitov <a...@kernel.org>
> Cc: David Ahern <d...@cumulusnetworks.com>
> Signed-off-by: Andy Lutomirski <l...@kernel.org>
> ---
> 
> DaveM, this mitigates a bug in a feature that's new in 4.10, and the
> bug can be hit using current iproute2 -git.  please consider this for
> -net.
> 
> Changes from v1:
>  - Fix the commit message.  'git commit' was very clever and thought that
>all the interesting bits of the test case were intended to be comments
>and stripped them.  Whoops!
> 
> kernel/bpf/cgroup.c  | 21 +
>  kernel/bpf/syscall.c | 11 +++
>  2 files changed, 32 insertions(+)
> 
> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index a515f7b007c6..a824f543de69 100644
> --- a/kernel/bpf/cgroup.c
> +++ b/kernel/bpf/cgroup.c
> @@ -143,6 +143,17 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
>   if (!sk || !sk_fullsock(sk))
>   return 0;
>  
> + /*
> +  * For now, socket bpf hooks attached to cgroups can only be
> +  * installed in the init netns and only affect the init netns.
> +  * This could be relaxed in the future once some semantic issues
> +  * are resolved.  For example, ifindexes belonging to one netns
> +  * should probably not be visible to hooks installed by programs
> +  * running in a different netns.
> +  */
> + if (sock_net(sk) != &init_net)
> + return 0;

Nack.
Such check will create absolutely broken abi that we won't be able to fix later.

> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index e89acea22ecf..c0bbc55e244d 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -902,6 +902,17 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>   struct cgroup *cgrp;
>   enum bpf_prog_type ptype;
>  
> + /*
> +  * For now, socket bpf hooks attached to cgroups can only be
> +  * installed in the init netns and only affect the init netns.
> +  * This could be relaxed in the future once some semantic issues
> +  * are resolved.  For example, ifindexes belonging to one netns
> +  * should probably not be visible to hooks installed by programs
> +  * running in a different netns.

the comment is bogus and shows a complete lack of understanding of
how bpf works and how it's being used.
See SKF_AD_IFINDEX, which has been in classic bpf forever.

> +  */
> + if (current->nsproxy->net_ns != &init_net)
> + return -EINVAL;

this restriction I actually don't mind, since it indeed can be
relaxed later, but please make it proper with net_eq()
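i.e. something like this in the quoted hunk (illustrative only):

	if (!net_eq(current->nsproxy->net_ns, &init_net))
		return -EINVAL;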



Re: XDP offload to hypervisor

2017-01-23 Thread Alexei Starovoitov
On Mon, Jan 23, 2017 at 11:40:29PM +0200, Michael S. Tsirkin wrote:
> I've been thinking about passing XDP programs from guest to the
> hypervisor.  Basically, after getting an incoming packet, we could run
> an XDP program in host kernel.
> 
> If the result is XDP_DROP or XDP_TX we don't need to wake up the guest at all!

that's an interesting idea!
Long term, 'xdp offload' needs to be defined, since NICs are becoming smarter
and can accelerate xdp programs.
So pushing the xdp program down from virtio in the guest into the host,
and from x86 into the nic cpu, should probably be handled through the same api.

> When using tun for networking - especially with adjust_head - this
> unfortunately probably means we need to do a data copy unless there is
> enough headroom.  How much is enough though?

Frankly I don't understand the whole virtio nit picking that was happening.
imo virtio+xdp by itself is only useful for debugging, development and testing
of xdp programs in a VM. The discussion about performance of virtio+xdp
will only be meaningful when corresponding host part is done.
Likely in the form of vhost extensions and may be driver changes.
Trying to optimize virtio+xdp when host is doing traditional skb+vhost
isn't going to be impactful.
But when the host can do xdp in the physical NIC that can deliver raw
pages into vhost that get picked up by guest virtio, then we hopefully
will be around 10G line rate. A page pool is likely needed in such a scenario.
Some new xdp action like xdp_tx_into_vhost or whatever.
And the guest will be seeing full pages that the host nic provided, and the discussion
about headroom will be solved automatically.
Arguing that skb has 64-byte headroom and therefore we need to
reduce XDP_PACKET_HEADROOM is really upside down.

> Another issue is around host/guest ABI. Guest BPF could add new features
> at any point. What if hypervisor can not support it all?  I guess we
> could try loading program into hypervisor and run it within guest on
> failure to load, but this ignores question of cross-version
> compatibility - someone might start guest on a new host
> then try to move to an old one. So we will need an option
> "behave like an older host" such that guest can start and then
> move to an older host later. This will likely mean
> implementing this validation of programs in qemu userspace unless linux
> can supply something like this. Is this (disabling some features)
> something that might be of interest to larger bpf community?

In case of x86->nic offload not all xdp features will be supported
by the nic and that is expected. The user will request 'offload of xdp prog'
in some form and if it cannot be done, then xdp programs will run
on x86 as before. Same thing, I imagine, is applicable to virtio->host
offload. Therefore I don't see a need for user space visible
feature negotiation.

> With a device such as macvtap there exist configurations where a single
> guest is in control of the device (aka passthrough mode) in that case
> there's a potential to run xdp on host before host skb is built, unless
> host already has an xdp program attached.  If it does we could run the
> program within guest, but what if a guest program got attached first?
> Maybe we should pass a flag in the packet "xdp passed on this packet in
> host". Then, guest can skip running it.  Unless we do a full reset
> there's always a potential for packets to slip through, e.g. on xdp
> program changes. Maybe a flush command is needed, or force queue or
> device reset to make sure nothing is going on. Does this make sense?

All valid questions and concerns.
Since there is still no xdp_adjust_head support in virtio,
it feels kinda early to get into detailed 'virtio offload' discussion.



Re: [PATCH net-next] bpf, lpm: fix kfree of im_node in trie_update_elem

2017-01-23 Thread Alexei Starovoitov
On Tue, Jan 24, 2017 at 01:26:46AM +0100, Daniel Borkmann wrote:
> We need to initialize im_node to NULL, otherwise in case of error path
> it gets passed to kfree() as uninitialized pointer.
> 
> Fixes: b95a5c4db09b ("bpf: add a longest prefix match trie map 
> implementation")
> Signed-off-by: Daniel Borkmann <dan...@iogearbox.net>

Great catch.
Acked-by: Alexei Starovoitov <a...@kernel.org>



Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns

2017-01-23 Thread Alexei Starovoitov
On Mon, Jan 23, 2017 at 06:42:27PM -0800, Andy Lutomirski wrote:
> On Mon, Jan 23, 2017 at 6:09 PM, Alexei Starovoitov
> <alexei.starovoi...@gmail.com> wrote:
> > On Mon, Jan 23, 2017 at 12:36:08PM -0800, Andy Lutomirski wrote:
> >> To see how cgroup+bpf interacts with network namespaces, I wrote a
> >> little program called show_bind that calls getsockopt(...,
> >> SO_BINDTODEVICE, ...) and prints the result.  It did this:
> >>
> >>  # ./ip link add dev vrf0 type vrf table 10
> >>  # ./ip vrf exec vrf0 ./show_bind
> >>  Default binding is "vrf0"
> >>  # ./ip vrf exec vrf0 unshare -n ./show_bind
> >>  show_bind: getsockopt: No such device
> >>
> >> What's happening here is that "ip vrf" looks up vrf0's ifindex in
> >> the init netns and installs a hook that binds sockets to that
> >> ifindex.  When the hook runs in a different netns, it sets
> >> sk_bound_dev_if to an ifindex from the wrong netns, resulting in
> >> incorrect behavior.  In this particular example, the ifindex was 4
> >> and there was no ifindex 4 in the new netns.  If there had been,
> >> this test would have malfunctioned differently
> >>
> >> Since it's rather late in the release cycle, let's punt.  This patch
> >> makes it impossible to install cgroup+bpf hooks outside the init
> >> netns and makes them not run on sockets that aren't in the init
> >> netns.
> >>
> >> In a future release, it should be relatively straightforward to make
> >> these hooks be local to a netns and, if needed, to add a flag so
> >> that hooks can be made global if necessary.  Global hooks should
> >> presumably be constrained so that they can't write to any ifindex
> >> fields.
> >>
> >> Cc: Daniel Borkmann <dan...@iogearbox.net>
> >> Cc: Alexei Starovoitov <a...@kernel.org>
> >> Cc: David Ahern <d...@cumulusnetworks.com>
> >> Signed-off-by: Andy Lutomirski <l...@kernel.org>
> >> ---
> >>
> >> DaveM, this mitigates a bug in a feature that's new in 4.10, and the
> >> bug can be hit using current iproute2 -git.  please consider this for
> >> -net.
> >>
> >> Changes from v1:
> >>  - Fix the commit message.  'git commit' was very clever and thought that
> >>all the interesting bits of the test case were intended to be comments
> >>and stripped them.  Whoops!
> >>
> >> kernel/bpf/cgroup.c  | 21 +
> >>  kernel/bpf/syscall.c | 11 +++
> >>  2 files changed, 32 insertions(+)
> >>
> >> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> >> index a515f7b007c6..a824f543de69 100644
> >> --- a/kernel/bpf/cgroup.c
> >> +++ b/kernel/bpf/cgroup.c
> >> @@ -143,6 +143,17 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
> >>   if (!sk || !sk_fullsock(sk))
> >>   return 0;
> >>
> >> + /*
> >> +  * For now, socket bpf hooks attached to cgroups can only be
> >> +  * installed in the init netns and only affect the init netns.
> >> +  * This could be relaxed in the future once some semantic issues
> >> +  * are resolved.  For example, ifindexes belonging to one netns
> >> +  * should probably not be visible to hooks installed by programs
> >> +  * running in a different netns.
> >> +  */
> >> + if (sock_net(sk) != &init_net)
> >> + return 0;
> >
> > Nack.
> > Such check will create absolutely broken abi that we won't be able to fix 
> > later.
> 
> Please explain how the change results in a broken ABI and how the
> current ABI is better.  I gave a fully worked out example of how the
> current ABI misbehaves using only standard Linux networking tools.

I gave an example in the other thread that demonstrated
cgroup escape by the application when it changes netns.
If an application became part of a cgroup, it has to stay there,
no matter what setns() calls happen afterwards.

> >
> >> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> >> index e89acea22ecf..c0bbc55e244d 100644
> >> --- a/kernel/bpf/syscall.c
> >> +++ b/kernel/bpf/syscall.c
> >> @@ -902,6 +902,17 @@ static int bpf_prog_attach(const union bpf_attr *attr)
> >>   struct cgroup *cgrp;
> >>   enum bpf_prog_type ptype;
> >>
> >> + /*
> >> +  * For now, socket bpf hooks attached to cgroups can only be
> >> +  * installed in the init n

Re: XDP offload to hypervisor

2017-01-23 Thread Alexei Starovoitov
On Tue, Jan 24, 2017 at 05:33:37AM +0200, Michael S. Tsirkin wrote:
> On Mon, Jan 23, 2017 at 05:02:02PM -0800, Alexei Starovoitov wrote:
> > Frankly I don't understand the whole virtio nit picking that was happening.
> > imo virtio+xdp by itself is only useful for debugging, development and 
> > testing
> > of xdp programs in a VM. The discussion about performance of virtio+xdp
> > will only be meaningful when corresponding host part is done.
> > Likely in the form of vhost extensions and may be driver changes.
> > Trying to optimize virtio+xdp when host is doing traditional skb+vhost
> > isn't going to be impactful.
> 
> Well if packets can be dropped without a host/guest
> transition then yes, that will have an impact even
> with traditional skbs.

I don't think it's worth optimizing for though, since the speed of drop
matters for ddos-like use cases, and if we let the host be flooded with skbs,
we've already lost, since the only thing the cpu is doing is allocating skbs
and moving them around. Whether the drop happens upon entry into the VM
or the host does it in some post-vhost layer doesn't change the picture much.
That said, I do like the idea of offloading virtio+xdp into the host somehow.



Re: [patch] samples/bpf: silence shift wrapping warning

2017-01-23 Thread Alexei Starovoitov
On Mon, Jan 23, 2017 at 5:27 AM, Arnaldo Carvalho de Melo
<arnaldo.m...@gmail.com> wrote:
> Em Sun, Jan 22, 2017 at 02:51:25PM -0800, Alexei Starovoitov escreveu:
>> On Sat, Jan 21, 2017 at 07:51:43AM +0300, Dan Carpenter wrote:
>> > max_key is a value in the 0-63 range, so on 32 bit systems the shift
>> > could wrap.
>> >
>> > Signed-off-by: Dan Carpenter <dan.carpen...@oracle.com>
>>
>> Looks fine. I think 'net-next' is ok.
>
> I could process these patches, if that would help,

Thanks for the offer.
I don't think there will be conflicts with all the work happening in net-next,
but it's best to avoid even the possibility of it when we can.
Dan,
can you please resend the patch cc-ing Dave and netdev?
Please mention [PATCH net-next] in the subject.

> - Arnaldo
>
>> Acked-by: Alexei Starovoitov <a...@kernel.org>
>
>> > diff --git a/samples/bpf/lwt_len_hist_user.c 
>> > b/samples/bpf/lwt_len_hist_user.c
>> > index ec8f3bb..bd06eef 100644
>> > --- a/samples/bpf/lwt_len_hist_user.c
>> > +++ b/samples/bpf/lwt_len_hist_user.c
>> > @@ -68,7 +68,7 @@ int main(int argc, char **argv)
>> > for (i = 1; i <= max_key + 1; i++) {
>> > stars(starstr, data[i - 1], max_value, MAX_STARS);
>> > printf("%8ld -> %-8ld : %-8ld |%-*s|\n",
>> > -  (1l << i) >> 1, (1l << i) - 1, data[i - 1],
>> > +  (1ULL << i) >> 1, (1ULL << i) - 1, data[i - 1],
>> >MAX_STARS, starstr);
>> > }
>> >


Re: [net PATCH v5 1/6] virtio_net: use dev_kfree_skb for small buffer XDP receive

2017-01-24 Thread Alexei Starovoitov
On Tue, Jan 24, 2017 at 8:02 PM, John Fastabend
 wrote:
>
> Finally just to point out here are the drivers with XDP support on latest
> net tree,
>
> mlx/mlx5
> mlx/mlx4
> qlogic/qede
> netronome/nfp
> virtio_net
>
> And here is the list of adjust header support,
>
> mlx/mlx4
>

in net-next it's actually:
yes: mlx4, mlx5
no: qede, nfp, virtio
while nfp and virtio are working on it.

xdp_adjust_head() is a must-have for a load balancer,
so the sooner it lands for virtio the easier it will be
to develop xdp programs. Initially I expected
e1k+xdp to be the baseline for debugging and
development of xdp programs, but since not everyone
agreed on e1k, virtio+xdp filled the gap.
So without adjust_head in virtio I see very little use for it
in our environment.
It is a must-have feature regardless of timing.
I will backport whatever is necessary, but distros
will stick with official releases and imo it's not great
from an xdp adoption point of view to have
the virtio driver lacking key features.
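For reference, the reason adjust_head is a hard requirement: a load balancer
has to grow headroom and push an outer header before XDP_TX. A bare sketch,
with the encap header construction elided and ENCAP_LEN illustrative:

#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

#define ENCAP_LEN 20	/* e.g. an outer IPv4 header */

SEC("xdp")
int lb_encap(struct xdp_md *ctx)
{
	void *data, *data_end;

	/* negative delta grows headroom in front of the packet */
	if (bpf_xdp_adjust_head(ctx, -ENCAP_LEN))
		return XDP_DROP;

	data = (void *)(long)ctx->data;
	data_end = (void *)(long)ctx->data_end;
	if (data + ENCAP_LEN > data_end)
		return XDP_DROP;

	/* build the outer header in the new headroom here ... */

	return XDP_TX;
}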


Re: [PATCH net] bpf: expose netns inode to bpf programs

2017-01-26 Thread Alexei Starovoitov

On 1/26/17 8:37 AM, Andy Lutomirski wrote:

Think of bpf programs as safe kernel modules. They don't have
confined boundaries and program authors, if not careful, can shoot
themselves in the foot. We're not trying to prevent that because
it's impossible to check that the program is sane. Just like
it's impossible to check that kernel module is sane.
But in case of bpf we check that bpf program is _safe_ from the kernel
point of view. If it's doing some garbage, it's program's business.
Does it make more sense now?



With all due respect, I think this is not an acceptable way to think
about BPF at all.  If you think of BPF this way, I think there needs
to be a real discussion at KS or similar as to whether this is okay.
The reason is simple: the kernel promises a stable ABI to userspace
but not to kernel modules.  By thinking of BPF as more like a module,
you're taking a big shortcut that will either result in ABI breakage
down the road or in committing to a problematic stable ABI.


you misunderstood the analogy.
bpf abi is certainly stable. that's why we were careful not to
expose anything to it that is not already stable.



Re: [PATCH net] bpf: expose netns inode to bpf programs

2017-01-26 Thread Alexei Starovoitov

On 1/26/17 11:07 AM, Andy Lutomirski wrote:

On Thu, Jan 26, 2017 at 10:32 AM, Alexei Starovoitov <a...@fb.com> wrote:

On 1/26/17 10:12 AM, Andy Lutomirski wrote:


On Thu, Jan 26, 2017 at 9:46 AM, Alexei Starovoitov <a...@fb.com> wrote:


On 1/26/17 8:37 AM, Andy Lutomirski wrote:



Think of bpf programs as safe kernel modules. They don't have
confined boundaries and program authors, if not careful, can shoot
themselves in the foot. We're not trying to prevent that because
it's impossible to check that the program is sane. Just like
it's impossible to check that kernel module is sane.
But in case of bpf we check that bpf program is _safe_ from the kernel
point of view. If it's doing some garbage, it's program's business.
Does it make more sense now?



With all due respect, I think this is not an acceptable way to think
about BPF at all.  If you think of BPF this way, I think there needs
to be a real discussion at KS or similar as to whether this is okay.
The reason is simple: the kernel promises a stable ABI to userspace
but not to kernel modules.  By thinking of BPF as more like a module,
you're taking a big shortcut that will either result in ABI breakage
down the road or in committing to a problematic stable ABI.




you misunderstood the analogy.
bpf abi is certainly stable. that's why we were careful of not
exposing anything to it that is not already stable.



In that case I don't understand what you're trying to say.  Eric
thinks your patch exposes a bad interface.  A bad interface for
userspace is a very different thing from a bad interface available to
kernel modules.  Are you saying that BPF is kernel-module-like in that
the ABI exposed to BPF programs doesn't need to meet the same quality
standards as userspace ABIs?



of course not.
ns.inum is already exposed to user space as a value.
This patch exposes it to bpf programs in a convenient and stable way.


Here's what I'm imagining Eric is thinking:

ns.inum is currently exposed to userspace via procfs.  In principle,
the value could be local to a namespace, though, which would enable
CRIU to be able to preserve namespace inode numbers across a
checkpoint+restore operation.  If this happened, the contained and
restored procfs would see a different inode number than the outermost
procfs.


sure. there are many different ways for the program to see an inode
that either was already reused or has disappeared.
What I'm saying is that this is expected. We cannot prevent that from
the bpf side. Just like the ifindex value read by the program can be bogus,
as in the example I just provided.


If you start exposing the raw ns.inum field to BPF programs and those
programs are not themselves scoped to a namespace, then this could
create a problem for CRIU.


criu doesn't support ebpf because maps are not snapshot-able and
programs are detached from the control plane. I cannot see how one can
criu an xdp or cls program. The ssh connection to the box might die in
the middle while criu is messing with the unknown. Hence the analogy to
the kernel modules. Imagine a set of mini-kernel modules and a set
of apps that depend on them. What kind of criu can we even talk about?


But you told Eric that his nack doesn't matter, and maybe it would be
nice to ask him to clarify instead.


Fair enough. Eric, thoughts?



Re: [PATCH net-next 1/4] mlx5: Make building eswitch configurable

2017-01-30 Thread Alexei Starovoitov

On 1/29/17 1:11 AM, Saeed Mahameed wrote:


ConnectX4/5 and hopefully so on .. provide three different isolated
steering layers:

3. vport layer: available for any PF/VF vport nic driver instance
(netdevice), it allows vlan/mac filtering
  ,RSS hashing and n-tuple steering (for both encapsulated and
nonencapsulated traffic) and RFS steering. ( the code above only
writes flow entries of a PF/VF to its own vport flow tables, there is
another mechanism to propagate l2 steering rules down to eswitch from
the vport layer.

2. eswitch layer: Available for PFs only with
HCA_CAP.vport_group_manager capability set.
it allows steering between PF and different VFs on the same host (vlan
mac steering and ACL filters in sriov legacy mode, and fancy n-tuple
steering and offloads for switchdev mode - eswitch_offloads.c - )
if this table is not created the default is to pass traffic through to the PF

1. L2 table: Available for PFs only with HCA_CAP.vport_group_manager
capability set.
needed for MH configurations and only PF is allowed and should write
"request UC MAC - set_l2_table_entry" on behalf of the PF itself and
it's own VFs.

- On a bare metal machine only layer 3 is required (all traffic is
passed to the PF vport).
- On a MH configuration layer 3 and 1 are required.
- On a SRIOV configuration layer 3 and 2 are required.
- On MH with SRIOV all layers are required.

in the driver, eswitch and L2 layers are handled by PF@eswitch.c.

So for your question:

PF always init_eswitch ( no eswitch -sriov- tables are created), and
the eswitch will start listening for vport_change_events.

A PF/VF or netdev vport instance on any steering changes updates
should call  mlx5e_vport_context_update[1]

vport_context_update is A FW command that will store the current
UC/MC/VLAN list and promiscuity info of a vport.

The FW will generate an event to the PF driver eswitch manager (vport
manager) mlx5_eswitch_vport_event [2], and the PF eswitch will call
set_l2_table_entry for each UC mac on each vport change event of any
vport (including its own vport), in case of SRIOV is enabled it will
update eswitch tables as well.

To simplify my answer the function calls are:
Vport VF/PF netdevice:
mlx5e_set_rx_mode_work
 mlx5e_vport_context_update
mlx5e_vport_context_update_addr_list  --> FW event will be
generated to the PF eswitch manager

PF eswitch manager(eswitch.c) on a vport change FW event:
mlx5_eswitch_vport_event
   esw_vport_change_handler
esw_vport_change_handle_locked
esw_apply_vport_addr_list
   esw_add_uc_addr
  set_l2_table_entry --> this will
update the l2 table in case MH is enabled.


all makes sense. To test this logic I added printk-s
to the above functions, but I only see:
# ip link set eth0 addr 24:8a:07:47:2b:6e
[  148.861914] mlx5e_vport_context_update_addr_list: is_uc 1 err 0
[  148.875152] mlx5e_vport_context_update_addr_list: is_uc 0 err 0

MLX5_EVENT_TYPE_NIC_VPORT_CHANGE doesn't come into mlx5_eq_int().
Yet nic seems to work fine. Packets come and go.

broken firmware or expected behavior?

# ethtool -i eth0
driver: mlx5_core
version: 3.0-1 (January 2015)
firmware-version: 14.16.2024



[PATCH v2 net-next] net: ethtool: convert large order kmalloc allocations to vzalloc

2017-01-30 Thread Alexei Starovoitov
under memory pressure 'ethtool -S' command may warn:
[ 2374.385195] ethtool: page allocation failure: order:4, mode:0x242c0c0
[ 2374.405573] CPU: 12 PID: 40211 Comm: ethtool Not tainted
[ 2374.423071] Call Trace:
[ 2374.423076]  [] dump_stack+0x4d/0x64
[ 2374.423080]  [] warn_alloc_failed+0xeb/0x150
[ 2374.423082]  [] ? __alloc_pages_direct_compact+0x43/0xf0
[ 2374.423084]  [] __alloc_pages_nodemask+0x4dc/0xbf0
[ 2374.423091]  [] ? cmd_exec+0x722/0xcd0 [mlx5_core]
[ 2374.423095]  [] alloc_pages_current+0x8c/0x110
[ 2374.423097]  [] alloc_kmem_pages+0x19/0x90
[ 2374.423099]  [] kmalloc_order_trace+0x2e/0xe0
[ 2374.423101]  [] __kmalloc+0x204/0x220
[ 2374.423105]  [] dev_ethtool+0xe4e/0x1f80
[ 2374.423106]  [] ? dev_get_by_name_rcu+0x5e/0x80
[ 2374.423108]  [] dev_ioctl+0x156/0x560
[ 2374.423111]  [] ? mem_cgroup_commit_charge+0x78/0x3c0
[ 2374.423117]  [] sock_do_ioctl+0x42/0x50
[ 2374.423119]  [] sock_ioctl+0x1b3/0x250
[ 2374.423121]  [] do_vfs_ioctl+0x92/0x580
[ 2374.423123]  [] ? do_audit_syscall_entry+0x4b/0x70
[ 2374.423124]  [] ? syscall_trace_enter_phase1+0xfc/0x120
[ 2374.423126]  [] SyS_ioctl+0x79/0x90
[ 2374.423127]  [] do_syscall_64+0x50/0xa0
[ 2374.423129]  [] entry_SYSCALL64_slow_path+0x25/0x25

~1160 mlx5 counters ~= order 4 allocation which is unlikely to succeed
under memory pressure. Convert them to vzalloc() as ethtool_get_regs() does.
Also take care of drivers without counters similar to
commit 67ae7cf1eeda ("ethtool: Allow zero-length register dumps again")
and reduce warn_on to warn_on_once.

Signed-off-by: Alexei Starovoitov <a...@kernel.org>
---
Dave, I think by 'careful about size calculations' you mean
to take care of zero-length. Please let me know if I misunderstood.

I couldn't find any in-tree drivers that have zero length strings
and we already warn_on ops->get_sset_count() == 0, so it makes sense
to warn in ethtool_get_strings() as well.
---
 net/core/ethtool.c | 39 ++-
 1 file changed, 22 insertions(+), 17 deletions(-)

diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 236a21e3c878..6b3eee0834a0 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -1817,11 +1817,13 @@ static int ethtool_get_strings(struct net_device *dev, 
void __user *useraddr)
ret = __ethtool_get_sset_count(dev, gstrings.string_set);
if (ret < 0)
return ret;
+   if (ret > S32_MAX / ETH_GSTRING_LEN)
+   return -ENOMEM;
+   WARN_ON_ONCE(!ret);
 
gstrings.len = ret;
-
-   data = kcalloc(gstrings.len, ETH_GSTRING_LEN, GFP_USER);
-   if (!data)
+   data = vzalloc(gstrings.len * ETH_GSTRING_LEN);
+   if (gstrings.len && !data)
return -ENOMEM;
 
__ethtool_get_strings(dev, gstrings.string_set, data);
@@ -1830,12 +1832,13 @@ static int ethtool_get_strings(struct net_device *dev, 
void __user *useraddr)
if (copy_to_user(useraddr, &gstrings, sizeof(gstrings)))
goto out;
useraddr += sizeof(gstrings);
-   if (copy_to_user(useraddr, data, gstrings.len * ETH_GSTRING_LEN))
+   if (gstrings.len &&
+   copy_to_user(useraddr, data, gstrings.len * ETH_GSTRING_LEN))
goto out;
ret = 0;
 
 out:
-   kfree(data);
+   vfree(data);
return ret;
 }
 
@@ -1912,14 +1915,15 @@ static int ethtool_get_stats(struct net_device *dev, 
void __user *useraddr)
n_stats = ops->get_sset_count(dev, ETH_SS_STATS);
if (n_stats < 0)
return n_stats;
-   WARN_ON(n_stats == 0);
-
+   if (n_stats > S32_MAX / sizeof(u64))
+   return -ENOMEM;
+   WARN_ON_ONCE(!n_stats);
if (copy_from_user(&stats, useraddr, sizeof(stats)))
return -EFAULT;
 
stats.n_stats = n_stats;
-   data = kmalloc(n_stats * sizeof(u64), GFP_USER);
-   if (!data)
+   data = vzalloc(n_stats * sizeof(u64));
+   if (n_stats && !data)
return -ENOMEM;
 
ops->get_ethtool_stats(dev, &stats, data);
@@ -1928,12 +1932,12 @@ static int ethtool_get_stats(struct net_device *dev, 
void __user *useraddr)
if (copy_to_user(useraddr, &stats, sizeof(stats)))
goto out;
useraddr += sizeof(stats);
-   if (copy_to_user(useraddr, data, stats.n_stats * sizeof(u64)))
+   if (n_stats && copy_to_user(useraddr, data, n_stats * sizeof(u64)))
goto out;
ret = 0;
 
  out:
-   kfree(data);
+   vfree(data);
return ret;
 }
 
@@ -1948,17 +1952,18 @@ static int ethtool_get_phy_stats(struct net_device 
*dev, void __user *useraddr)
return -EOPNOTSUPP;
 
n_stats = phy_get_sset_count(phydev);
-
if (n_stats < 0)
return n_stats;
-   WARN_ON(n_stats == 0);
+   if (n_stats > S32_MAX / sizeof(u64))
+   return -ENOMEM;
+   WARN_ON_ONCE(!n_stats);
 
if (copy_from_user(&stats, userad

Re: [PATCH net-next 1/4] mlx5: Make building eswitch configurable

2017-01-30 Thread Alexei Starovoitov

On 1/30/17 1:18 PM, Saeed Mahameed wrote:

On Mon, Jan 30, 2017 at 6:45 PM, Alexei Starovoitov <a...@fb.com> wrote:

On 1/29/17 1:11 AM, Saeed Mahameed wrote:



ConnectX4/5 and hopefully so on .. provide three different isolated
steering layers:

3. vport layer: avaialbe for any PF/VF vport nic driver instance
(netdevice), it allows vlan/mac filtering
   ,RSS hashing and n-tuple steering (for both encapsulated and
nonencapsulated traffic) and RFS steering. ( the code above only
writes flow entries of a PF/VF to its own vport flow tables, there is
another mechanism to propagate l2 steering rules down to eswitch from
the vport layer.

2. eswitch layer: Available for PFs only with
HCA_CAP.vport_group_manager capability set.
it allows steering between PF and different VFs on the same host (vlan
mac steering and ACL filters in sriov legacy mode, and fancy n-tuple
steering and offloads for switchdev mode - eswitch_offloads.c - )
if this table is not create the default is pass-throu traffic to PF

1. L2 table: Available for PFs only with HCA_CAP.vport_group_manager
capability set.
needed for MH configurations and only PF is allowed and should write
"request UC MAC - set_l2_table_entry" on behalf of the PF itself and
it's own VFs.

- On a bare metal machine only layer 3 is required (all traffic is
passed to the PF vport).
- On a MH configuration layer 3 and 1 are required.
- On a SRIOV configuration layer 3 and 2 are required.
- On MH with SRIOV all layers are required.

in the driver, eswitch and L2 layers are handled by PF@eswitch.c.

So for your question:

PF always init_eswitch ( no eswitch -sriov- tables are created), and
the eswitch will start listening for vport_change_events.

A PF/VF or netdev vport instance on any steering changes updates
should call  mlx5e_vport_context_update[1]

vport_context_update is A FW command that will store the current
UC/MC/VLAN list and promiscuity info of a vport.

The FW will generate an event to the PF driver eswitch manager (vport
manager) mlx5_eswitch_vport_event [2], and the PF eswitch will call
set_l2_table_entry for each UC mac on each vport change event of any
vport (including its own vport), in case of SRIOV is enabled it will
update eswitch tables as well.

To simplify my answer the function calls are:
Vport VF/PF netdevice:
mlx5e_set_rx_mode_work
  mlx5e_vport_context_update
 mlx5e_vport_context_update_addr_list  --> FW event will be
generated to the PF esiwtch manager

PF eswitch manager(eswitch.c) on a vport change FW event:
mlx5_eswitch_vport_event
esw_vport_change_handler
 esw_vport_change_handle_locked
 esw_apply_vport_addr_list
esw_add_uc_addr
   set_l2_table_entry --> this will
update the l2 table in case MH is enabled.



all makes sense. To test this logic I added printk-s
to above functions, but I only see:
# ip link set eth0 addr 24:8a:07:47:2b:6e
[  148.861914] mlx5e_vport_context_update_addr_list: is_uc 1 err 0
[  148.875152] mlx5e_vport_context_update_addr_list: is_uc 0 err 0

MLX5_EVENT_TYPE_NIC_VPORT_CHANGE doesn't come into mlx5_eq_int().


Strange, just double checked and i got those events on latest net-next
bare-metal box.


Yet nic seems to work fine. Packets come and go.



Is it multi host configuration or bare metal ?


multihost


Do you have internal loopback traffic between different hosts ?


in a multihost? how can I check that?
Is there an ethtool command?


broken firmware or expected behavior?


which driver did you test ? backported or net-next ?


both backported and net-next with Tom's patches.


if it is backported driver please verify that on driver load the
following occurs :

1. VPORTS change events are globally enabled:
in mlx5_start_eqs@eq.c:
async_event_mask |= (1ull << MLX5_EVENT_TYPE_NIC_VPORT_CHANGE);


this one is done.


2. UC address change events are enabled for vport 0 (PF):
In eswitch_attach or on eswitch_init (depends on the kernel version) @eswitch.c
esw_enable_vport(esw, 0, UC_ADDR_CHANGE); is called.


this one is not. Tom's proposal to compile out eswitch.c
removes the invocation of mlx5_eswitch_attach() and the
corresponding esw_enable_vport() call as well.
The question is: why is it necessary?
What will break if it's not done?
So far we don't see any adverse effects in either multihost
or baremetal setups.


BTW folks, i am going to be on vacation for the rest of the week, so
please expect slow responses.


have a great time off. I hope other mlx folks can answer.



Re: [PATCH v2 net-next] net: ethtool: convert large order kmalloc allocations to vzalloc

2017-01-30 Thread Alexei Starovoitov

On 1/30/17 7:28 PM, Joe Perches wrote:

On Mon, 2017-01-30 at 18:25 -0800, Alexei Starovoitov wrote:

under memory pressure 'ethtool -S' command may warn:
[ 2374.385195] ethtool: page allocation failure: order:4, mode:0x242c0c0
[ 2374.405573] CPU: 12 PID: 40211 Comm: ethtool Not tainted
[ 2374.423071] Call Trace:
[ 2374.423076]  [] dump_stack+0x4d/0x64
[ 2374.423080]  [] warn_alloc_failed+0xeb/0x150
[ 2374.423082]  [] ? __alloc_pages_direct_compact+0x43/0xf0
[ 2374.423084]  [] __alloc_pages_nodemask+0x4dc/0xbf0
[ 2374.423091]  [] ? cmd_exec+0x722/0xcd0 [mlx5_core]
[ 2374.423095]  [] alloc_pages_current+0x8c/0x110
[ 2374.423097]  [] alloc_kmem_pages+0x19/0x90
[ 2374.423099]  [] kmalloc_order_trace+0x2e/0xe0
[ 2374.423101]  [] __kmalloc+0x204/0x220
[ 2374.423105]  [] dev_ethtool+0xe4e/0x1f80
[ 2374.423106]  [] ? dev_get_by_name_rcu+0x5e/0x80
[ 2374.423108]  [] dev_ioctl+0x156/0x560
[ 2374.423111]  [] ? mem_cgroup_commit_charge+0x78/0x3c0
[ 2374.423117]  [] sock_do_ioctl+0x42/0x50
[ 2374.423119]  [] sock_ioctl+0x1b3/0x250
[ 2374.423121]  [] do_vfs_ioctl+0x92/0x580
[ 2374.423123]  [] ? do_audit_syscall_entry+0x4b/0x70
[ 2374.423124]  [] ? syscall_trace_enter_phase1+0xfc/0x120
[ 2374.423126]  [] SyS_ioctl+0x79/0x90
[ 2374.423127]  [] do_syscall_64+0x50/0xa0
[ 2374.423129]  [] entry_SYSCALL64_slow_path+0x25/0x25

~1160 mlx5 counters ~= order 4 allocation which is unlikely to succeed
under memory pressure. Convert them to vzalloc() as ethtool_get_regs() does.
Also take care of drivers without counters similar to
commit 67ae7cf1eeda ("ethtool: Allow zero-length register dumps again")
and reduce warn_on to warn_on_once.


I think this is generally not a good idea as
most uses already fit fine in a kcalloc.


Most nics have large numbers of counters that don't fit into one page,
so that's already pushing the mm.
Especially in this case, control plane apps call 'ethtool -S'
periodically.


Maybe use Michal Hocko's kvmalloc changes.
https://lkml.org/lkml/2017/1/30/120


v1 discussion here
http://patchwork.ozlabs.org/patch/721122/
As I mentioned there, long term I don't mind using kvmalloc,
but the issue has to be fixed now, either via vzalloc or nowarn+noretry.
My stress testing with a memory hog shows that kmalloc fails
quite often; thankfully the user space daemon is ready for failures,
whereas the vzalloc approach works all the time with no extra headaches
for user space.
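
For reference, a rough sketch of the vzalloc shape being argued for here
(assumed form, not the exact v2 patch):

#include <linux/types.h>
#include <linux/vmalloc.h>

/* sketch only: allocate a zeroed u64 stats array without a high-order
 * kmalloc; callers still handle the n_stats == 0 case and vfree() the
 * result
 */
static u64 *ethtool_stats_alloc(unsigned int n_stats)
{
        return n_stats ? vzalloc(n_stats * sizeof(u64)) : NULL;
}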




Re: [RFC PATCH 2/2] ixgbe: add af_packet direct copy support

2017-01-30 Thread Alexei Starovoitov

On 1/27/17 1:34 PM, John Fastabend wrote:

+   h2 = page_address(rx_buffer->page) + rx_buffer->page_offset - hdrlen;
+   eth = page_address(rx_buffer->page) + rx_buffer->page_offset,


I don't think it compiles ;)


+   /* This indicates a bug in ixgbe leaving for testing purposes */
+   WARN_ON(TP_STATUS_USER & h2->tp_status);
+   len = le16_to_cpu(rx_desc->wb.upper.length);
+   h2->tp_len = len;
+   h2->tp_snaplen = len;
+   h2->tp_mac = ALIGN(TPACKET_ALIGN(TPACKET2_HDRLEN), L1_CACHE_BYTES);
+   h2->tp_net = h2->tp_mac + ETH_HLEN;
+   h2->tp_sec = div_s64_rem(ns, NSEC_PER_SEC, &rem);
+   h2->tp_nsec = rem;
+
+   sll = (void *)h2 + TPACKET_ALIGN(sizeof(struct tpacket2_hdr));
+   sll->sll_halen = ETH_HLEN;
+   memcpy(sll->sll_addr, eth->h_source, ETH_ALEN);
+   sll->sll_family = AF_PACKET;
+   sll->sll_hatype = rx_ring->netdev->type;
+   sll->sll_protocol = eth->h_proto;
+   sll->sll_pkttype = PACKET_HOST;
+   sll->sll_ifindex = rx_ring->netdev->ifindex;


Performance-wise it looks very expensive to do all these header copies
and an integer divide for every packet.
I think unless we move to a new dumb and simple header format,
the performance of this approach is not going to be satisfactory.



Re: [PATCH net-next 00/10] bnxt_en: Add XDP support.

2017-01-30 Thread Alexei Starovoitov
On Mon, Jan 30, 2017 at 08:49:25PM -0500, Michael Chan wrote:
> The first 8 patches refactor the code (rx/tx code paths and ring logic)
> and add the basic infrastructure to support XDP.  The 9th patch adds
> basic ndo_xdp to support XDP_DROP and XDP_PASS only.  The 10th patch
> completes the series with XDP_TX.

Looks great.
Could you please share performance numbers ?

Also please add something like:
  if (prog && prog->xdp_adjust_head) {
  netdev_warn(dev, "Does not support bpf_xdp_adjust_head()\n");
  return -EOPNOTSUPP;
  }
unless you plan to add adjust_head support before net-next closes.
Note, it's a must-have for load balancer functionality.



Re: [PATCH net-next 1/4] mlx5: Make building eswitch configurable

2017-01-27 Thread Alexei Starovoitov

On 1/27/17 1:15 PM, Saeed Mahameed wrote:

It is only mandatory for configurations that needs eswitch, where the
driver has no way to know about them, for a good old bare metal box,
eswitch is not needed.

we can do some work to strip the l2 table logic - needed for PFs to
work on multi-host - out of eswitch but again that would further
complicate the driver code since eswitch will still need to update l2
tables for VFs.


Saeed,
for multi-host setups every host in that multi-host doesn't
actually see the eswitch, no? Otherwise broken driver on one machine
can affect the other hosts in the same bundle? Please double check,
since this is absolutely critical HW requirement.



Re: [PATCH net-next 1/4] mlx5: Make building eswitch configurable

2017-01-28 Thread Alexei Starovoitov

On 1/28/17 3:20 AM, Saeed Mahameed wrote:

On Sat, Jan 28, 2017 at 1:23 AM, Alexei Starovoitov <a...@fb.com> wrote:

On 1/27/17 1:15 PM, Saeed Mahameed wrote:


It is only mandatory for configurations that needs eswitch, where the
driver has no way to know about them, for a good old bare metal box,
eswitch is not needed.

we can do some work to strip the l2 table logic - needed for PFs to
work on multi-host - out of eswitch but again that would further
complicate the driver code since eswitch will still need to update l2
tables for VFs.



Saeed,
for multi-host setups every host in that multi-host doesn't
actually see the eswitch, no? Otherwise broken driver on one machine
can affect the other hosts in the same bundle? Please double check,


each host (PF) has its own eswitch, and each eswitch lives in its own
"steering-space" and it can't affect others.


since this is absolutely critical HW requirement.



The only shared HW resource between hosts (PFs) is the simple l2 table,
and the only thing a host can ask from the l2 table (FW) is: "forward
UC MAC to me", and it is the responsibility of the driver eswitch
to do so.

the l2 table is created and managed by FW, SW eswitch can only request
from FW, and the FW is trusted.


Ok, clear, thanks for explaining.
Could you describe the sequence of function calls within mlx5
that does the assignment of the uc mac for the PF,
since I'm missing where the eswitch is involved?
I can see:
mlx5e_nic_enable | mlx5e_set_mac
  queue_work(priv->wq, &priv->set_rx_mode_work);
mlx5e_set_rx_mode_work
  mlx5e_apply_netdev_addr
mlx5e_add_l2_flow_rule



Re: [PATCH 0/6 v3] kvmalloc

2017-01-25 Thread Alexei Starovoitov
On Wed, Jan 25, 2017 at 5:21 AM, Michal Hocko <mho...@kernel.org> wrote:
> On Wed 25-01-17 14:10:06, Michal Hocko wrote:
>> On Tue 24-01-17 11:17:21, Alexei Starovoitov wrote:
>> > On Tue, Jan 24, 2017 at 04:17:52PM +0100, Michal Hocko wrote:
>> > > On Thu 12-01-17 16:37:11, Michal Hocko wrote:
>> > > > Hi,
>> > > > this has been previously posted as a single patch [1] but later on more
>> > > > built on top. It turned out that there are users who would like to have
>> > > > __GFP_REPEAT semantic. This is currently implemented for costly >64B
>> > > > requests. Doing the same for smaller requests would require to redefine
>> > > > __GFP_REPEAT semantic in the page allocator which is out of scope of
>> > > > this series.
>> > > >
>> > > > There are many open coded kmalloc with vmalloc fallback instances in
>> > > > the tree.  Most of them are not careful enough or simply do not care
>> > > > about the underlying semantic of the kmalloc/page allocator which means
>> > > > that a) some vmalloc fallbacks are basically unreachable because the
>> > > > kmalloc part will keep retrying until it succeeds b) the page allocator
>> > > > can invoke a really disruptive steps like the OOM killer to move 
>> > > > forward
>> > > > which doesn't sound appropriate when we consider that the vmalloc
>> > > > fallback is available.
>> > > >
>> > > > As it can be seen implementing kvmalloc requires quite an intimate
>> > > > knowledge if the page allocator and the memory reclaim internals which
>> > > > strongly suggests that a helper should be implemented in the memory
>> > > > subsystem proper.
>> > > >
>> > > > Most callers I could find have been converted to use the helper 
>> > > > instead.
>> > > > This is patch 5. There are some more relying on __GFP_REPEAT in the
>> > > > networking stack which I have converted as well but considering we do
>> > > > not have a support for __GFP_REPEAT for requests smaller than 64kB I
>> > > > have marked it RFC.
>> > >
>> > > Are there any more comments? I would really appreciate to hear from
>> > > networking folks before I resubmit the series.
>> >
>> > while this patchset was baking the bpf side switched to use 
>> > bpf_map_area_alloc()
>> > which fixes the issue with missing __GFP_NORETRY that we had to fix 
>> > quickly.
>> > See commit d407bd25a204 ("bpf: don't trigger OOM killer under pressure 
>> > with map alloc")
>> > it covers all kmalloc/vmalloc pairs instead of just one place as in this 
>> > set.
>> > So please rebase and switch bpf_map_area_alloc() to use kvmalloc().
>>
>> OK, will do. Thanks for the heads up.
>
> Just for the record, I will fold the following into the patch 1
> ---
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 19b6129eab23..8697f43cf93c 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -53,21 +53,7 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)
>
>  void *bpf_map_area_alloc(size_t size)
>  {
> -   /* We definitely need __GFP_NORETRY, so OOM killer doesn't
> -* trigger under memory pressure as we really just want to
> -* fail instead.
> -*/
> -   const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO;
> -   void *area;
> -
> -   if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
> -   area = kmalloc(size, GFP_USER | flags);
> -   if (area != NULL)
> -   return area;
> -   }
> -
> -   return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | flags,
> -PAGE_KERNEL);
> +   return kvzalloc(size, GFP_USER);
>  }
>
>  void bpf_map_area_free(void *area)

Looks fine by me.
Daniel, thoughts?


Re: [PATCH net] bpf: expose netns inode to bpf programs

2017-01-25 Thread Alexei Starovoitov

On 1/25/17 9:46 PM, Eric W. Biederman wrote:

Alexei Starovoitov <a...@fb.com> writes:


in cases where bpf programs are looking at sockets and packets
that belong to different netns, it could be useful to read netns inode,
so that programs can make intelligent decisions.
For example to disallow raw sockets in all non-init netns the program can do:
if (sk->type == SOCK_RAW && sk->netns_inum != 0xf075)
   return 0;
where 0xf075 inode comes from /proc/pid/ns/net

Similarly TC cls_bpf/act_bpf and socket filters can do
if (skb->netns_inum == expected_inode)


Nacked-by: "Eric W. Biederman" <ebied...@xmission.com>


I very much value your opinion, but your ack or nack doesn't apply here.
Exposing the existing inode has nothing to do with what you maintain.
It's a bpf feature that is exposing an id that is already visible to user space.
Period.
If I was proposing to expose some internal namespace id, then yes,
we'd need to have a discussion. ns.inum is already visible.


Particularly you need to compare more than the inode number.
Further I have never guaranteed there will be exactly one inode
per network namespace, just that if the device number and the inode
number pair match they are the same.


People already rely on inodes for _all_ namespaces.
The current implementation of
net_ns_net_init->..>ida_get_new is stuck the way it is.
We can change the ids and the generation algorithm, but uniqueness is
already assumed by user space.


Beyond that the entire concept is complete rubbish.


care to explain what you think the 'entire concept' is?


The only sane thing is to interpret whatever your bpf program in the
context of the program which installs it.


That's impossible. The programs are operating in a context that
is disjoint from the app that installs them.


If you can't do that you have a very broken piece of userspace
interface.  Which appears to be the case here.


Call it rubbish, but this is how it is.
cls_bpf is operating on packets. xdp_bpf is operating on raw dma buffers,
and there we might eventually need to look up sockets and net namespaces.
Think of bpf programs as safe kernel modules. They don't have
confined boundaries and program authors, if not careful, can shoot
themselves in the foot. We're not trying to prevent that because
it's impossible to check that the program is sane. Just like
it's impossible to check that kernel module is sane.
But in case of bpf we check that bpf program is _safe_ from the kernel
point of view. If it's doing some garbage, it's program's business.
Does it make more sense now?



[PATCH net] bpf: expose netns inode to bpf programs

2017-01-25 Thread Alexei Starovoitov
in cases where bpf programs are looking at sockets and packets
that belong to different netns, it could be useful to read netns inode,
so that programs can make intelligent decisions.
For example to disallow raw sockets in all non-init netns the program can do:
if (sk->type == SOCK_RAW && sk->netns_inum != 0xf075)
  return 0;
where 0xf075 inode comes from /proc/pid/ns/net

Similarly TC cls_bpf/act_bpf and socket filters can do
if (skb->netns_inum == expected_inode)

The lack of netns awareness was a concern even for socket filters,
since the application can attach the same bpf program to sockets
in a different netns. Just like tc cls_bpf program can work in
different netns as well, so it has to be addressed uniformly
across all types of bpf programs.

Signed-off-by: Alexei Starovoitov <a...@kernel.org>
---
with corresponding change in 'ip vrf' that David Ahern is working on,
this will address 'malfunction' concern that Andy discovered in 'ip vrf',
hence this fix is needed for 'net'.
---
 include/uapi/linux/bpf.h  |  2 ++
 net/core/filter.c | 27 +++
 samples/bpf/sock_flags_kern.c |  2 ++
 samples/bpf/sockex1_kern.c|  3 +++
 4 files changed, 34 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0eb0e87dbe9f..867cbe043d77 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -546,6 +546,7 @@ struct __sk_buff {
__u32 tc_classid;
__u32 data;
__u32 data_end;
+   __u32 netns_inum;
 };
 
 struct bpf_tunnel_key {
@@ -581,6 +582,7 @@ struct bpf_sock {
__u32 family;
__u32 type;
__u32 protocol;
+   __u32 netns_inum;
 };
 
 #define XDP_PACKET_HEADROOM 256
diff --git a/net/core/filter.c b/net/core/filter.c
index 1969b3f118c1..b2a15402c034 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3118,6 +3118,22 @@ static u32 sk_filter_convert_ctx_access(enum 
bpf_access_type type, int dst_reg,
*insn++ = BPF_MOV64_IMM(dst_reg, 0);
break;
 #endif
+   case offsetof(struct __sk_buff, netns_inum):
+#ifdef CONFIG_NET_NS
+   /* return dev_net(skb->dev)->ns.inum; */
+   BUILD_BUG_ON(FIELD_SIZEOF(struct net, ns.inum) != 4);
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_buff, dev),
+ dst_reg, src_reg,
+ offsetof(struct sk_buff, dev));
+   *insn++ = BPF_JMP_IMM(BPF_JEQ, dst_reg, 0, 2);
+   *insn++ = BPF_LDX_MEM(BPF_SIZEOF(void *), dst_reg, dst_reg,
+ offsetof(struct net_device, nd_net));
+   *insn++ = BPF_LDX_MEM(BPF_W, dst_reg, dst_reg,
+ offsetof(struct net, ns.inum));
+#else
+   *insn++ = BPF_MOV64_IMM(dst_reg, 0);
+   break;
+#endif
}
 
return insn - insn_buf;
@@ -3163,6 +3179,17 @@ static u32 sock_filter_convert_ctx_access(enum 
bpf_access_type type,
*insn++ = BPF_ALU32_IMM(BPF_AND, dst_reg, SK_FL_PROTO_MASK);
*insn++ = BPF_ALU32_IMM(BPF_RSH, dst_reg, SK_FL_PROTO_SHIFT);
break;
+   case offsetof(struct bpf_sock, netns_inum):
+#ifdef CONFIG_NET_NS
+   /* return sock_net(sk)->ns.inum; */
+   *insn++ = BPF_LDX_MEM(BPF_SIZEOF(void *), dst_reg, src_reg,
+ offsetof(struct sock, sk_net));
+   *insn++ = BPF_LDX_MEM(BPF_W, dst_reg, dst_reg,
+ offsetof(struct net, ns.inum));
+#else
+   *insn++ = BPF_MOV64_IMM(dst_reg, 0);
+   break;
+#endif
}
 
return insn - insn_buf;
diff --git a/samples/bpf/sock_flags_kern.c b/samples/bpf/sock_flags_kern.c
index 533dd11a6baa..6d9073509b45 100644
--- a/samples/bpf/sock_flags_kern.c
+++ b/samples/bpf/sock_flags_kern.c
@@ -9,8 +9,10 @@ SEC("cgroup/sock1")
 int bpf_prog1(struct bpf_sock *sk)
 {
char fmt[] = "socket: family %d type %d protocol %d\n";
+   char fmt2[] = "socket: netns_inum %u\n";
 
bpf_trace_printk(fmt, sizeof(fmt), sk->family, sk->type, sk->protocol);
+   bpf_trace_printk(fmt2, sizeof(fmt2), sk->netns_inum);
 
/* block PF_INET6, SOCK_RAW, IPPROTO_ICMPV6 sockets
 * ie., make ping6 fail
diff --git a/samples/bpf/sockex1_kern.c b/samples/bpf/sockex1_kern.c
index ed18e9a4909c..d522a70bd661 100644
--- a/samples/bpf/sockex1_kern.c
+++ b/samples/bpf/sockex1_kern.c
@@ -24,6 +24,9 @@ int bpf_prog1(struct __sk_buff *skb)
if (value)
__sync_fetch_and_add(value, skb->len);
 
+   char fmt[] = "skb %p netns inode %u\n";
+
+   bpf_trace_printk(fmt, sizeof(fmt), skb, skb->netns_inum);
return 0;
 }
 char _license[] SEC("license") = "GPL";
-- 
2.8.0



Re: [PATCH net] bpf: expose netns inode to bpf programs

2017-01-26 Thread Alexei Starovoitov

On 1/26/17 10:12 AM, Andy Lutomirski wrote:

On Thu, Jan 26, 2017 at 9:46 AM, Alexei Starovoitov <a...@fb.com> wrote:

On 1/26/17 8:37 AM, Andy Lutomirski wrote:


Think of bpf programs as safe kernel modules. They don't have
confined boundaries and program authors, if not careful, can shoot
themselves in the foot. We're not trying to prevent that because
it's impossible to check that the program is sane. Just like
it's impossible to check that kernel module is sane.
But in case of bpf we check that bpf program is _safe_ from the kernel
point of view. If it's doing some garbage, it's program's business.
Does it make more sense now?



With all due respect, I think this is not an acceptable way to think
about BPF at all.  If you think of BPF this way, I think there needs
to be a real discussion at KS or similar as to whether this is okay.
The reason is simple: the kernel promises a stable ABI to userspace
but not to kernel modules.  By thinking of BPF as more like a module,
you're taking a big shortcut that will either result in ABI breakage
down the road or in committing to a problematic stable ABI.



You misunderstood the analogy.
The bpf abi is certainly stable. That's why we were careful not to
expose anything to it that is not already stable.



In that case I don't understand what you're trying to say.  Eric
thinks your patch exposes a bad interface.  A bad interface for
userspace is a very different thing from a bad interface available to
kernel modules.  Are you saying that BPF is kernel-module-like in that
the ABI exposed to BPF programs doesn't need to meet the same quality
standards as userspace ABIs?


of course not.
ns.inum is already exposed to user space as a value.
This patch exposes it to bpf programs in a convenient and stable way,
therefore I don't see why it's a big deal to you and Eric and why it
has anything to do with namespaces in general. It doesn't change
any existing behavior and doesn't impose any new restrictions.
For example, ns.inum can be moved around: the user space visible field
'netns_inum' is a shadow of the kernel field. Only 'netns_inum'
has to be stable, and that is my headache.

The kernel module analogy is an attempt to explain that programs
can do insane things.
For example, the user can create a socket, attach a program to it, change
netns, create another socket and attach the same program.
Inside the program it can do 'if (skb->ifindex == xxx)'.
This would be nonsensical program, since ifindex is obviously scoped
by netns and comparing ifindex without regard to netns is bogus.
But kernel cannot prevent users to write such programs.
Hence the kernel module analogy: the kernel cannot prevent
nonsensical modules.

With this patch the user will be able to do
if (skb->netns_inum == ... && skb->ifindex == ...)
which would be a more sane thing to do, but without an appropriate
control plane it's also nonsensical, since the netns inode and
dev ifindex can disappear while the program is running.
We obviously don't want to pin net_devices and netns-es for the program;
that would be a debugging nightmare. Therefore the user has to write
the program understanding all this.
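
As a deliberately simplistic sketch of such a check (assuming the
netns_inum field proposed in the patch in this thread; the constants
are illustrative and would come from the control plane):

#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

SEC("socket")
int netns_aware_filter(struct __sk_buff *skb)
{
        /* only accept packets seen on ifindex 2 inside netns inode 0xf075 */
        if (skb->netns_inum != 0xf075 || skb->ifindex != 2)
                return 0;               /* drop for this socket */
        return skb->len;                /* keep the whole packet */
}
char _license[] SEC("license") = "GPL";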



Re: [PATCH v2] virtio_net: fix PAGE_SIZE > 64k

2017-01-24 Thread Alexei Starovoitov
On Tue, Jan 24, 2017 at 7:48 PM, John Fastabend
 wrote:
>
> It is a concern on my side. I want XDP and Linux stack to work
> reasonably well together.

btw the micro benchmarks showed that the page per packet approach
that xdp took in mlx4 should be 10% slower vs normal operation
for the tcp/ip stack. We thought that for our LB use case
it would be an acceptable slowdown, but it turned out that overall we
got a performance boost, since the xdp model simplified user space
and made the data path faster, so we magically got extra free cpu
that is used for other apps on the same host and an overall
perf win despite the extra overhead in tcp/ip.
Not all use cases are the same and not everyone will be as lucky,
so I'd like to see the performance of xdp_pass improving too, though
it turned out to be not as high a priority as I initially estimated.


Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns

2017-01-24 Thread Alexei Starovoitov
On Tue, Jan 24, 2017 at 01:24:54PM -0800, Andy Lutomirski wrote:
> On Tue, Jan 24, 2017 at 12:29 PM, David Ahern  
> wrote:
> >
> > Users do not run around exec'ing commands in random network contexts 
> > (namespace, vrf, device, whatever) and expect them to just work.
> 
> I worked on some code once (deployed in production, even!) that calls
> unshare() to make a netns and creates an interface and runs code in
> there and expects it to just work.

That is exactly the counter argument to your proposal, which makes cgroups
ignore bpf programs acting on sockets/apps that are no longer in the root ns.
If an app can escape the cgroup by switching netns, it breaks the scoping
assumption in a way that a user space control plane using cgroup+bpf cannot
oversee. The app should just work and stay visible to the bpf program within
that cgroup scope. More below.

> setns() is a bit more complicated, but it should still be fine.
> netns_install() requires CAP_NET_ADMIN over the target netns, so you
> can only switch in to a netns if you already have privilege in that
> netns.

it's not fine. Consider our main use case which is cgroup-scoped traffic
monitoring and enforcement for well behaving apps.
The container management framework cannot use sandboxing and monitor
all syscalls, because it's unacceptable overhead for production apps.
These apps do unknown things including netns. Some of them run as root too.
The container management needs to see what these apps are doing from
networking point of view with lowest possible overhead. While
cgroup+bpf hook is independent from netns, it's all fine. Initially
the program will simply count bytes/packets and associate them with jobs
where a job is a collection of processes. In some cases we need to
restrict whom these jobs can talk to and the amount of traffic they send.
Container management doesn't place jobs into netns-es today, because
veth overhead is too high. Something like afnetns might be an option.
Whether it's going to be netns or afnetns should be orthogonal to
cgroup scoped monitoring. A job per cgroup may have multiple
netns inside and the other way around. Multiple cgroup scopes
within single netns. I don't think there is a case where
cgroup/netns scopes will be misaligned, like:
cgroup A + netns 1 and 2
cgroup B + netns 2 and 3
but I don't see a reason to restrict that either.

> But perhaps they should be less disjoint.  As far as I know,
> cgroup+bpf is the *only* network configuration that is being set up to
> run across all network namespaces.  No one has said why this is okay,
> let alone why it's preferable to making it work per-netns just like
> basically all other Linux network configuration.

A single cls_bpf program attached to all netdevs in the tc egress
hook can call bpf_skb_under_cgroup() to check whether a given skb
is under a given cgroup and afterwards decide to drop/redirect.
In terms of 'run across all netns' it's exactly the same.
Our monitoring use case came out of it.
Unfortunately bpf_skb_under_cgroup() in cls_bpf is not scalable.
It works fine for quick on-the-box monitoring where the user logs in
to the box and can see what a particular job is doing,
but it's not usable when there are thousands of cgroups and
we need to monitor them all.
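
A rough sketch of that cls_bpf pattern (map layout and section name are
illustrative; the cgroup of interest is placed into slot 0 of a
BPF_MAP_TYPE_CGROUP_ARRAY map by the control plane):

#include <uapi/linux/bpf.h>
#include <uapi/linux/pkt_cls.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") cgrp_map = {
        .type           = BPF_MAP_TYPE_CGROUP_ARRAY,
        .key_size       = sizeof(__u32),
        .value_size     = sizeof(__u32),
        .max_entries    = 1,
};

SEC("classifier")
int cls_cgroup_drop(struct __sk_buff *skb)
{
        /* drop egress traffic that belongs to the monitored cgroup */
        if (bpf_skb_under_cgroup(skb, &cgrp_map, 0) == 1)
                return TC_ACT_SHOT;
        return TC_ACT_OK;
}
char _license[] SEC("license") = "GPL";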

> I was hoping for an actual likely use case for the bpf hooks to be run
> in all namespaces.  You're arguing that iproute2 can be made to work
> mostly okay if bpf hooks can run in all namespaces, but the use case
> of intentionally making sk_bound_dev_if invalid across all namespaces
> seems dubious.

I think what Tejun is proposing regarding adding global_ifindex
is a great idea and it will make ifindex interaction cleaner not only
for cgroup+bpf, but for socket and cls_bpf programs too.
I think it would be ok to disallow unprivileged programs (whenever
they come) from using bpf_sock->bound_dev_if and only
allow bpf_sock->global_bound_dev_if, and that should solve
your security concern for future unpriv programs.

I think a bpf_get_sock_netns_id() helper or an sk->netns_id field would
be a good addition as well, since it will allow 'ip vrf' to be smarter today.
It's also more flexible, since a bpf_type_cgroup_sock program can make
its own decision when netns_id != expected_id instead of it being hard coded.
Today the ip vrf use case does: 'sk->bound_dev_if = idx; return OK;'
it will be able to:
  if (sk->netns_id != expected_id)
return DROP;
  sk->bound_dev_if = idx;
  return OK;
or
  if (sk->netns_id != expected_id)
return OK;
  sk->bound_dev_if = idx;
  return OK;
or it will be able to bpf_trace_printk() or bpf_perf_event_output()
to send a notification to the vrf user space daemon and so on.
Such checks will run at the same speed as checks inside
__cgroup_bpf_run_filter_sk(), but all other users won't pay
for them when not in use. imo it's a win-win.
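
Put together, a sketch of such a cgroup_sock program (note: sk->netns_id
does not exist yet, it is the field proposed above; the expected id and
the vrf ifindex are illustrative constants):

#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

SEC("cgroup/sock")
int vrf_bind(struct bpf_sock *sk)
{
        if (sk->netns_id != 0x3f075)
                return 1;               /* different netns: leave the socket alone */
        sk->bound_dev_if = 4;           /* bind new sockets to the vrf device */
        return 1;                       /* allow socket creation */
}
char _license[] SEC("license") = "GPL";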

In parallel I'm planning to work on 'disallow override' flag
for bpf_prog_attach. That should solve remaining concern.
I still disagree about urgency of 

[PATCH net-next] net: ethtool: silence kmalloc warning

2017-01-28 Thread Alexei Starovoitov
under memory pressure 'ethtool -S' command may warn:
[ 2374.385195] ethtool: page allocation failure: order:4, mode:0x242c0c0
[ 2374.405573] CPU: 12 PID: 40211 Comm: ethtool Not tainted
[ 2374.423071] Call Trace:
[ 2374.423076]  [] dump_stack+0x4d/0x64
[ 2374.423080]  [] warn_alloc_failed+0xeb/0x150
[ 2374.423082]  [] ? __alloc_pages_direct_compact+0x43/0xf0
[ 2374.423084]  [] __alloc_pages_nodemask+0x4dc/0xbf0
[ 2374.423091]  [] ? cmd_exec+0x722/0xcd0 [mlx5_core]
[ 2374.423095]  [] alloc_pages_current+0x8c/0x110
[ 2374.423097]  [] alloc_kmem_pages+0x19/0x90
[ 2374.423099]  [] kmalloc_order_trace+0x2e/0xe0
[ 2374.423101]  [] __kmalloc+0x204/0x220
[ 2374.423105]  [] dev_ethtool+0xe4e/0x1f80
[ 2374.423106]  [] ? dev_get_by_name_rcu+0x5e/0x80
[ 2374.423108]  [] dev_ioctl+0x156/0x560
[ 2374.423111]  [] ? mem_cgroup_commit_charge+0x78/0x3c0
[ 2374.423117]  [] sock_do_ioctl+0x42/0x50
[ 2374.423119]  [] sock_ioctl+0x1b3/0x250
[ 2374.423121]  [] do_vfs_ioctl+0x92/0x580
[ 2374.423123]  [] ? do_audit_syscall_entry+0x4b/0x70
[ 2374.423124]  [] ? syscall_trace_enter_phase1+0xfc/0x120
[ 2374.423126]  [] SyS_ioctl+0x79/0x90
[ 2374.423127]  [] do_syscall_64+0x50/0xa0
[ 2374.423129]  [] entry_SYSCALL64_slow_path+0x25/0x25

~1160 mlx5 counters ~= an order 4 allocation which is unlikely to succeed
under memory pressure.  Since the 'get stats' command is not critical,
avoid reclaim and warning.
Also convert to safer kmalloc_array.

Signed-off-by: Alexei Starovoitov <a...@kernel.org>
---
Long term this place is a good candidate to use kvmalloc() once it's merged.
---
 net/core/ethtool.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 236a21e3c878..be681a06bf3f 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -1820,7 +1820,8 @@ static int ethtool_get_strings(struct net_device *dev, 
void __user *useraddr)
 
gstrings.len = ret;
 
-   data = kcalloc(gstrings.len, ETH_GSTRING_LEN, GFP_USER);
+   data = kcalloc(gstrings.len, ETH_GSTRING_LEN,
+  GFP_USER | __GFP_NORETRY | __GFP_NOWARN);
if (!data)
return -ENOMEM;
 
@@ -1918,7 +1919,8 @@ static int ethtool_get_stats(struct net_device *dev, void 
__user *useraddr)
return -EFAULT;
 
stats.n_stats = n_stats;
-   data = kmalloc(n_stats * sizeof(u64), GFP_USER);
+   data = kmalloc_array(n_stats, sizeof(u64),
+GFP_USER | __GFP_NORETRY | __GFP_NOWARN);
if (!data)
return -ENOMEM;
 
@@ -1957,7 +1959,8 @@ static int ethtool_get_phy_stats(struct net_device *dev, 
void __user *useraddr)
return -EFAULT;
 
stats.n_stats = n_stats;
-   data = kmalloc_array(n_stats, sizeof(u64), GFP_USER);
+   data = kmalloc_array(n_stats, sizeof(u64),
+GFP_USER | __GFP_NORETRY | __GFP_NOWARN);
if (!data)
return -ENOMEM;
 
-- 
2.8.0



Re: linux-next: build failure after merge of the net tree

2017-02-15 Thread Alexei Starovoitov

On 2/15/17 7:02 PM, Stephen Rothwell wrote:

Hi all,

On Tue, 14 Feb 2017 09:12:50 +1100 Stephen Rothwell  
wrote:


After merging the net tree, today's linux-next build (powerpc64le perf)
failed like this:

Warning: tools/include/uapi/linux/bpf.h differs from kernel
bpf.c: In function 'bpf_prog_attach':
bpf.c:180:6: error: 'union bpf_attr' has no member named 'attach_flags'; did 
you mean 'map_flags'?
   attr.attach_flags  = flags;
   ^

Caused by commit

   7f677633379b ("bpf: introduce BPF_F_ALLOW_OVERRIDE flag")



So do we have a fix for this?  I am sure that Dave would like to send
his "net" tree to Linus sometime soonish ...


Do you mind resending it to netdev with my Ack?
Please mention [PATCH net] in the subject, so it gets caught by Dave's scripts.




Re: linux-next: build failure after merge of the net tree

2017-02-15 Thread Alexei Starovoitov

On 2/15/17 7:27 PM, David Miller wrote:

From: Alexei Starovoitov <a...@fb.com>
Date: Wed, 15 Feb 2017 19:06:02 -0800


On 2/15/17 7:02 PM, Stephen Rothwell wrote:

Hi all,

On Tue, 14 Feb 2017 09:12:50 +1100 Stephen Rothwell
<s...@canb.auug.org.au> wrote:


After merging the net tree, today's linux-next build (powerpc64le
perf)
failed like this:

Warning: tools/include/uapi/linux/bpf.h differs from kernel
bpf.c: In function 'bpf_prog_attach':
bpf.c:180:6: error: 'union bpf_attr' has no member named
'attach_flags'; did you mean 'map_flags'?
attr.attach_flags  = flags;
^

Caused by commit

7f677633379b ("bpf: introduce BPF_F_ALLOW_OVERRIDE flag")



So do we have a fix for this?  I am sure that Dave would like to send
his "net" tree to Linus sometime soonish ...


Do you mind resending it to netdev with my Ack?
Please mention [PATCH net] in the subject, so it gets caught by Dave's
scripts.


I applied the fix that synced the two header files already.


Ahh. Thanks!




Re: [PATCH net-next] bpf: return errno -ENOMEM when exceeding RLIMIT_MEMLOCK

2017-02-21 Thread Alexei Starovoitov
On Tue, Feb 21, 2017 at 02:00:13PM +0100, Jesper Dangaard Brouer wrote:
> On Tue, 21 Feb 2017 00:06:11 -0800
> Alexei Starovoitov <alexei.starovoi...@gmail.com> wrote:
> 
> > On Mon, Feb 20, 2017 at 05:25:58PM +0100, Jesper Dangaard Brouer wrote:
> > > On Mon, 20 Feb 2017 16:57:34 +0100
> > > Daniel Borkmann <dan...@iogearbox.net> wrote:
> > >   
> > > > On 02/20/2017 04:35 PM, Jesper Dangaard Brouer wrote:  
> > > > > It is confusing users of samples/bpf that exceeding the resource
> > > > > limits for RLIMIT_MEMLOCK result in an "Operation not permitted"
> > > > > message.  This is due to bpf limits check return -EPERM.
> > > > >
> > > > > Instead return -ENOMEM, like most other users of this API.
> > > > >
> > > > > Fixes: aaac3ba95e4c ("bpf: charge user for creation of BPF maps and 
> > > > > programs")
> > > > > Fixes: 6c9059817432 ("bpf: pre-allocate hash map elements")
> > > > > Fixes: 5ccb071e97fb ("bpf: fix overflow in prog accounting")
> > > > 
> > > > Btw, last one just moves the helper so fixes doesn't really apply
> > > > there, but apart from that this is already uapi exposed behavior
> > > > like this for ~1.5yrs, so unfortunately too late to change now. I
> > > > think the original intention (arguably confusing in this context)
> > > > was that user doesn't have (rlimit) permission to allocate this
> > > > resource.  
> > > 
> > > This is obviously confusing end-users, thus it should be fixed IMHO.  
> > 
> > I don't think it's confusing and I think EPERM makes
> > the most sense as return code in such situation.
> 
> Most other kernel users return ENOMEM.

the places in the kernel that check for memlock are not fully consistent.

perf does this:
  lock_limit = rlimit(RLIMIT_MEMLOCK);
  lock_limit >>= PAGE_SHIFT;
  locked = vma->vm_mm->pinned_vm + extra;

  if ((locked > lock_limit) && perf_paranoid_tracepoint_raw() &&
  !capable(CAP_IPC_LOCK)) {
  ret = -EPERM;
  goto unlock;
  }

and in bpf land we got an idea of using memlock limit from perf.

> Documented it here:
>  https://prototype-kernel.readthedocs.io/en/latest/bpf/troubleshooting.html

Thanks! Looks good.



Re: Questions on XDP

2017-02-18 Thread Alexei Starovoitov
On Sat, Feb 18, 2017 at 3:48 PM, John Fastabend
 wrote:
>
> We are running our vswitch in userspace now for many workloads
> it would be nice to have these in kernel if possible.
...
> Maybe Alex had something else in mind but we have many virtual interfaces
> plus physical interfaces in vswitch use case. Possibly thousands.

Virtual interfaces towards many VMs are certainly a good use case
that we need to address.
We'd still need to copy the packet from the memory of one vm into another,
right? So the per-packet allocation strategy for a virtual interface can
be anything.

Sounds like you already have patches that do that?
Excellent. Please share.


Re: Questions on XDP

2017-02-20 Thread Alexei Starovoitov
On Sat, Feb 18, 2017 at 06:16:47PM -0800, Alexander Duyck wrote:
>
> I was thinking about the fact that the Mellanox driver is currently
> mapping pages as bidirectional, so I was sticking to the device to
> device case in regards to that discussion.  For virtual interfaces we
> don't even need the DMA mapping, it is just a copy to user space we
> have to deal with in the case of vhost.  In that regard I was thinking
> we need to start looking at taking XDP_TX one step further and
> possibly look at supporting the transmit of an xdp_buf on an unrelated
> netdev.  Although it looks like that means adding a netdev pointer to
> xdp_buf in order to support returning that.

An xdp_tx variant (via bpf_xdp_redirect as John proposed) should work.
I don't see why such tx into another netdev cannot be done today.
The only requirement is that it shouldn't be driver specific.
Whichever way it's implemented in ixgbe/i40e should be applicable
to mlx*, bnx*, nfp at least.

> Anyway I am just running on conjecture at this point.  But it seems
> like if we want to make XDP capable of doing transmit we should
> support something other than bounce on the same port since that seems
> like a "just saturate the bus" use case more than anything.  I suppose
> you can do a one armed router, or have it do encap/decap for a tunnel,
> but that is about the limits of it. 

one armed router is exactly our ILA router use case.
encap/decap is our load balancer use case.

From your other email:
> 1.  The Tx code is mostly just a toy.  We need support for more
> functional use cases.

this tx toy is serving real traffic.
Adding more use cases to xdp is nice, but we cannot sacrifice
performance of these bread and butter use cases like ddos and lb.

> 2.  1 page per packet is costly, and blocks use on the intel drivers,
> mlx4 (after Eric's patches), and 64K page architectures.

1 page per packet is costly on archs with 64k pagesize. that's it.
I see no reason to waste x86 cycles to improve perf on such archs.
If the argument is truesize socket limits due to 4k vs 2k, then
please show the patch where split page can work just as fast
as page per packet and everyone will be giving two thumbs up.
If we can have 2k truesize with the same xdp_drop/tx performance
then by all means please do it.

But I suspect what is really happening is a premature defense
of likely mediocre ixgbe xdp performance on xdp_drop due to split page...
If so, that's only ixgbe's fault, and trying to make other
nics slower to have apples to apples with ixgbe is just wrong.

> 3.  Should we support scatter-gather to support 9K jumbo frames
> instead of allocating order 2 pages?

we can, if main use case of mtu < 4k doesn't suffer.

> If we allow it to do transmit on
> other netdevs then suddenly this has the potential to replace
> significant existing infrastructure.

What existing infrastructure are we talking about?
The clear containers need is clear :)
xdp_redirect into vhost/virtio would be great to have,
but xdp_tx from one port of a physical nic into another
is much less clear. That's a 'saturate pci' demo.

> Sorry if I am stirring the hornets nest here.  I just finished the DMA
> API changes to allow DMA page reuse with writable pages on ixgbe, and
> igb/i40e/i40evf should be getting the same treatment shortly.  So now
> I am looking forward at XDP and just noticing a few things that didn't
> seem to make sense given the work I was doing to enable the API.

did I miss the patches that already landed ?
I don't see any recycling done by i40e_clean_tx_irq or by
ixgbe_clean_tx_irq ...



Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY

2017-02-24 Thread Alexei Starovoitov
On Wed, Feb 22, 2017 at 11:38:49AM -0500, Willem de Bruijn wrote:
> 
> * Limitations / Known Issues
> 
> - PF_INET6 is not yet supported.

We have struggled so far to make it work in our setups, which are ipv6 only.
Looking at the patches it seems the code should just work.
What exactly is missing?

Great stuff. Looking forward to net-next reopening :)



Re: [PATCH net 3/6] net/mlx5e: Do not reduce LRO WQE size when not using build_skb

2017-02-22 Thread Alexei Starovoitov
On Wed, Feb 22, 2017 at 7:20 AM, Saeed Mahameed  wrote:
> From: Tariq Toukan 
>
> When rq_type is Striding RQ, no room of SKB_RESERVE is needed
> as SKB allocation is not done via build_skb.
>
> Fixes: e4b85508072b ("net/mlx5e: Slightly reduce hardware LRO size")
> Signed-off-by: Tariq Toukan 
> Signed-off-by: Saeed Mahameed 

Why is this one a bug fix?
It sounds like an optimization from the commit log.


Re: Questions on XDP

2017-02-18 Thread Alexei Starovoitov
On Sat, Feb 18, 2017 at 10:18 AM, Alexander Duyck
 wrote:
>
>> XDP_DROP does not require having one page per frame.
>
> Agreed.

Why do you think so?
xdp_drop is targeting ddos, where in the good case
all traffic is passed up and in the bad case
most of the traffic is dropped, but good traffic still needs
to be serviced by the layers after, like other xdp
programs and the stack.
Say ixgbe+xdp goes with 2k per packet;
very soon we will have a bunch of half pages
sitting in the stack and the other halves requiring
complex refcnting, making the actual
ddos mitigation ineffective and forcing the nic to drop packets
because it runs out of buffers. Why complicate things?
The packet per page approach is simple and effective.
Virtio is different: there we don't have hw that needs
to have buffers ready for dma.

> Looking at the Mellanox way of doing it I am not entirely sure it is
> useful.  It looks good for benchmarks but that is about it.  Also I

It's the opposite. It already runs very nicely in production.
In real life it's always a combination of xdp_drop, xdp_tx and
xdp_pass actions.
Sounds like ixgbe wants to do things differently because
of not-invented-here. That new approach may turn
out to be good or bad, but why risk it?
The mlx4 approach works.
mlx5 has a few issues though, because page recycling
was done too simplistically. A generic page pool/recycling
that all drivers will use should solve that, I hope.
Is the proposal to have a generic split-page recycler?
How is that going to work?

> don't see it extending out to the point that we would be able to
> exchange packets between interfaces which really seems like it should
> be the ultimate goal for XDP_TX.

we don't have a use case for multi-port xdp_tx,
but I'm not objecting to doing it in general.
Just right now I don't see a need to complicate
drivers to do so.

> It seems like eventually we want to be able to peel off the buffer and
> send it to something other than ourselves.  For example it seems like
> it might be useful at some point to use XDP to do traffic
> classification and have it route packets between multiple interfaces
> on a host and it wouldn't make sense to have all of them map every
> page as bidirectional because it starts becoming ridiculous if you
> have dozens of interfaces in a system.

Dozens of interfaces? Like a single nic with a dozen ports?
Or many nics with many ports on the same system?
Are you trying to build a switch out of x86?
I don't think it's realistic to have a multi-terabit x86 box.
Is it all because of dpdk/6wind demos?
I saw how dpdk was bragging that they can saturate
the pcie bus. So? Why is this useful?
Why would anyone care to put a bunch of nics
into x86 and demonstrate that the bandwidth of pcie is now
a limiting factor?

> Also as far as the one page per frame it occurs to me that you will
> have to eventually deal with things like frame replication.

... only in cases where one needs to demo a multi-port
bridge with lots of nics in one x86 box.
I don't see the practicality of such a setup and I think
that copying the full page every time xdp needs to
broadcast is preferable to doing atomic refcnting
that will slow down the main case. Broadcast is a slow path.

My strong belief is that xdp should not care about
niche architectures. It was never meant to be a solution
for everyone and for all use cases.
If xdp sucks on powerpc, so be it.
cpus with 64k pages are doomed. We should
not sacrifice performance on x86 because of ppc.
I think it was a mistake that ixgbe chose to do that
in the past, when mb()s were added because
of powerpc and it took years to introduce dma_mb()
and return performance to good levels.
btw, the dma_mb work was awesome.
In xdp I don't want to make such trade-offs.
Really only the x86 and arm64 archs matter today.
Everything else is best effort.


Re: [PATCH net-next] bpf: return errno -ENOMEM when exceeding RLIMIT_MEMLOCK

2017-02-21 Thread Alexei Starovoitov
On Mon, Feb 20, 2017 at 05:25:58PM +0100, Jesper Dangaard Brouer wrote:
> On Mon, 20 Feb 2017 16:57:34 +0100
> Daniel Borkmann  wrote:
> 
> > On 02/20/2017 04:35 PM, Jesper Dangaard Brouer wrote:
> > > It is confusing users of samples/bpf that exceeding the resource
> > > limits for RLIMIT_MEMLOCK result in an "Operation not permitted"
> > > message.  This is due to bpf limits check return -EPERM.
> > >
> > > Instead return -ENOMEM, like most other users of this API.
> > >
> > > Fixes: aaac3ba95e4c ("bpf: charge user for creation of BPF maps and 
> > > programs")
> > > Fixes: 6c9059817432 ("bpf: pre-allocate hash map elements")
> > > Fixes: 5ccb071e97fb ("bpf: fix overflow in prog accounting")  
> > 
> > Btw, last one just moves the helper so fixes doesn't really apply
> > there, but apart from that this is already uapi exposed behavior
> > like this for ~1.5yrs, so unfortunately too late to change now. I
> > think the original intention (arguably confusing in this context)
> > was that user doesn't have (rlimit) permission to allocate this
> > resource.
> 
> This is obviously confusing end-users, thus it should be fixed IMHO.

I don't think it's confusing and I think EPERM makes
the most sense as a return code in such a situation.
There is also code in iovisor/bcc that specifically looks
for EPERM to adjust the ulimit.
Maybe it's not documented properly, but that's a different story.



Re: Questions on XDP

2017-02-20 Thread Alexei Starovoitov
On Mon, Feb 20, 2017 at 08:00:57PM -0800, Alexander Duyck wrote:
> 
> I assumed "toy Tx" since I wasn't aware that they were actually
> allowing writing to the page.  I think that might work for the XDP_TX
> case, 

Take a look at samples/bpf/xdp_tx_iptunnel_kern.c
It's a close enough approximation of a load balancer.
The packet header is rewritten by the bpf program.
That's where the dma_bidirectional requirement came from.
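
For illustration, a minimal sketch of that rewrite-then-XDP_TX pattern
(not the actual xdp_tx_iptunnel program): swap the Ethernet MACs and
bounce the frame back out of the same port.

#include <uapi/linux/bpf.h>
#include <uapi/linux/if_ether.h>
#include "bpf_helpers.h"

SEC("xdp")
int xdp_mac_swap(struct xdp_md *ctx)
{
        void *data = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        struct ethhdr *eth = data;
        unsigned char tmp[ETH_ALEN];

        if (data + sizeof(*eth) > data_end)
                return XDP_DROP;

        __builtin_memcpy(tmp, eth->h_dest, ETH_ALEN);
        __builtin_memcpy(eth->h_dest, eth->h_source, ETH_ALEN);
        __builtin_memcpy(eth->h_source, tmp, ETH_ALEN);
        return XDP_TX;
}
char _license[] SEC("license") = "GPL";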

> but the case where encap/decap is done and then passed up to the
> stack runs the risk of causing data corruption on some architectures
> if they unmap the page before the stack is done with the skb.  I
> already pointed out the issue to the Mellanox guys and that will
> hopefully be addressed shortly.

Sure. The path where the xdp program does decap and passes to the stack
is not finished. To make it work properly we need to expose at least
the csum complete field to the program.

> As far as the Tx I need to work with John since his current solution
> doesn't have any batching support that I saw and that is a major
> requirement if we want to get above 7 Mpps for a single core.

I think we need to focus on both Mpps and 'perf report' together.
Single core doing 7Mpps and scaling linearly to 40Gbps line rate
is much better than single core doing 20Mpps and not scaling at all.
There could be sw inefficiencies and hw limits, hence 'perf report'
is a must-have when discussing numbers.

I think long term we will be able to agree on a set of real life
use cases and corresponding set of 'blessed' bpf programs and
create a table of nic, driver, use case 1, 2, 3, single core, multi.
Making level playing field for all nic vendors is one of the goals.

Right now we have xdp1, xdp2 and xdp_tx_iptunnel benchmarks.
They are approximations of ddos, router, load balancer
use cases. They obviously need work to get to 'blessed' shape,
but imo quite good to do vendor vs vendor comparison for
the use cases that we care about.
Eventually nic->vm and vm->vm use cases via xdp_redirect should
be added to such set of 'blessed' benchmarks too.
I think so far we avoided falling into trap of microbenchmarking wars.

> >>> 3.  Should we support scatter-gather to support 9K jumbo frames
> >>> instead of allocating order 2 pages?
> >>
> >> we can, if main use case of mtu < 4k doesn't suffer.
> >
> > Agreed I don't think it should degrade <4k performance. That said
> > for VM traffic this is absolutely needed. Without TSO enabled VM
> > traffic is 50% slower on my tests :/.
> >
> > With tap/vhost support for XDP this becomes necessary. vhost/tap
> > support for XDP is on my list directly behind ixgbe and redirect
> > support.
> 
> I'm thinking we just need to turn XDP into something like a
> scatterlist for such cases.  It wouldn't take much to just convert the
> single xdp_buf into an array of xdp_buf.

datapath has to be fast. If xdp program needs to look at all
bytes of the packet the performance is gone. Therefore I don't see
a need to expose an array of xdp_buffs to the program.
The alternative would be to add a hidden field to xdp_buff that keeps
SG in some form and data_end will point to the end of linear chunk.
But you cannot put only headers into linear part. If program
needs to see something that is after data_end, it will drop the packet.
So it's not at all split-header model. 'data-data_end' chunk
should cover as much as possible. We cannot get into sg games
while parsing the packet inside the program. Everything
that program needs to see has to be in the linear part.
I think such a model will work well for the jumbo packet case,
but I don't think it will work for a VM's tso.
For an xdp program to pass a csum_partial packet from a vm into a nic
in a meaningful way it needs to gain knowledge of ip, l4, csum
and a bunch of other meta-data fields that the nic needs to do TSO.
I'm not sure it's worth exposing all that to xdp. Instead,
can we make the VM do the segmentation, so that the xdp program doesn't
need to deal with gso packets?
I think the main cost is packet csum and for this something
like xdp_tx_with_pseudo_header() helper can work.
xdp program will see individual packets with pseudo header
and hw nic will do final csum over packet.
The program will see csum field as part of xdp_buff
and if it's csum_partial it will use xdp_tx_with_pseudo_header()
to transmit the packet instead of xdp_redirect or xdp_tx.
The call may look like:
xdp_tx_with_pseudo_header(xdp_buff, ip_off, tcp_off);
and these two offsets the program will compute based on
the packet itself and not metadata that came from VM.
In other words I'd like xdp program to deal with raw packets
as much as possible. pseudo header is part of the packet.
So the only metadata program needs is whether packet has
pseudo header or not.
Saying it differently: either the packet came from a physical
nic and in xdp_buff we have a csum field (from the hw rx descriptor)
with csum complete meaning, or the packet came from a VM, the
pseudo header is populated and xdp_buff->csum is empty.
From physical nic the 

Re: [PATCH v2 net] bpf: introduce BPF_F_ALLOW_OVERRIDE flag

2017-02-11 Thread Alexei Starovoitov

On 2/11/17 9:47 PM, Tejun Heo wrote:

On Fri, Feb 10, 2017 at 08:28:24PM -0800, Alexei Starovoitov wrote:

If BPF_F_ALLOW_OVERRIDE flag is used in BPF_PROG_ATTACH command
to the given cgroup the descendent cgroup will be able to override
effective bpf program that was inherited from this cgroup.
By default it's not passed, therefore override is disallowed.

Examples:
1.
prog X attached to /A with default
prog Y fails to attach to /A/B and /A/B/C
Everything under /A runs prog X

2.
prog X attached to /A with allow_override.
prog Y fails to attach to /A/B with default (non-override)
prog M attached to /A/B with allow_override.
Everything under /A/B runs prog M only.

3.
prog X attached to /A with allow_override.
prog Y fails to attach to /A with default.
The user has to detach first to switch the mode.

In the future this behavior may be extended with a chain of
non-overridable programs.

Also fix the bug where detach from cgroup where nothing is attached
was not throwing error. Return ENOENT in such case.

Add several testcases and adjust libbpf.

Fixes: 3007098494be ("cgroup: add support for eBPF programs")
Signed-off-by: Alexei Starovoitov <a...@kernel.org>


The cgroup part looks good to me.  Please feel free to add

Acked-by: Tejun Heo <t...@kernel.org>

One question tho.  Is there a specific reason to disallow attaching
!overridable program under an overridable one?  Isn't disallowing
attaching programs if the closest ancestor is !overridable enough?


Yes. I felt the same way and the v1 patch did that, but
as Andy pointed out the stricter rules will be easier to
extend later, so this v2 patch is using that approach.

Andy, ping !



Re: linux-next: build failure after merge of the net tree

2017-02-13 Thread Alexei Starovoitov

On 2/13/17 2:12 PM, Stephen Rothwell wrote:

Hi all,

After merging the net tree, today's linux-next build (powerpc64le perf)
failed like this:

Warning: tools/include/uapi/linux/bpf.h differs from kernel
bpf.c: In function 'bpf_prog_attach':
bpf.c:180:6: error: 'union bpf_attr' has no member named 'attach_flags'; did 
you mean 'map_flags'?
   attr.attach_flags  = flags;
   ^

Caused by commit

   7f677633379b ("bpf: introduce BPF_F_ALLOW_OVERRIDE flag")

Unfortunately, the perf header files are kept separate from the kernel
header files proper and are not automatically copied over :-(

I have applied the following build fix patch for today.


Yes. Thanks for the fix. It's more than a merge conflict.
I should have added it in the first place. Now we have both
perf and samples/bpf depending on tools/lib/bpf and I simply
forgot about this dependency, since building perf
is not my typical workflow.

Joe,
can you think of a way to make tools/lib/bpf
use tools/include only?
Right now we just pull tools/lib/bpf/bpf.o into samples/bpf/Makefile
and that's the hack that caused this issue.
samples/bpf/ needs to depend on libbpf.a properly.

For the patch:
Acked-by: Alexei Starovoitov <a...@kernel.org>

Dave,
can you apply it to 'net' tree,
since the patch properly made it into patchwork ?
or I can resubmit it.
Thanks!


From: Stephen Rothwell <s...@canb.auug.org.au>
Date: Tue, 14 Feb 2017 08:22:20 +1100
Subject: [PATCH] bpf: kernel header files need to be copied into the tools 
directory

Signed-off-by: Stephen Rothwell <s...@canb.auug.org.au>
---
  tools/include/uapi/linux/bpf.h | 7 +++
  1 file changed, 7 insertions(+)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 0eb0e87dbe9f..d2b0ac799d03 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -116,6 +116,12 @@ enum bpf_attach_type {

  #define MAX_BPF_ATTACH_TYPE __MAX_BPF_ATTACH_TYPE

+/* If BPF_F_ALLOW_OVERRIDE flag is used in BPF_PROG_ATTACH command
+ * to the given target_fd cgroup the descendent cgroup will be able to
+ * override effective bpf program that was inherited from this cgroup
+ */
+#define BPF_F_ALLOW_OVERRIDE   (1U << 0)
+
  #define BPF_PSEUDO_MAP_FD 1

  /* flags for BPF_MAP_UPDATE_ELEM command */
@@ -171,6 +177,7 @@ union bpf_attr {
__u32   target_fd;  /* container object to attach 
to */
__u32   attach_bpf_fd;  /* eBPF program to attach */
__u32   attach_type;
+   __u32   attach_flags;
};
  } __attribute__((aligned(8)));






Re: [PATCH] bpf: reduce compiler warnings by adding fallthrough comments

2017-02-13 Thread Alexei Starovoitov
On Tue, Feb 14, 2017 at 12:18:05AM +0100, Daniel Borkmann wrote:
> >kernel/bpf/verifier.c:2019:24: warning: this statement may fall through 
> >[-Wimplicit-fallthrough=]
> >false_reg->min_value = 0;
> >~^~~
> >
> >Reported-by: David Binderman <dcb...@hotmail.com>
> >Signed-off-by: Alexander Alemayhu <alexan...@alemayhu.com>
> 
> These fall-through comments are fine for net-next tree.
> 
> Acked-by: Daniel Borkmann <dan...@iogearbox.net>

lgtm as well
Acked-by: Alexei Starovoitov <a...@kernel.org>



Re: [RFC PATCH net] bpf: introduce BPF_F_ALLOW_OVERRIDE flag

2017-02-10 Thread Alexei Starovoitov
On Thu, Feb 09, 2017 at 10:59:23AM -0800, Alexei Starovoitov wrote:
> Andy,
> does it all make sense?

Andy, ping.



Re: [RFC PATCH net-next 1/2] bpf: Save original ebpf instructions

2017-02-09 Thread Alexei Starovoitov
On Thu, Feb 09, 2017 at 12:25:37PM +0100, Daniel Borkmann wrote:
>
> Correct the overlap both use-cases share is the dump itself. It needs
> to be in such a condition for CRIU, that it can be reloaded eventually,

I don't think it makes sense to drag criu into this discussion.
I expressed my take on criu in the other thread. tldr:
bpf is a graph of dependencies between programs, maps, applications
and kernel events. So to save/restore this graph one would need to solve
very hard problems of stopping multiple applications at once,
stopping kernel events and so on. I don't think it's worth going that route.

> >- Alternatively, the attach is always done by passing the FD as an
> >attribute, so the netlink dump could attach an fd to the running
> >program, return the FD as an attribute and the bpf program is retrieved
> >from the fd. This is a major departure from how dumps work with
> >processing attributes and needing to attach open files to a process will
> >be problematic. Integrating the bpf into the dump is a natural fit.
> 
> Right, I think it's a natural fit to place it into the various points/
> places where it's attached to, as we're stuck with that anyway for the
> attachment part. Meaning in cls_bpf, it would go as a mem blob into the
> netlink attribute. There would need to be a common BPF core helper that
> the various subsystem users call in order to generate that mentioned
> output format, and that resulting mem blob is then stuck into either
> nlattr, mem provided by syscall, etc.

I think if we use ten different ways to dump it, it will
complicate the user space tooling.
I'd rather see one way of doing it via a new syscall command:
pass a prog_fd and it will return the insns in some form.

Here is more concrete proposal:
- add two flags to PROG_LOAD:
  BPF_F_ENFORCE_STATELESS - it will require verifier to check that program
  doesn't use maps and any other global state (doesn't use bpf_redirect,
  doesn't use bpf_set_tunnel_key and tunnel_opt)
  This will ensure that program is stateless and pure instruction
  dump is meaningful. For 'ip vrf' case it will be enough.
  BPF_F_ALLOW_DUMP - it will save original program, so in the common
  case we wouldn't need to waste memory to save program

- add new bpf syscall command BPF_PROG_DUMP
  input: prog_fd, output: insns
  it will work right away with OBJ_GET command and the user will
  be able to dump stateless programs pinned in bpffs

- add appropriate interfaces for different attach points to return prog_fd:
  for cgroup it will be new BPF_PROG_GET command.
  for socket it will be new getsockopt. (Actually BPF_PROG_GET can work
  for sockets too and probably better).
  for xdp and tc we need to find a way to return prog_fd.
  netlink is no good, since it would be very weird to install fd
  and return it async in netlink body. We can simply say that
  whoever wants to dump programs need to first pin them in bpffs
  and then attach to tc/xdp. iproute2 already does it anyway.
  Realistically tc/xdp programs are almost always stateful, so
  dump won't be available for them anyway.

If in the future we discover a magic way of restoring maps,
we can relax the prog loading part and allow BPF_F_ALLOW_DUMP to be
used independently of the BPF_F_ENFORCE_STATELESS flag.
(In the beginning these two flags would need to be used together.)
Also we'll be able to extend BPF_PROG_DUMP independently
of attachment points. If we introduce something like global
variable section or read-only string section (like some folks asked)
we can dump it back. Insns will not be the only section.
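
To make the shape concrete, a purely illustrative sketch of the uapi
pieces described above (none of these names exist in the kernel; they
only mirror the proposed flags and command):

/* hypothetical PROG_LOAD flags mirroring the proposal above */
#define BPF_F_ENFORCE_STATELESS (1U << 0)       /* reject maps and other global state */
#define BPF_F_ALLOW_DUMP        (1U << 1)       /* keep original insns for later dump */

/* hypothetical BPF_PROG_DUMP command: input prog_fd, output insns */
struct bpf_prog_dump_attr {
        __u32   prog_fd;        /* program to dump */
        __u32   insn_cnt;       /* in: buffer capacity, out: insns written */
        __u64   insns;          /* user pointer to a struct bpf_insn array */
};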

My main point is that right now I'd really like to avoid
dealing with stateful bits (maps, etc), since there is no
good way of dumping them, while stateless will be enough
for 'ip vrf' and simple programs.

Thoughts?



Re: [RFC PATCH net] bpf: introduce BPF_F_ALLOW_OVERRIDE flag

2017-02-10 Thread Alexei Starovoitov

On 2/10/17 1:38 PM, Andy Lutomirski wrote:

On Thu, Feb 9, 2017 at 10:59 AM, Alexei Starovoitov <a...@fb.com> wrote:

If BPF_F_ALLOW_OVERRIDE flag is used in BPF_PROG_ATTACH command
to the given cgroup the descendent cgroup will be able to override
effective bpf program that was inherited from this cgroup.
By default it's not passed, therefore override is disallowed.

Examples:
1.
prog X attached to /A with default
prog Y fails to attach to /A/B and /A/B/C
Everything under /A runs prog X

2.
prog X attached to /A with ALLOW_OVERRIDE
prog Y attached to /A/B with default. Everything under /A/B runs prog Y


I think that, for ease of future extension, Y should also need
ALLOW_OVERRIDE.  Otherwise, when non-overridable hooks can stack,
there could be confusion as to whether Y should override something or
should stack.


I see. Fair enough. It's indeed easier for future extensions.


2.
we can add another flag to reverse this call order too.
Instead of calling the progs from child to parent, do parent to child.


I think the order should depend on the hook.  Hooks for
process-initiated actions (egress, socket creation) should run
innermost first and hooks for outside actions (ingress) should be
outermost first.


There are use cases where both ingress and egress
would want both orderings. Monitoring, for example, would want to
see the bytes that the app wants to send and it would want
to see the bytes that are actually being sent. So if something
in the middle wants to drop due to whatever conditions,
the monitoring needs to be the first and the last in the prog chain.
That's one of the use cases for 'attach_priority'.
Some high priorities can be reserved for debugging and so on.


Andy,
does it all make sense?


Yes with the caveat above.


great!


Do you still insist on submitting this patch officially?


I'm not sure what you mean.


it's an RFC. In netdev we never apply rfc patches.


or you're ok keeping it overridable for now.


I really think the default should change for 4.10.  People are going


fine. will respin with requested change.




[PATCH v2 net] bpf: introduce BPF_F_ALLOW_OVERRIDE flag

2017-02-10 Thread Alexei Starovoitov
If BPF_F_ALLOW_OVERRIDE flag is used in BPF_PROG_ATTACH command
to the given cgroup the descendent cgroup will be able to override
effective bpf program that was inherited from this cgroup.
By default it's not passed, therefore override is disallowed.

Examples:
1.
prog X attached to /A with default
prog Y fails to attach to /A/B and /A/B/C
Everything under /A runs prog X

2.
prog X attached to /A with allow_override.
prog Y fails to attach to /A/B with default (non-override)
prog M attached to /A/B with allow_override.
Everything under /A/B runs prog M only.

3.
prog X attached to /A with allow_override.
prog Y fails to attach to /A with default.
The user has to detach first to switch the mode.

In the future this behavior may be extended with a chain of
non-overridable programs.
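
For reference, a minimal userspace sketch of example 2 using the raw syscall
and the new attach_flags field (cgroup paths and prog fds are placeholders,
error handling omitted):

union bpf_attr attr = {};
int cg_a = open("/sys/fs/cgroup/A", O_RDONLY | O_DIRECTORY);
int cg_b = open("/sys/fs/cgroup/A/B", O_RDONLY | O_DIRECTORY);

/* prog X on /A, descendent cgroups may override it */
attr.target_fd = cg_a;
attr.attach_bpf_fd = prog_x_fd;
attr.attach_type = BPF_CGROUP_INET_EGRESS;
attr.attach_flags = BPF_F_ALLOW_OVERRIDE;
syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));

/* prog M on /A/B, also overridable; per example 2 above, attaching here
 * without the flag would fail since the parent was attached with
 * override allowed
 */
attr.target_fd = cg_b;
attr.attach_bpf_fd = prog_m_fd;
attr.attach_flags = BPF_F_ALLOW_OVERRIDE;
syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));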

Also fix the bug where detaching from a cgroup with nothing attached
was not returning an error. Return ENOENT in such case.

Add several testcases and adjust libbpf.

Fixes: 3007098494be ("cgroup: add support for eBPF programs")
Signed-off-by: Alexei Starovoitov <a...@kernel.org>
---
v1->v2: disallowed overridable->non_override transition as suggested by Andy
added tests and fixed double detach bug

Andy, Daniel,
please review and ack quickly, so it can land into 4.10.
---
 include/linux/bpf-cgroup.h   | 13 
 include/uapi/linux/bpf.h |  7 +
 kernel/bpf/cgroup.c  | 59 +++---
 kernel/bpf/syscall.c | 20 
 kernel/cgroup.c  |  9 +++---
 samples/bpf/test_cgrp2_attach.c  |  2 +-
 samples/bpf/test_cgrp2_attach2.c | 68 +---
 samples/bpf/test_cgrp2_sock.c|  2 +-
 samples/bpf/test_cgrp2_sock2.c   |  2 +-
 tools/lib/bpf/bpf.c  |  4 ++-
 tools/lib/bpf/bpf.h  |  3 +-
 11 files changed, 151 insertions(+), 38 deletions(-)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index 92bc89ae7e20..c970a25d2a49 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -21,20 +21,19 @@ struct cgroup_bpf {
 */
struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE];
struct bpf_prog __rcu *effective[MAX_BPF_ATTACH_TYPE];
+   bool disallow_override[MAX_BPF_ATTACH_TYPE];
 };
 
 void cgroup_bpf_put(struct cgroup *cgrp);
 void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent);
 
-void __cgroup_bpf_update(struct cgroup *cgrp,
-struct cgroup *parent,
-struct bpf_prog *prog,
-enum bpf_attach_type type);
+int __cgroup_bpf_update(struct cgroup *cgrp, struct cgroup *parent,
+   struct bpf_prog *prog, enum bpf_attach_type type,
+   bool overridable);
 
 /* Wrapper for __cgroup_bpf_update() protected by cgroup_mutex */
-void cgroup_bpf_update(struct cgroup *cgrp,
-  struct bpf_prog *prog,
-  enum bpf_attach_type type);
+int cgroup_bpf_update(struct cgroup *cgrp, struct bpf_prog *prog,
+ enum bpf_attach_type type, bool overridable);
 
 int __cgroup_bpf_run_filter_skb(struct sock *sk,
struct sk_buff *skb,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index e5b8cf16cbaf..69f65b710b10 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -116,6 +116,12 @@ enum bpf_attach_type {
 
 #define MAX_BPF_ATTACH_TYPE __MAX_BPF_ATTACH_TYPE
 
+/* If BPF_F_ALLOW_OVERRIDE flag is used in BPF_PROG_ATTACH command
+ * to the given target_fd cgroup the descendent cgroup will be able to
+ * override effective bpf program that was inherited from this cgroup
+ */
+#define BPF_F_ALLOW_OVERRIDE   (1U << 0)
+
 #define BPF_PSEUDO_MAP_FD  1
 
 /* flags for BPF_MAP_UPDATE_ELEM command */
@@ -171,6 +177,7 @@ union bpf_attr {
__u32   target_fd;  /* container object to attach 
to */
__u32   attach_bpf_fd;  /* eBPF program to attach */
__u32   attach_type;
+   __u32   attach_flags;
};
 } __attribute__((aligned(8)));
 
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index a515f7b007c6..da0f53690295 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -52,6 +52,7 @@ void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup 
*parent)
e = rcu_dereference_protected(parent->bpf.effective[type],
  lockdep_is_held(&cgroup_mutex));
rcu_assign_pointer(cgrp->bpf.effective[type], e);
+   cgrp->bpf.disallow_override[type] = 
parent->bpf.disallow_override[type];
}
 }
 
@@ -82,30 +83,63 @@ void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup 
*parent)
  *
  * Must be called with cgroup_mutex held.
  */
-void __cgroup_bpf_update(struct cgroup *cgrp,
-struct cgroup *parent,

[RFC PATCH net] bpf: introduce BPF_F_ALLOW_OVERRIDE flag

2017-02-09 Thread Alexei Starovoitov
If BPF_F_ALLOW_OVERRIDE flag is used in BPF_PROG_ATTACH command
to the given cgroup the descendent cgroup will be able to override
effective bpf program that was inherited from this cgroup.
By default it's not passed, therefore override is disallowed.

Examples:
1.
prog X attached to /A with default
prog Y fails to attach to /A/B and /A/B/C
Everything under /A runs prog X

2.
prog X attached to /A with ALLOW_OVERRIDE
prog Y attached to /A/B with default. Everything under /A/B runs prog Y
prog M attached to /A/C with default. Everything under /A/C runs prog M
prog N fails to attach to /A/C/foo.
prog L attached to /A/D with ALLOW_OVERRIDE.
  Events under /A/D run prog L and can be overridden in /A/D/foo

/A still runs prog X
prog K attached to /A with ALLOW_OVERRIDE.
  /A now runs prog K while /A/B runs prog Y and /A/C runs prog M
prog J attached to /A with default.
  /A now runs prog J while /A/B runs prog Y.
  /A/B cannot be changed anymore (since parent disallows override),
  but can be cleared. After detach /A/B will run prog J.

Signed-off-by: Alexei Starovoitov <a...@kernel.org>
---

Below are a few proposals for future extensions; they are not definitive:
1.
we can extend the behavior with a chain of non-overridable like:
prog X attached to /A with default
prog Y attached to /A/B with default
The events scoped by /A/B will run program Y first and if it returns 1
prog X will be run. For the control app there will be an illusion
that it owns cgroup /A/B with a single prog, and detach from /A/B will delete
prog Y unambiguously.
Another control app that attached to /A will also see its prog X running,
unless prog Y filtered the event out, which means (from X's point of view)
that the event didn't happen.
Attaching two programs to /A is not allowed.
We would need to combine prog X and Y into an array to avoid link list
traversal for performance reasons, but that's an implementation detail.

2.
we can add another flag to reverse this call order too.
Instead of calling the progs from child to parent, do parent to child.

3.
we can extend the api further by adding 'attach_priority' flag as:
prog X attach /A prio=20
prog Y attach /A prio=10
prog N attach /A/B prio=20
prog M attach /A/B prio=10
in /A/B the sequence of progs will be M -> N -> Y -> X

prog X attach /A prio=10 and prog Y attach /A prio=10 will be disallowed,
but attach with the same prio to different cgroups is ok.
If attached with prio, detach must specify prio as well.
Attach transitions:
allow_override -> disable_override/single_prog = ok
allow_override -> prio (multi prog at the same cgroup) = ok
disable_override/single_prog -> prio = ok (with respect to child/parent order)
prio -> allow_override = fail
prio -> disable_override/single_prog = fail

***
To summarize: the key to not breaking the abi is to preserve user space
expectations. Right now (without this patch) we have progs
overridable by any descendent. Which means that a control plane
application has to expect that something may overwrite its program.
Hence any new flag will not break this expectation
(overridable == control plane cannot assume that its attached
programs will run in a hostile environment)
and that's the main reason why I don't think we need to change anything now
and hence this patch is an RFC.

Adding an 'allow_override' flag and changing the default to
'override disallowed' is also fine from the api extensibility point of view.
Since for the 'override disallowed' case the control plane app will
be expecting that other processes will not override its program
in the descendent cgroups and that it will keep running. This would have
to be preserved.
That's why the future api extensions (like #1 above) would have to do
the program chaining to preserve the 'disallow override' flag expectations.
So imo it's safer to keep overridable as it is today, since this flag
adds a bit more restrictions to the future extensions
compared to everything overridable.

Andy,
does it all make sense?
Do you still insist on submitting this patch officially?
or you're ok keeping it overridable for now.
Note that in the future it will not be possible to change the default,
but a 'disallow_override' flag can be added at any time:
change the default in this patch and it can be applied for 4.12 or later.
---
 include/linux/bpf-cgroup.h | 13 ++---
 include/uapi/linux/bpf.h   |  7 +++
 kernel/bpf/cgroup.c| 25 +++--
 kernel/bpf/syscall.c   | 20 ++--
 kernel/cgroup.c|  9 +
 5 files changed, 51 insertions(+), 23 deletions(-)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index 92bc89ae7e20..c970a25d2a49 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -21,20 +21,19 @@ struct cgroup_bpf {
 */
struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE];
struct bpf_prog __rcu *effective[MAX_BPF_ATTACH_TYPE];
+   bool disallow_override[MAX_BPF_ATTACH_TYPE];
 };
 
 void cgroup_bpf_put(struct cgroup *c

Re: [PATCH v3 1/2] bpf: add a longest prefix match trie map implementation

2017-01-18 Thread Alexei Starovoitov
On Wed, Jan 18, 2017 at 03:30:14PM +0100, David Herrmann wrote:
> Hi
> 
> On Sat, Jan 14, 2017 at 5:55 PM, Alexei Starovoitov <a...@fb.com> wrote:
> > Another alternative is to extend samples/bpf/map_perf_test
> > It has perf tests for most map types today (including lru)
> > and trie would be natural addition there.
> > I would prefer this latter option.
> 
> I hooked into gettid() and installed a simple kprobe bpf program that
> searches for an entry in an lpm trie. The bpf program does either 0,
> 1, 8, or 32 lookups in a row (always the same element). The trie has
> size 0, 1, or 8192. The data is below. The results vary by roughly 5%
> on every run.
> 
> A single gettid() syscall with an empty bpf program takes roughly
> 6.5us on my system. Lookups in empty tries take ~1.8us on first try,
> ~0.9us on retries. Lookups in tries with 8192 entries take ~7.1us (on
> the first _and_ any subsequent try).

thanks. please add them to commit log.

> https://gist.github.com/dvdhrm/4c90e61a1c39746d5c55ab9e0e29315e

looks good. please add it as an extra patch to this set.
and add
#pragma clang loop unroll(full)
before
+   for (i = 0; i < 8; ++i)
+   bpf_map_lookup_elem(&lpm_trie_map_alloc, &key);
to make sure it's unrolled regardless of the version of llvm.

> Trie-size: 0
> #Lookups: 0
> 0:lpm_perf kmalloc 9,230,321 events per sec
> -> 6.5us / syscall
> 
> Trie-size: 1
> #Lookups: 1
> 0:lpm_perf kmalloc 7,224,508 events per sec
> -> 8.3us / syscall
> 
> Trie-size: 1
> #Lookups: 8
> 0:lpm_perf kmalloc 4,152,740 events per sec
> -> 14.4us / syscall
> 
> Trie-size: 1
> #Lookups: 32
> 0:lpm_perf kmalloc 1,713,415 events per sec
> -> 35.0us / syscall
> 
> Trie-size: 8192
> #Lookups: 1
> 0:lpm_perf kmalloc 4,369,138 events per sec
> -> 13.7us / syscall
> 
> Trie-size: 8192
> #Lookups: 8
> 0:lpm_perf kmalloc 943,849 events per sec
> -> 63.6us / syscall
> 
> Trie-size: 8192
> #Lookups: 32
> 0:lpm_perf kmalloc 271,737 events per sec
> -> 220.8us / syscall

so it's the same lookup done 32 times per syscall.
Since the latency of a single lookup in the 8 vs 32 experiments
is pretty close, it means that the 32-lookup numbers
amortize the cost of the syscall pretty well and there is no need to go higher.
So I would use 32 as the loop count in stress_lpm_trie_map_alloc()
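
For the archive, a rough sketch of what such a kern-side stress function could
look like in map_perf_test_kern.c (the map definition and key layout here are
illustrative guesses, not the exact code from the gist above):

struct bpf_map_def SEC("maps") lpm_trie_map_alloc = {
	.type = BPF_MAP_TYPE_LPM_TRIE,
	.key_size = 8,			/* 4 byte prefixlen + 4 byte ipv4 addr */
	.value_size = sizeof(long),
	.max_entries = 10000,
	.map_flags = BPF_F_NO_PREALLOC,	/* lpm trie requires no-prealloc */
};

SEC("kprobe/sys_gettid")
int stress_lpm_trie_map_alloc(struct pt_regs *ctx)
{
	union {
		u32 b32[2];
		u8 b8[8];
	} key;
	unsigned int i;

	key.b32[0] = 32;	/* prefixlen */
	key.b8[4] = 192;
	key.b8[5] = 168;
	key.b8[6] = 0;
	key.b8[7] = 1;

#pragma clang loop unroll(full)
	for (i = 0; i < 32; ++i)
		bpf_map_lookup_elem(&lpm_trie_map_alloc, &key);

	return 0;
}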

Thanks!



Re: [PATCH] bpf: Fix test_lru_sanity5() in test_lru_map.c

2017-01-17 Thread Alexei Starovoitov

On 1/17/17 1:07 AM, Daniel Borkmann wrote:

On 01/17/2017 07:17 AM, Martin KaFai Lau wrote:

test_lru_sanity5() fails when the number of online cpus
is fewer than the number of possible cpus.  It can be
reproduced with qemu by using cmd args "--smp cpus=2,maxcpus=8".

The problem is the loop in test_lru_sanity5() is testing
'i' which is incorrect.

This patch:
1. Make sched_next_online() always return -1 if it cannot
find a next cpu to schedule the process.
2. In test_lru_sanity5(), the parent process does
sched_setaffinity() first (through sched_next_online())
and the forked process will inherit it according to
the 'man sched_setaffinity'.

Fixes: 5db58faf989f ("bpf: Add tests for the LRU bpf_htab")
Reported-by: Daniel Borkmann <dan...@iogearbox.net>
Signed-off-by: Martin KaFai Lau <ka...@fb.com>


Looks good, thanks for fixing!

Acked-by: Daniel Borkmann <dan...@iogearbox.net>

(Patch is against -net tree.)


Acked-by: Alexei Starovoitov <a...@kernel.org>



Re: [net PATCH] bpf: fix samples xdp_tx_iptunnel and tc_l2_redirect with fake KBUILD_MODNAME

2017-01-18 Thread Alexei Starovoitov
On Wed, Jan 18, 2017 at 05:19:00PM +0100, Jesper Dangaard Brouer wrote:
> Fix build errors for samples/bpf xdp_tx_iptunnel and tc_l2_redirect,
> when dynamic debugging is enabled (CONFIG_DYNAMIC_DEBUG) by defining a
> fake KBUILD_MODNAME.
> 
> Just like Daniel Borkmann fixed other samples/bpf in commit
> 96a8eb1eeed2 ("bpf: fix samples to add fake KBUILD_MODNAME").
> 
> Fixes: 12d8bb64e3f6 ("bpf: xdp: Add XDP example for head adjustment")
> Fixes: 90e02896f1a4 ("bpf: Add test for bpf_redirect to ipip/ip6tnl")
> Signed-off-by: Jesper Dangaard Brouer <bro...@redhat.com>

Thanks for the fix!
Acked-by: Alexei Starovoitov <a...@kernel.org>



Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2017-01-19 Thread Alexei Starovoitov
On Wed, Jan 18, 2017 at 06:29:22PM -0800, Andy Lutomirski wrote:
> I think it could work by making a single socket cgroup controller that
> handles all cgroup things that are bound to a socket.  Using

Such a 'socket cgroup controller' would limit usability of the feature
to sockets and force all other use cases like landlock to reinvent
the wheel, which is undesirable. Everyone will be
inventing new 'foo cgroup controllers', while all of them
are really bpf features. They are different bpf program
types that attach to different hooks and use cgroup for scoping.

> Having thought about this some more, I think that making it would
> alleviate a bunch of my concerns, as it would make the semantics if
> the capable() check were relaxed to ns_capable() be sane.  Here's what

here we're on the same page. For any meaningful discussion about
a 'bpf cgroup controller' to happen, bpf itself needs to become
delegatable in the cgroup sense. In other words BPF_PROG_TYPE_CGROUP*
program types need to become available to unprivileged users.
The only unprivileged prog type today is BPF_PROG_TYPE_SOCKET_FILTER.
To make it secure we severely limited its functionality.
All bpf advances since then (like new map types and verifier extensions)
were done for root only. If early on the split of priv vs unpriv bpf
features was 80/20, now it's close to 95/5. No work has been done to
make the socket filter type more powerful. It still has to use
slow-ish ld_abs skb access while tc/xdp have direct packet access.
Things like register value tracking are root only as well, and so on
and so forth.
We cannot just flip the switch and allow type_cgroup* for unpriv,
and I don't see any volunteers willing to do this work.
Until that happens there is no point coming up with designs
for a 'cgroup bpf controller'... whatever that means.

> I currently should happen before bpf+cgroup is enabled in a release:
> 
> 1. Make it netns-aware.  This could be as simple as making it only
> work in the root netns because then real netns awareness can be added
> later without breaking anything.  The current situation is bad in that
> network namespaces are just ignored and it's plausible that people
> will start writing user code that depends on having network namespaces
> be ignored.

nothing in bpf today is netns-aware and frankly I don't see
how cgroup+bpf has anything to do with netns.
For regular sockets+bpf we don't check netns.
When tcpdump opens a raw socket and attaches bpf there are no netns
checks, since the socket itself gives a scope for the program to run.
The same applies to cgroup+bpf: the cgroup gives a scope for the program.
But say we indeed add an 'if !root ns' check to the BPF_CGROUP_INET_*
hooks. Then if the hooks are used for security, the process
only needs to do setns() to escape the security sandbox. Obviously
broken semantics.

> 2. Make it inherit properly.  Inner cgroups should not override outer
> hooks.  As in (1), this could be simplified by preventing the same
> hook from being configured in both an ancestor and a descendent
> cgroup.  Then inheritance could be added for real later on.

In general it sounds fine, but it seems the reasoning for adding
such a restriction now (instead of later) is that a program chain could then
be added without breaking the abi, i.e. that if we don't restrict it now
there will be no way to add it later without breaking the abi?!
That is an incorrect assumption. We can add chaining and can add
'do not override' logic without breaking existing semantics.
For example, we can add a 'priority' field to
struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */}
which would indicate the relative position of multiple chained programs
applied to the same cgroup+hook pair. Multiple programs with
the same priority will be executed in the order they were added.
Programs with different priorities will execute in priority order.
Such a scheme will be more generic and flexible than earlier proposals.
Similarly we can add another flag that says 'disallow override
of the bpf program in descendent cgroups'. It's all trivial to do,
since the bpf syscall was designed for extensibility.
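
Purely as an illustration of what such an extension could look like (the
attach_flags and attach_priority fields below do not exist and the names are
made up for this discussion), the anonymous struct could later become:

	struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
		__u32	target_fd;	/* container object to attach to */
		__u32	attach_bpf_fd;	/* eBPF program to attach */
		__u32	attach_type;
		__u32	attach_flags;	/* e.g. 'disallow override in descendent cgroups' bit */
		__u32	attach_priority;/* relative position among chained programs */
	};

Old binaries that leave the new fields zeroed keep the current behavior,
which is exactly what CHECK_ATTR() in kernel/bpf/syscall.c is there to
guarantee.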

Also until bpf_type_cgroup* becomes unprivileged there is no reason
to add this 'priority/prog chaining' feature, since if it's
used for security the root can always override it no matter cgroup
hierarchy.

> 3. Give cgroup delegation support some serious thought.  In
> particular, if delegation would be straightforward but the current API
> wouldn't work well with delegation, then at least consider whether the
> API should change before it becomes stable so that two APIs don't need
> to be supported going forward.

please see example above. Since we went with bpf syscall (instead of
inextensible ioctl) we can add any new cgroup+bpf logic without
breaking current abi.
No matter how you twist it, cgroup+bpf is a bpf specific feature.



Re: [RFC PATCH 0/5] Add eBPF hooks for cgroups

2016-08-19 Thread Alexei Starovoitov
On Fri, Aug 19, 2016 at 11:19:41AM +0200, Pablo Neira Ayuso wrote:
> Hi Daniel,
> 
> On Wed, Aug 17, 2016 at 04:00:43PM +0200, Daniel Mack wrote:
> > I'd appreciate some feedback on this. Pablo has some remaining concerns
> > about this approach, and I'd like to continue the discussion we had
> > off-list in the light of this patchset.
> 
> OK, I'm going to summarize them here below:
> 
> * This new hook allows us to enforce an *administrative filtering
>   policy* that must be visible to anyone with CAP_NET_ADMIN. This is
>   easy to display in nf_tables as you can list the ruleset via the nft
>   userspace tool. Otherwise, in your approach if a misconfigured
>   filtering policy causes connectivity problems, I don't see how the
>   sysadmin is going to have an easy way to troubleshoot what is going on.
> 
> * Interaction with other software. As I could read from your patch,
>   what you propose will detach any previous existing filter. So I
>   don't see how you can attach multiple filtering policies from
>   different processes that don't cooperate each other. In nf_tables
>   this is easy since they can create their own tables so they keep their
>   ruleset in separate spaces. If the interaction is not OK, again the
>   sysadmin can very quickly debug this since the policies would be
>   visible via nf_tables ruleset listing.
> 
> * During the Netfilter Workshop, the main concern to add this new socket
>   ingress hook was that it is too specific. However this new hook in
>   the network stack looks way more specific since *it only
>   works for cgroups*.
> 
> So what I'm proposing goes in the direction of using the nf_tables
> infrastructure instead:

Pablo, if you were proposing to do cgroups+nft as well as cgroups+bpf
we could have had a much more productive discussion.
You were not participating in the cgroup+bpf design and are now bringing up
bogus points that make no sense to me. That's not helpful.
Please start another cgroups+nft thread and there we can discuss the
ways to do it cleanly without slowing down the stack.
netfilter hooks bloat the stack enough that some people compile them out.
If I were you, I'd focus on improving iptables/nft performance instead
of arguing about their coolness.

> Thanks for your patience on debating this!

I don't think you're sincere.



Re: [PATCH] samples/bpf: Add tunnel set/get tests.

2016-08-16 Thread Alexei Starovoitov
On Tue, Aug 16, 2016 at 07:03:01AM -0700, William Tu wrote:
> The patch creates sample code exercising bpf_skb_{set,get}_tunnel_key,
> and bpf_skb_{set,get}_tunnel_opt for GRE, VXLAN, and GENEVE.  A native
> tunnel device is created in a namespace to interact with a lwtunnel
> device out of the namespace, with metadata enabled.  The bpf_skb_set_*
> program is attached to tc egress and bpf_skb_get_* is attached to egress
> qdisc.  A ping between two tunnels is used to verify correctness and
> the result of bpf_skb_get_* printed by bpf_trace_printk.
> 
> Signed-off-by: William Tu <u9012...@gmail.com>

nice test. thanks!
Acked-by: Alexei Starovoitov <a...@kernel.org>



Re: [PATCH] bpf: update the comment about the length of analysis

2017-03-01 Thread Alexei Starovoitov
On Wed, Mar 01, 2017 at 04:25:51PM +0800, Gary Lin wrote:
> Commit 07016151a446 ("bpf, verifier: further improve search
> pruning") increased the limit of processed instructions from
> 32k to 64k, but the comment still mentioned the 32k limit.
> This commit updates the comment to reflect the change.
> 
> Cc: Alexei Starovoitov <a...@kernel.org>
> Cc: Daniel Borkmann <dan...@iogearbox.net>
> Signed-off-by: Gary Lin <g...@suse.com>

Acked-by: Alexei Starovoitov <a...@kernel.org>



[PATCH v2 net-next 6/6] samples/bpf: add sampleip example

2016-08-31 Thread Alexei Starovoitov
From: Brendan Gregg <bgr...@netflix.com>

sample instruction pointer and frequency count in a BPF map

Signed-off-by: Brendan Gregg <bgr...@netflix.com>
Signed-off-by: Alexei Starovoitov <a...@kernel.org>
---
 samples/bpf/Makefile|   4 +
 samples/bpf/sampleip_kern.c |  38 +
 samples/bpf/sampleip_user.c | 196 
 3 files changed, 238 insertions(+)
 create mode 100644 samples/bpf/sampleip_kern.c
 create mode 100644 samples/bpf/sampleip_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index a69cf9045285..12b7304d55dc 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -26,6 +26,7 @@ hostprogs-y += xdp1
 hostprogs-y += xdp2
 hostprogs-y += test_current_task_under_cgroup
 hostprogs-y += trace_event
+hostprogs-y += sampleip
 
 test_verifier-objs := test_verifier.o libbpf.o
 test_maps-objs := test_maps.o libbpf.o
@@ -54,6 +55,7 @@ xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
 test_current_task_under_cgroup-objs := bpf_load.o libbpf.o \
   test_current_task_under_cgroup_user.o
 trace_event-objs := bpf_load.o libbpf.o trace_event_user.o
+sampleip-objs := bpf_load.o libbpf.o sampleip_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -82,6 +84,7 @@ always += xdp1_kern.o
 always += xdp2_kern.o
 always += test_current_task_under_cgroup_kern.o
 always += trace_event_kern.o
+always += sampleip_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 
@@ -107,6 +110,7 @@ HOSTLOADLIBES_xdp1 += -lelf
 HOSTLOADLIBES_xdp2 += -lelf
 HOSTLOADLIBES_test_current_task_under_cgroup += -lelf
 HOSTLOADLIBES_trace_event += -lelf
+HOSTLOADLIBES_sampleip += -lelf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on 
cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc 
CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/sampleip_kern.c b/samples/bpf/sampleip_kern.c
new file mode 100644
index ..774a681f374a
--- /dev/null
+++ b/samples/bpf/sampleip_kern.c
@@ -0,0 +1,38 @@
+/* Copyright 2016 Netflix, Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include 
+#include 
+#include 
+#include 
+#include "bpf_helpers.h"
+
+#define MAX_IPS	8192
+
+struct bpf_map_def SEC("maps") ip_map = {
+   .type = BPF_MAP_TYPE_HASH,
+   .key_size = sizeof(u64),
+   .value_size = sizeof(u32),
+   .max_entries = MAX_IPS,
+};
+
+SEC("perf_event")
+int do_sample(struct bpf_perf_event_data *ctx)
+{
+   u64 ip;
+   u32 *value, init_val = 1;
+
+   ip = ctx->regs.ip;
+   value = bpf_map_lookup_elem(&ip_map, &ip);
+   if (value)
+   *value += 1;
+   else
+   /* E2BIG not tested for this example only */
+   bpf_map_update_elem(&ip_map, &ip, &init_val, BPF_NOEXIST);
+
+   return 0;
+}
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/sampleip_user.c b/samples/bpf/sampleip_user.c
new file mode 100644
index ..260a6bdd6413
--- /dev/null
+++ b/samples/bpf/sampleip_user.c
@@ -0,0 +1,196 @@
+/*
+ * sampleip: sample instruction pointer and frequency count in a BPF map.
+ *
+ * Copyright 2016 Netflix, Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "libbpf.h"
+#include "bpf_load.h"
+
+#define DEFAULT_FREQ   99
+#define DEFAULT_SECS   5
+#define MAX_IPS	8192
+#define PAGE_OFFSET	0xffff880000000000
+
+static int nr_cpus;
+
+static void usage(void)
+{
+   printf("USAGE: sampleip [-F freq] [duration]\n");
+   printf("   -F freq# sample frequency (Hertz), default 99\n");
+   printf("   duration   # sampling duration (seconds), default 5\n");
+}
+
+static int sampling_start(int *pmu_fd, int freq)
+{
+   int i;
+
+   struct perf_event_attr pe_sample_attr = {
+   .type = PERF_TYPE_SOFTWARE,
+   .freq = 1,
+   .sample_period = freq,
+   .config = PERF_COUNT_SW_CPU_CLOCK,
+   .inherit = 1,
+   };
+
+   for (i = 0; i < nr_cpus; i++) {
+   pmu_fd[i] = perf_event_open(&pe_sample_attr, -1 /* pid */, i,
+   -1 /* group_fd */, 0 /* flags */);
+   if (pmu_fd[i] < 0) {
+   fprintf(stderr, "ERROR: Initializing perf sampling\n");
+   return 1;
+   }
+   assert(ioctl(pmu_fd[

[PATCH v2 net-next 5/6] samples/bpf: add perf_event+bpf example

2016-08-31 Thread Alexei Starovoitov
The bpf program is called 50 times a second and does
hashmap[kern_stackid]++
Its primary purpose is to check that key bpf helpers like map lookup, update,
get_stackid, trace_printk and ctx access are all working.
It checks:
- PERF_COUNT_HW_CPU_CYCLES on all cpus
- PERF_COUNT_HW_CPU_CYCLES for current process and inherited perf_events to 
children
- PERF_COUNT_SW_CPU_CLOCK on all cpus
- PERF_COUNT_SW_CPU_CLOCK for current process
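
A stripped-down sketch of what the kern side of such a program boils down to
(simplified for illustration; not the exact trace_event_kern.c added by this
patch):

struct bpf_map_def SEC("maps") counts = {
	.type = BPF_MAP_TYPE_HASH,
	.key_size = sizeof(u32),
	.value_size = sizeof(u64),
	.max_entries = 10000,
};

struct bpf_map_def SEC("maps") stackmap = {
	.type = BPF_MAP_TYPE_STACK_TRACE,
	.key_size = sizeof(u32),
	.value_size = PERF_MAX_STACK_DEPTH * sizeof(u64),
	.max_entries = 10000,
};

SEC("perf_event")
int bpf_prog1(struct bpf_perf_event_data *ctx)
{
	int kern_stackid = bpf_get_stackid(ctx, &stackmap, 0);
	u64 *val, one = 1;
	u32 key;

	if (kern_stackid < 0)
		return 0;
	key = kern_stackid;

	/* hashmap[kern_stackid]++ */
	val = bpf_map_lookup_elem(&counts, &key);
	if (val)
		(*val)++;
	else
		bpf_map_update_elem(&counts, &key, &one, BPF_NOEXIST);
	return 0;
}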

Signed-off-by: Alexei Starovoitov <a...@kernel.org>
---
 samples/bpf/Makefile   |   4 +
 samples/bpf/bpf_helpers.h  |   2 +
 samples/bpf/bpf_load.c |   7 +-
 samples/bpf/trace_event_kern.c |  65 +
 samples/bpf/trace_event_user.c | 213 +
 5 files changed, 290 insertions(+), 1 deletion(-)
 create mode 100644 samples/bpf/trace_event_kern.c
 create mode 100644 samples/bpf/trace_event_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index db3cb061bfcd..a69cf9045285 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -25,6 +25,7 @@ hostprogs-y += test_cgrp2_array_pin
 hostprogs-y += xdp1
 hostprogs-y += xdp2
 hostprogs-y += test_current_task_under_cgroup
+hostprogs-y += trace_event
 
 test_verifier-objs := test_verifier.o libbpf.o
 test_maps-objs := test_maps.o libbpf.o
@@ -52,6 +53,7 @@ xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
 xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
 test_current_task_under_cgroup-objs := bpf_load.o libbpf.o \
   test_current_task_under_cgroup_user.o
+trace_event-objs := bpf_load.o libbpf.o trace_event_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -79,6 +81,7 @@ always += test_cgrp2_tc_kern.o
 always += xdp1_kern.o
 always += xdp2_kern.o
 always += test_current_task_under_cgroup_kern.o
+always += trace_event_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 
@@ -103,6 +106,7 @@ HOSTLOADLIBES_test_overhead += -lelf -lrt
 HOSTLOADLIBES_xdp1 += -lelf
 HOSTLOADLIBES_xdp2 += -lelf
 HOSTLOADLIBES_test_current_task_under_cgroup += -lelf
+HOSTLOADLIBES_trace_event += -lelf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on 
cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc 
CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index bbdf62a1e45e..90f44bd2045e 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -55,6 +55,8 @@ static int (*bpf_skb_get_tunnel_opt)(void *ctx, void *md, int 
size) =
(void *) BPF_FUNC_skb_get_tunnel_opt;
 static int (*bpf_skb_set_tunnel_opt)(void *ctx, void *md, int size) =
(void *) BPF_FUNC_skb_set_tunnel_opt;
+static unsigned long long (*bpf_get_prandom_u32)(void) =
+   (void *) BPF_FUNC_get_prandom_u32;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 0cfda2320320..97913e109b14 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -51,6 +51,7 @@ static int load_and_attach(const char *event, struct bpf_insn 
*prog, int size)
bool is_kretprobe = strncmp(event, "kretprobe/", 10) == 0;
bool is_tracepoint = strncmp(event, "tracepoint/", 11) == 0;
bool is_xdp = strncmp(event, "xdp", 3) == 0;
+   bool is_perf_event = strncmp(event, "perf_event", 10) == 0;
enum bpf_prog_type prog_type;
char buf[256];
int fd, efd, err, id;
@@ -69,6 +70,8 @@ static int load_and_attach(const char *event, struct bpf_insn 
*prog, int size)
prog_type = BPF_PROG_TYPE_TRACEPOINT;
} else if (is_xdp) {
prog_type = BPF_PROG_TYPE_XDP;
+   } else if (is_perf_event) {
+   prog_type = BPF_PROG_TYPE_PERF_EVENT;
} else {
printf("Unknown event '%s'\n", event);
return -1;
@@ -82,7 +85,7 @@ static int load_and_attach(const char *event, struct bpf_insn 
*prog, int size)
 
prog_fd[prog_cnt++] = fd;
 
-   if (is_xdp)
+   if (is_xdp || is_perf_event)
return 0;
 
if (is_socket) {
@@ -326,6 +329,7 @@ int load_bpf_file(char *path)
memcmp(shname_prog, "kretprobe/", 10) == 0 ||
memcmp(shname_prog, "tracepoint/", 11) == 0 ||
memcmp(shname_prog, "xdp", 3) == 0 ||
+   memcmp(shname_prog, "perf_event", 10) == 0 ||
memcmp(shname_prog, "socket", 6) == 0)
load_and_attach(shname_prog, insns, 
data_prog->d_size);
}
@@ -344,6 +348,7 @@ int load_bpf_file(char *path)
memcmp(shname, "kretprobe/", 10) == 0 ||
memcmp(shname, "tracepoint/&

[PATCH v2 net-next 4/6] perf, bpf: add perf events core support for BPF_PROG_TYPE_PERF_EVENT programs

2016-08-31 Thread Alexei Starovoitov
Allow attaching BPF_PROG_TYPE_PERF_EVENT programs to sw and hw perf events
via overflow_handler mechanism.
When program is attached the overflow_handlers become stacked.
The program acts as a filter.
Returning zero from the program means that the normal perf_event_output handler
will not be called and sampling event won't be stored in the ring buffer.

The overflow_handler_context==NULL is an additional safety check
to make sure programs are not attached to hw breakpoints and watchdog
in case other checks (that prevent that now anyway) get accidentally
relaxed in the future.

The program refcnt is incremented in case perf_events are inherited
when the target task is forked.
Similar to kprobe and tracepoint programs there is no ioctl to
detach the program or swap an already attached program. The user space
is expected to close(perf_event_fd) like it does right now for kprobe+bpf.
That restriction simplifies the code quite a bit.

The invocation of overflow_handler in __perf_event_overflow() is now
done via READ_ONCE, since that pointer can be replaced when the program
is attached while perf_event itself could have been active already.
There is no need to do similar treatment for event->prog, since it's
assigned only once before it's accessed.

Signed-off-by: Alexei Starovoitov <a...@kernel.org>
---
 include/linux/bpf.h|  4 +++
 include/linux/perf_event.h |  2 ++
 kernel/events/core.c   | 85 +-
 3 files changed, 90 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 11134238417d..9a904f63f8c1 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -297,6 +297,10 @@ static inline struct bpf_prog *bpf_prog_add(struct 
bpf_prog *prog, int i)
 static inline void bpf_prog_put(struct bpf_prog *prog)
 {
 }
+static inline struct bpf_prog *bpf_prog_inc(struct bpf_prog *prog)
+{
+   return ERR_PTR(-EOPNOTSUPP);
+}
 #endif /* CONFIG_BPF_SYSCALL */
 
 /* verifier prototypes for helper functions called from eBPF programs */
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 97bfe62f30d7..dcaaaf3ec8e6 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -679,6 +679,8 @@ struct perf_event {
u64 (*clock)(void);
perf_overflow_handler_t overflow_handler;
void*overflow_handler_context;
+   perf_overflow_handler_t orig_overflow_handler;
+   struct bpf_prog *prog;
 
 #ifdef CONFIG_EVENT_TRACING
struct trace_event_call *tp_event;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 3cfabdf7b942..305433ab2447 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7022,7 +7022,7 @@ static int __perf_event_overflow(struct perf_event *event,
	irq_work_queue(&event->pending);
}
 
-   event->overflow_handler(event, data, regs);
+   READ_ONCE(event->overflow_handler)(event, data, regs);
 
if (*perf_event_fasync(event) && event->pending_kill) {
event->pending_wakeup = 1;
@@ -7637,11 +7637,75 @@ static void perf_event_free_filter(struct perf_event 
*event)
ftrace_profile_free_filter(event);
 }
 
+static void bpf_overflow_handler(struct perf_event *event,
+struct perf_sample_data *data,
+struct pt_regs *regs)
+{
+   struct bpf_perf_event_data_kern ctx = {
+   .data = data,
+   .regs = regs,
+   };
+   int ret = 0;
+
+#ifdef CONFIG_BPF_SYSCALL
+   preempt_disable();
+   if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1))
+   goto out;
+   rcu_read_lock();
+   ret = BPF_PROG_RUN(event->prog, (void *)&ctx);
+   rcu_read_unlock();
+ out:
+   __this_cpu_dec(bpf_prog_active);
+   preempt_enable();
+#endif
+   if (!ret)
+   return;
+
+   event->orig_overflow_handler(event, data, regs);
+}
+
+static int perf_event_set_bpf_handler(struct perf_event *event, u32 prog_fd)
+{
+   struct bpf_prog *prog;
+
+   if (event->overflow_handler_context)
+   /* hw breakpoint or kernel counter */
+   return -EINVAL;
+
+   if (event->prog)
+   return -EEXIST;
+
+   prog = bpf_prog_get_type(prog_fd, BPF_PROG_TYPE_PERF_EVENT);
+   if (IS_ERR(prog))
+   return PTR_ERR(prog);
+
+   event->prog = prog;
+   event->orig_overflow_handler = READ_ONCE(event->overflow_handler);
+   WRITE_ONCE(event->overflow_handler, bpf_overflow_handler);
+   return 0;
+}
+
+static void perf_event_free_bpf_handler(struct perf_event *event)
+{
+   struct bpf_prog *prog = event->prog;
+
+   if (!prog)
+   return;
+
+   WRITE_ONCE(event->overflow_handler, event->orig_overflow_handler);
+  

[PATCH v2 net-next 0/6] perf, bpf: add support for bpf in sw/hw perf_events

2016-08-31 Thread Alexei Starovoitov
Hi Peter, Dave,

this patch set is a follow up to the discussion:
https://lkml.kernel.org/r/20160804142853.GO6862@twins.programming.kicks-ass.net
It turned out to be simpler than what we discussed.

Patches 1-3 is bpf-side prep for the main patch 4
that adds bpf program as an overflow_handler to sw and hw perf_events.
Peter, please review.

Patches 5 and 6 are examples from myself and Brendan.

v1-v2: fixed issues spotted by Peter and Daniel.

Thanks!

Alexei Starovoitov (5):
  bpf: support 8-byte metafield access
  bpf: introduce BPF_PROG_TYPE_PERF_EVENT program type
  bpf: perf_event progs should only use preallocated maps
  perf, bpf: add perf events core support for BPF_PROG_TYPE_PERF_EVENT
programs
  samples/bpf: add perf_event+bpf example

Brendan Gregg (1):
  samples/bpf: add sampleip example

 include/linux/bpf.h |   4 +
 include/linux/perf_event.h  |   7 ++
 include/uapi/linux/Kbuild   |   1 +
 include/uapi/linux/bpf.h|   1 +
 include/uapi/linux/bpf_perf_event.h |  18 +++
 kernel/bpf/verifier.c   |  31 +-
 kernel/events/core.c|  85 +-
 kernel/trace/bpf_trace.c|  60 ++
 samples/bpf/Makefile|   8 ++
 samples/bpf/bpf_helpers.h   |   2 +
 samples/bpf/bpf_load.c  |   7 +-
 samples/bpf/sampleip_kern.c |  38 +++
 samples/bpf/sampleip_user.c | 196 +
 samples/bpf/trace_event_kern.c  |  65 +++
 samples/bpf/trace_event_user.c  | 213 
 15 files changed, 730 insertions(+), 6 deletions(-)
 create mode 100644 include/uapi/linux/bpf_perf_event.h
 create mode 100644 samples/bpf/sampleip_kern.c
 create mode 100644 samples/bpf/sampleip_user.c
 create mode 100644 samples/bpf/trace_event_kern.c
 create mode 100644 samples/bpf/trace_event_user.c

-- 
2.8.0



[PATCH v2 net-next 2/6] bpf: introduce BPF_PROG_TYPE_PERF_EVENT program type

2016-08-31 Thread Alexei Starovoitov
Introduce BPF_PROG_TYPE_PERF_EVENT programs that can be attached to
HW and SW perf events (PERF_TYPE_HARDWARE and PERF_TYPE_SOFTWARE
correspondingly in uapi/linux/perf_event.h)

The program visible context meta structure is
struct bpf_perf_event_data {
struct pt_regs regs;
 __u64 sample_period;
};
which is accessible directly from the program:
int bpf_prog(struct bpf_perf_event_data *ctx)
{
  ... ctx->sample_period ...
  ... ctx->regs.ip ...
}

The bpf verifier rewrites the accesses into kernel internal
struct bpf_perf_event_data_kern which allows changing
struct perf_sample_data without affecting bpf programs.
New fields can be added to the end of struct bpf_perf_event_data
in the future.

Signed-off-by: Alexei Starovoitov <a...@kernel.org>
Acked-by: Daniel Borkmann <dan...@iogearbox.net>
---
 include/linux/perf_event.h  |  5 
 include/uapi/linux/Kbuild   |  1 +
 include/uapi/linux/bpf.h|  1 +
 include/uapi/linux/bpf_perf_event.h | 18 +++
 kernel/trace/bpf_trace.c| 60 +
 5 files changed, 85 insertions(+)
 create mode 100644 include/uapi/linux/bpf_perf_event.h

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 2b6b43cc0dd5..97bfe62f30d7 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -788,6 +788,11 @@ struct perf_output_handle {
int page;
 };
 
+struct bpf_perf_event_data_kern {
+   struct pt_regs *regs;
+   struct perf_sample_data *data;
+};
+
 #ifdef CONFIG_CGROUP_PERF
 
 /*
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 185f8ea2702f..d0352a971ebd 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -71,6 +71,7 @@ header-y += binfmts.h
 header-y += blkpg.h
 header-y += blktrace_api.h
 header-y += bpf_common.h
+header-y += bpf_perf_event.h
 header-y += bpf.h
 header-y += bpqether.h
 header-y += bsg.h
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index e4c5a1baa993..f896dfac4ac0 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -95,6 +95,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_SCHED_ACT,
BPF_PROG_TYPE_TRACEPOINT,
BPF_PROG_TYPE_XDP,
+   BPF_PROG_TYPE_PERF_EVENT,
 };
 
 #define BPF_PSEUDO_MAP_FD  1
diff --git a/include/uapi/linux/bpf_perf_event.h 
b/include/uapi/linux/bpf_perf_event.h
new file mode 100644
index ..067427259820
--- /dev/null
+++ b/include/uapi/linux/bpf_perf_event.h
@@ -0,0 +1,18 @@
+/* Copyright (c) 2016 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#ifndef _UAPI__LINUX_BPF_PERF_EVENT_H__
+#define _UAPI__LINUX_BPF_PERF_EVENT_H__
+
+#include 
+#include 
+
+struct bpf_perf_event_data {
+   struct pt_regs regs;
+   __u64 sample_period;
+};
+
+#endif /* _UAPI__LINUX_BPF_PERF_EVENT_H__ */
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index ad35213b8405..0ac414abbf68 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1,4 +1,5 @@
 /* Copyright (c) 2011-2015 PLUMgrid, http://plumgrid.com
+ * Copyright (c) 2016 Facebook
  *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of version 2 of the GNU General Public
@@ -8,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -552,10 +554,68 @@ static struct bpf_prog_type_list tracepoint_tl = {
.type   = BPF_PROG_TYPE_TRACEPOINT,
 };
 
+static bool pe_prog_is_valid_access(int off, int size, enum bpf_access_type 
type,
+   enum bpf_reg_type *reg_type)
+{
+   if (off < 0 || off >= sizeof(struct bpf_perf_event_data))
+   return false;
+   if (type != BPF_READ)
+   return false;
+   if (off % size != 0)
+   return false;
+   if (off == offsetof(struct bpf_perf_event_data, sample_period)) {
+   if (size != sizeof(u64))
+   return false;
+   } else {
+   if (size != sizeof(long))
+   return false;
+   }
+   return true;
+}
+
+static u32 pe_prog_convert_ctx_access(enum bpf_access_type type, int dst_reg,
+ int src_reg, int ctx_off,
+ struct bpf_insn *insn_buf,
+ struct bpf_prog *prog)
+{
+   struct bpf_insn *insn = insn_buf;
+
+   BUILD_BUG_ON(FIELD_SIZEOF(struct perf_sample_data, period) != 
sizeof(u64));
+   switch (ctx_off) {
+   case offsetof(struct bpf_perf_event_data, sample_period):
+   *insn++ = BPF_LDX_MEM(bytes_to_bpf_size(FIELD_SIZEOF(struct 
bpf_perf_event_data_kern, data)),
+ 

[PATCH v2 net-next 3/6] bpf: perf_event progs should only use preallocated maps

2016-08-31 Thread Alexei Starovoitov
Make sure that BPF_PROG_TYPE_PERF_EVENT programs only use
preallocated hash maps, since doing memory allocation
in overflow_handler can crash depending on where nmi got triggered.

Signed-off-by: Alexei Starovoitov <a...@kernel.org>
Acked-by: Daniel Borkmann <dan...@iogearbox.net>
---
 kernel/bpf/verifier.c | 22 +-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index c1c9e441f0f5..48c2705db22c 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2511,6 +2511,20 @@ process_bpf_exit:
return 0;
 }
 
+static int check_map_prog_compatibility(struct bpf_map *map,
+   struct bpf_prog *prog)
+
+{
+   if (prog->type == BPF_PROG_TYPE_PERF_EVENT &&
+   (map->map_type == BPF_MAP_TYPE_HASH ||
+map->map_type == BPF_MAP_TYPE_PERCPU_HASH) &&
+   (map->map_flags & BPF_F_NO_PREALLOC)) {
+   verbose("perf_event programs can only use preallocated hash 
map\n");
+   return -EINVAL;
+   }
+   return 0;
+}
+
 /* look for pseudo eBPF instructions that access map FDs and
  * replace them with actual map pointers
  */
@@ -2518,7 +2532,7 @@ static int replace_map_fd_with_map_ptr(struct 
verifier_env *env)
 {
struct bpf_insn *insn = env->prog->insnsi;
int insn_cnt = env->prog->len;
-   int i, j;
+   int i, j, err;
 
for (i = 0; i < insn_cnt; i++, insn++) {
if (BPF_CLASS(insn->code) == BPF_LDX &&
@@ -2562,6 +2576,12 @@ static int replace_map_fd_with_map_ptr(struct 
verifier_env *env)
return PTR_ERR(map);
}
 
+   err = check_map_prog_compatibility(map, env->prog);
+   if (err) {
+   fdput(f);
+   return err;
+   }
+
/* store map pointer inside BPF_LD_IMM64 instruction */
insn[0].imm = (u32) (unsigned long) map;
insn[1].imm = ((u64) (unsigned long) map) >> 32;
-- 
2.8.0



[PATCH v2 net-next 1/6] bpf: support 8-byte metafield access

2016-08-31 Thread Alexei Starovoitov
The verifier supported only 4-byte metafields in
struct __sk_buff and struct xdp_md. The metafields in upcoming
struct bpf_perf_event are 8-byte to match register width in struct pt_regs.
Teach verifier to recognize 8-byte metafield access.
The patch doesn't affect safety of sockets and xdp programs.
They check for 4-byte only ctx access before these conditions are hit.

Signed-off-by: Alexei Starovoitov <a...@kernel.org>
Acked-by: Daniel Borkmann <dan...@iogearbox.net>
---
 kernel/bpf/verifier.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index abb61f3f6900..c1c9e441f0f5 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2333,7 +2333,8 @@ static int do_check(struct verifier_env *env)
if (err)
return err;
 
-   if (BPF_SIZE(insn->code) != BPF_W) {
+   if (BPF_SIZE(insn->code) != BPF_W &&
+   BPF_SIZE(insn->code) != BPF_DW) {
insn_idx++;
continue;
}
@@ -2642,9 +2643,11 @@ static int convert_ctx_accesses(struct verifier_env *env)
for (i = 0; i < insn_cnt; i++, insn++) {
u32 insn_delta, cnt;
 
-   if (insn->code == (BPF_LDX | BPF_MEM | BPF_W))
+   if (insn->code == (BPF_LDX | BPF_MEM | BPF_W) ||
+   insn->code == (BPF_LDX | BPF_MEM | BPF_DW))
type = BPF_READ;
-   else if (insn->code == (BPF_STX | BPF_MEM | BPF_W))
+   else if (insn->code == (BPF_STX | BPF_MEM | BPF_W) ||
+insn->code == (BPF_STX | BPF_MEM | BPF_DW))
type = BPF_WRITE;
else
continue;
-- 
2.8.0



Re: [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands

2016-09-05 Thread Alexei Starovoitov

On 9/5/16 10:09 AM, Daniel Borkmann wrote:

On 09/05/2016 04:09 PM, Daniel Mack wrote:

On 09/05/2016 03:56 PM, Daniel Borkmann wrote:

On 09/05/2016 02:54 PM, Daniel Mack wrote:

On 08/30/2016 01:00 AM, Daniel Borkmann wrote:

On 08/26/2016 09:58 PM, Daniel Mack wrote:



enum bpf_map_type {
@@ -147,6 +149,13 @@ union bpf_attr {
		__aligned_u64	pathname;
		__u32		bpf_fd;
	};
+
+	struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
+		__u32	target_fd;	/* container object to attach to */
+		__u32	attach_bpf_fd;	/* eBPF program to attach */
+		__u32	attach_type;	/* BPF_ATTACH_TYPE_* */
+		__u64	attach_flags;


Could we just do ...

__u32 dst_fd;
__u32 src_fd;
__u32 attach_type;

... and leave flags out, since unused anyway? Also see below.


I'd really like to keep the flags, even if they're unused right now.
This only adds 8 bytes during the syscall operation, so it doesn't
harm.
However, we cannot change the userspace API after the fact, and who
knows what this (rather generic) interface will be used for later on.


With the below suggestion added, then flags doesn't need to be
added currently as it can be done safely at a later point in time
with respecting old binaries. See also the syscall handling code
in kernel/bpf/syscall.c +825 and the CHECK_ATTR() macro. The
underlying idea of this was taken from perf_event_open() syscall
back then, see [1] for a summary.

[1] https://lkml.org/lkml/2014/8/26/116


Yes, I know that's possible, and I like the idea, but I don't think any
new interface should come without flags really, as flags are something
that will most certainly be needed at some point anyway. I didn't have
them in my first shot, but Alexei pointed out that they should be added,
and I agree.

Also, this optimization wouldn't make the transported struct payload any
smaller anyway, because the member of that union used by BPF_PROG_LOAD
is still by far the biggest.

I really don't think it's worth sparing 8 bytes here and then do the
binary compat dance after flags are added, for no real gain.


Sure, but there's not much of a dance needed, see for example how map_flags
were added some time ago. So, iff there's really no foreseeable use-case in
sight and since we have this flexibility in place already, then I don't
quite
follow why it's needed, if there's zero pain to add it later on. I would
understand it of course, if it cannot be handled later on anymore.


I agree with Daniel B. Since flags are completely unused right now,
there is no plan to use them for anything in the coming months and,
even worse, they make an annoying hole in the struct, so let's not
add them. We can safely do that later. CHECK_ATTR() allows us to
do it easily. It's not like a syscall, where flags are a must-have
since they cannot be added later. Here it's done trivially.
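
For readers who have not seen it, the trick referred to here is roughly the
following; this is a simplified illustration of the CHECK_ATTR() idea, not the
actual macro in kernel/bpf/syscall.c:

/* reject the command if userspace set any byte past the fields this
 * kernel knows about for that command; new trailing fields can then be
 * added later without breaking old binaries that pass a zeroed tail
 */
static bool attr_tail_is_zero(const union bpf_attr *attr, size_t known_end)
{
	const char *p = (const char *)attr;
	size_t i;

	for (i = known_end; i < sizeof(*attr); i++)
		if (p[i])
			return false;
	return true;
}

So adding e.g. a flags field to BPF_PROG_ATTACH later only means moving
'known_end' for that one command; everything else stays as is.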



Re: [PATCH v3 2/6] cgroup: add support for eBPF programs

2016-09-05 Thread Alexei Starovoitov

On 9/5/16 2:40 PM, Sargun Dhillon wrote:

On Mon, Sep 05, 2016 at 04:49:26PM +0200, Daniel Mack wrote:

Hi,

On 08/30/2016 01:04 AM, Sargun Dhillon wrote:

On Fri, Aug 26, 2016 at 09:58:48PM +0200, Daniel Mack wrote:

This patch adds two sets of eBPF program pointers to struct cgroup.
One for such that are directly pinned to a cgroup, and one for such
that are effective for it.

To illustrate the logic behind that, assume the following example
cgroup hierarchy.

   A - B - C
 \ D - E

If only B has a program attached, it will be effective for B, C, D
and E. If D then attaches a program itself, that will be effective for
both D and E, and the program in B will only affect B and C. Only one
program of a given type is effective for a cgroup.


How does this work when running an orchestrator within an orchestrator? The
Docker in Docker / Mesos in Mesos use case, where the top level orchestrator is
observing the traffic, and there is an orchestrator within that also needs to run
it.

In this case, I'd like to run E's filter, then if it returns 0, D's, and B's,
and so on.


Running multiple programs was an idea I had in one of my earlier drafts,
but after some discussion, I refrained from it again because potentially
walking the cgroup hierarchy on every packet is just too expensive.


I think you're correct here. Maybe this is something I do with the LSM-attached
filters, and not for skb filters. Do you think there might be a way to opt-in to
this option?


Is it possible to allow this, either by flattening out the
datastructure (copy a ref to the bpf programs to C and E) or
something similar?


That would mean we carry a list of eBPF program pointers of dynamic
size. IOW, the deeper inside the cgroup hierarchy, the bigger the list,
so it can store a reference to all programs of all of its ancestor.

While I think that would be possible, even at some later point, I'd
really like to avoid it for the sake of simplicity.

Is there any reason why this can't be done in userspace? Compile a
program X for A, and overload it with Y, with Y doing the same as X
but adding some extra checks? Note that all users of the bpf(2) syscall API
will need CAP_NET_ADMIN anyway, so there is no delegation to
unprivileged sub-orchestrators or anything alike really.


One of the use-cases that's becoming more and more common is
containers-in-containers. In this, you have a privileged container that's
running something like build orchestration, and you want to do macro-isolation
(say limit access to only that tenant's infrastructure). Then, when the build
orchestrator runs a build, it may want to monitor, and further isolate the tasks
that run in the build job. This is a side-effect of composing different
container technologies. Typically you use one system for images, then another
for orchestration, and the actual program running inside of it can also leverage
containerization.

Example:
K8s->Docker->Jenkins Agent->Jenkins Build Job


frankly I don't buy this argument, since the above
and other 'examples' of container-in-container look
fake to me. There is a ton of work to be done for such
a scheme to be even remotely feasible. The cgroup+bpf
stuff would be the last on my list to 'fix' for such
deployments. I don't think we should worry about it
at present.



Re: Centralizing support for TCAM?

2016-09-05 Thread Alexei Starovoitov
On Sat, Sep 03, 2016 at 09:09:50AM +0200, Jiri Pirko wrote:
> Fri, Sep 02, 2016 at 08:49:34PM CEST, john.fastab...@gmail.com wrote:
> >On 16-09-02 10:18 AM, Florian Fainelli wrote:
> >> Hi all,
> >> 
> >
> >Hi Florian,
> >
> >> (apologies for the long CC list and the fact that I can't type correctly
> >> email addresses)
> >> 
> >
> >My favorite topic ;)
> >
> >> While working on adding support for the Broadcom Ethernet switches
> >> Compact Field Processor (which is essentially a TCAM,
> >> action/policer/rate meter RAMs, 256 entries), I started working with the
> >> ethtool::rxnfc API which is actually kind of nice in that it fits nicely
> >> with my use simple case of being able to insert rules at a given or
> >> driver selected location and has a pretty good flow representation for
> >> common things you may match: TCP/UDP v4/v6 (not so much for non-IP, or
> >> L2 stuff though you can use the extension flow representation). It lacks
> >> support for more complex actions other than redirect to a particular
> >> port/queue.
> >
> >When I was doing this for one of the products I work on I decided that
> >extending ethtool was likely not a good approach and building a netlink
> >interface would be a better choice. My reasons were mainly extending
> >ethtool is a bit painful to keep structure compatibility across versions
> >and I also had use cases that wanted to get notifications both made
> >easier when using netlink. However my netlink port+extensions were not
> >accepted and were called a "kernel bypass" and the general opinion was
> >that it was not going to be accepted upstream. Hence the 'tc' effort.
> 
> Ethtool should die peacefully. Don't poke in it in the process...
> 
> 
> >
> >> 
> >> Now ethtool::rxnfc is one possible user, but tc and netfiler also are,
> >> more powerful and extensible, but since this is a resource constrained
> >> piece of hardware, and it would suck for people to have to implement
> >> these 3 APIs if we could come up with a central one that satisfies the
> >> superset offered by tc + netfilter. We can surely imagine an use case we
> >
> >My opinion is that tc and netfilter are sufficiently different that
> >building a common layer is challenging and is actually more complex vs
> >just implementing two interfaces. Always happy to review code though.
> 
> In february, Pablo did some work on finding the common intermediate
> layer for classifier-action subsystem. It was rejected with the argument
> of unnecessary overhead. Makes sense to me. After that, you introduced
> u32 tc offload. Since that, couple more tc classifiers and actions were
> offloaded.
> 
> I believe that for Florian's use case, TC is a great fit. You can just use
> cls_flower with a couple of actions.
> 
> My colleagues are working hard on enabling cls_flower offload. You can
> easily benefit from that. In mlxsw we also plan to use that for our TCAM ACL
> offloading.
> 
> 
> >
> >There is also an already established packet flow through tc, netfilter,
> >fdb, l3 in linux that folks want to maintain. At the moment I just don't
> >see the need for a common layer IMO.
> >
> >Also adding another layer of abstraction so we end up doing multiple
> >translations into and out of these layers adds overhead. Eventually
> >I need to get reasonable operations per second on the TCAM tables.
> >Reasonable for me being somewhere in the 50k to 100k add/del/update
> >commands per second. I'm hesitant to create more abstractions than
> >are actually needed.
> >
> >> centralize the whole matching + action into a Domain Specific Language
> >> that we compiled into eBPF and then translate into whatever the HW
> >> understands, although that raises the question of where do we put the
> >> translation tool in user space or kernel space.
> >
> >The eBPF to HW translation I started to look at but gave up. The issue
> >was that the program space of eBPF is much larger than any traditional
> >parser/table hardware implementation can support, so most programs get
> >rejected (obvious observation, right?). I'm more inclined to build
> >hardware that can support eBPF vs restricting eBPF to fit into a
> >parser/table model.
> 
> +1
> I have been thinking a lot about this and I believe that parsing a bpf
> program in drivers into some pre-defined tables is quite complex. I
> think that bpf is just very unsuitable for offload if you don't have
> hw which can directly interpret it.
> I know that Alexei disagrees :)

lol :)
compiling bpf into a fixed pipeline asic is definitely not easy.
The problem with adding new cls classifiers and actions to match
what configurable hw does isn't pretty either. The fixed pipeline
isn't interesting beyond l2/l3, and flow-based hw features are mostly
useless in the tor. I'm not against adding new classifiers, since it's
better than an sdk, but we won't be using such tc features either.
Since this thread is about tcam... my $0.02 here is that it's pretty bad in
the nic (host) due to power consumption, and in the tor it's only good as

Re: [PATCH, net-next] perf, bpf: fix conditional call to bpf_overflow_handler

2016-09-06 Thread Alexei Starovoitov
On Tue, Sep 06, 2016 at 03:10:22PM +0200, Arnd Bergmann wrote:
> The newly added bpf_overflow_handler function is only built if both
> CONFIG_EVENT_TRACING and CONFIG_BPF_SYSCALL are enabled, but the caller
> only checks the latter:
> 
> kernel/events/core.c: In function 'perf_event_alloc':
> kernel/events/core.c:9106:27: error: 'bpf_overflow_handler' undeclared (first use in this function)
> 
> This changes the caller so we also skip this call if CONFIG_EVENT_TRACING
> is disabled entirely.
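
To illustrate the dependency (this is a toy sketch, not the kernel hunk
itself): a symbol that is only compiled when two config options are both set
has to be guarded by the same compound condition at every reference, e.g.:

#include <stdio.h>

/* toy stand-ins for the kernel config symbols; illustrative only */
#define CONFIG_BPF_SYSCALL 1
/* CONFIG_EVENT_TRACING intentionally left undefined */

#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_EVENT_TRACING)
static void bpf_overflow_handler(void)
{
	puts("bpf overflow handler");
}
#endif

int main(void)
{
#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_EVENT_TRACING)
	bpf_overflow_handler();	/* referenced only when it was actually built */
#else
	puts("bpf overflow handler not built");
#endif
	return 0;
}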
> 
> Signed-off-by: Arnd Bergmann <a...@arndb.de>
> Fixes: aa6a5f3cb2b2 ("perf, bpf: add perf events core support for BPF_PROG_TYPE_PERF_EVENT programs")
> ---
>  kernel/events/core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> I'm not entirely sure if this is the correct solution, please check before applying

Acked-by: Alexei Starovoitov <a...@kernel.org>

Thanks for the fix. I just saw the build bot complaining last night, and
by the morning your fix was already here. Thanks!



[PATCH v3 net-next 5/6] samples/bpf: add perf_event+bpf example

2016-09-01 Thread Alexei Starovoitov
The bpf program is called 50 times a second and does
hashmap[kern_stackid]++.
Its primary purpose is to check that key bpf helpers like map lookup, update,
get_stackid, trace_printk and ctx access are all working.
It checks:
- PERF_COUNT_HW_CPU_CYCLES on all cpus
- PERF_COUNT_HW_CPU_CYCLES for current process and inherited perf_events to children
- PERF_COUNT_SW_CPU_CLOCK on all cpus
- PERF_COUNT_SW_CPU_CLOCK for current process
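
For readers who want to see what the kernel side of such a test can look like,
here is a minimal sketch in the style of the samples. The map and function
names below are illustrative and this is not the exact trace_event_kern.c added
by the patch: it counts occurrences of each kernel stack id, exercising
bpf_get_stackid() plus map lookup/update from a perf_event program.

#include <linux/ptrace.h>
#include <uapi/linux/bpf.h>
#include <uapi/linux/bpf_perf_event.h>
#include "bpf_helpers.h"

#define STACK_DEPTH 64

/* counts per kernel stack id; preallocated by default (no BPF_F_NO_PREALLOC) */
struct bpf_map_def SEC("maps") counts = {
	.type = BPF_MAP_TYPE_HASH,
	.key_size = sizeof(u32),
	.value_size = sizeof(u64),
	.max_entries = 10000,
};

/* stack id -> raw instruction pointers */
struct bpf_map_def SEC("maps") stackmap = {
	.type = BPF_MAP_TYPE_STACK_TRACE,
	.key_size = sizeof(u32),
	.value_size = STACK_DEPTH * sizeof(u64),
	.max_entries = 10000,
};

SEC("perf_event")
int count_kern_stacks(struct bpf_perf_event_data *ctx)
{
	int stackid = bpf_get_stackid(ctx, &stackmap, 0 /* kernel stack */);
	u64 *val, one = 1;
	u32 key;

	if (stackid < 0)
		return 0;
	key = stackid;
	val = bpf_map_lookup_elem(&counts, &key);
	if (val)
		(*val)++;
	else
		bpf_map_update_elem(&counts, &key, &one, BPF_NOEXIST);
	return 0;
}
char _license[] SEC("license") = "GPL";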

Signed-off-by: Alexei Starovoitov <a...@kernel.org>
---
 samples/bpf/Makefile   |   4 +
 samples/bpf/bpf_helpers.h  |   2 +
 samples/bpf/bpf_load.c |   7 +-
 samples/bpf/trace_event_kern.c |  65 +
 samples/bpf/trace_event_user.c | 213 +
 5 files changed, 290 insertions(+), 1 deletion(-)
 create mode 100644 samples/bpf/trace_event_kern.c
 create mode 100644 samples/bpf/trace_event_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index db3cb061bfcd..a69cf9045285 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -25,6 +25,7 @@ hostprogs-y += test_cgrp2_array_pin
 hostprogs-y += xdp1
 hostprogs-y += xdp2
 hostprogs-y += test_current_task_under_cgroup
+hostprogs-y += trace_event
 
 test_verifier-objs := test_verifier.o libbpf.o
 test_maps-objs := test_maps.o libbpf.o
@@ -52,6 +53,7 @@ xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
 xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
 test_current_task_under_cgroup-objs := bpf_load.o libbpf.o \
   test_current_task_under_cgroup_user.o
+trace_event-objs := bpf_load.o libbpf.o trace_event_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -79,6 +81,7 @@ always += test_cgrp2_tc_kern.o
 always += xdp1_kern.o
 always += xdp2_kern.o
 always += test_current_task_under_cgroup_kern.o
+always += trace_event_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 
@@ -103,6 +106,7 @@ HOSTLOADLIBES_test_overhead += -lelf -lrt
 HOSTLOADLIBES_xdp1 += -lelf
 HOSTLOADLIBES_xdp2 += -lelf
 HOSTLOADLIBES_test_current_task_under_cgroup += -lelf
+HOSTLOADLIBES_trace_event += -lelf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index bbdf62a1e45e..90f44bd2045e 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -55,6 +55,8 @@ static int (*bpf_skb_get_tunnel_opt)(void *ctx, void *md, int 
size) =
(void *) BPF_FUNC_skb_get_tunnel_opt;
 static int (*bpf_skb_set_tunnel_opt)(void *ctx, void *md, int size) =
(void *) BPF_FUNC_skb_set_tunnel_opt;
+static unsigned long long (*bpf_get_prandom_u32)(void) =
+   (void *) BPF_FUNC_get_prandom_u32;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 0cfda2320320..97913e109b14 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -51,6 +51,7 @@ static int load_and_attach(const char *event, struct bpf_insn 
*prog, int size)
bool is_kretprobe = strncmp(event, "kretprobe/", 10) == 0;
bool is_tracepoint = strncmp(event, "tracepoint/", 11) == 0;
bool is_xdp = strncmp(event, "xdp", 3) == 0;
+   bool is_perf_event = strncmp(event, "perf_event", 10) == 0;
enum bpf_prog_type prog_type;
char buf[256];
int fd, efd, err, id;
@@ -69,6 +70,8 @@ static int load_and_attach(const char *event, struct bpf_insn 
*prog, int size)
prog_type = BPF_PROG_TYPE_TRACEPOINT;
} else if (is_xdp) {
prog_type = BPF_PROG_TYPE_XDP;
+   } else if (is_perf_event) {
+   prog_type = BPF_PROG_TYPE_PERF_EVENT;
} else {
printf("Unknown event '%s'\n", event);
return -1;
@@ -82,7 +85,7 @@ static int load_and_attach(const char *event, struct bpf_insn 
*prog, int size)
 
prog_fd[prog_cnt++] = fd;
 
-   if (is_xdp)
+   if (is_xdp || is_perf_event)
return 0;
 
if (is_socket) {
@@ -326,6 +329,7 @@ int load_bpf_file(char *path)
memcmp(shname_prog, "kretprobe/", 10) == 0 ||
memcmp(shname_prog, "tracepoint/", 11) == 0 ||
memcmp(shname_prog, "xdp", 3) == 0 ||
+   memcmp(shname_prog, "perf_event", 10) == 0 ||
memcmp(shname_prog, "socket", 6) == 0)
load_and_attach(shname_prog, insns, data_prog->d_size);
}
@@ -344,6 +348,7 @@ int load_bpf_file(char *path)
memcmp(shname, "kretprobe/", 10) == 0 ||
memcmp(shname, "tracepoint/&

[PATCH v3 net-next 6/6] samples/bpf: add sampleip example

2016-09-01 Thread Alexei Starovoitov
From: Brendan Gregg <bgr...@netflix.com>

sample instruction pointer and frequency count in a BPF map

Signed-off-by: Brendan Gregg <bgr...@netflix.com>
Signed-off-by: Alexei Starovoitov <a...@kernel.org>
---
 samples/bpf/Makefile|   4 +
 samples/bpf/sampleip_kern.c |  38 +
 samples/bpf/sampleip_user.c | 196 
 3 files changed, 238 insertions(+)
 create mode 100644 samples/bpf/sampleip_kern.c
 create mode 100644 samples/bpf/sampleip_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index a69cf9045285..12b7304d55dc 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -26,6 +26,7 @@ hostprogs-y += xdp1
 hostprogs-y += xdp2
 hostprogs-y += test_current_task_under_cgroup
 hostprogs-y += trace_event
+hostprogs-y += sampleip
 
 test_verifier-objs := test_verifier.o libbpf.o
 test_maps-objs := test_maps.o libbpf.o
@@ -54,6 +55,7 @@ xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
 test_current_task_under_cgroup-objs := bpf_load.o libbpf.o \
   test_current_task_under_cgroup_user.o
 trace_event-objs := bpf_load.o libbpf.o trace_event_user.o
+sampleip-objs := bpf_load.o libbpf.o sampleip_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -82,6 +84,7 @@ always += xdp1_kern.o
 always += xdp2_kern.o
 always += test_current_task_under_cgroup_kern.o
 always += trace_event_kern.o
+always += sampleip_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 
@@ -107,6 +110,7 @@ HOSTLOADLIBES_xdp1 += -lelf
 HOSTLOADLIBES_xdp2 += -lelf
 HOSTLOADLIBES_test_current_task_under_cgroup += -lelf
 HOSTLOADLIBES_trace_event += -lelf
+HOSTLOADLIBES_sampleip += -lelf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/sampleip_kern.c b/samples/bpf/sampleip_kern.c
new file mode 100644
index ..774a681f374a
--- /dev/null
+++ b/samples/bpf/sampleip_kern.c
@@ -0,0 +1,38 @@
+/* Copyright 2016 Netflix, Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <linux/version.h>
+#include <linux/ptrace.h>
+#include <uapi/linux/bpf.h>
+#include <uapi/linux/bpf_perf_event.h>
+#include "bpf_helpers.h"
+
+#define MAX_IPS		8192
+
+struct bpf_map_def SEC("maps") ip_map = {
+   .type = BPF_MAP_TYPE_HASH,
+   .key_size = sizeof(u64),
+   .value_size = sizeof(u32),
+   .max_entries = MAX_IPS,
+};
+
+SEC("perf_event")
+int do_sample(struct bpf_perf_event_data *ctx)
+{
+   u64 ip;
+   u32 *value, init_val = 1;
+
+   ip = ctx->regs.ip;
+   value = bpf_map_lookup_elem(&ip_map, &ip);
+   if (value)
+   *value += 1;
+   else
+   /* E2BIG not tested for this example only */
+   bpf_map_update_elem(&ip_map, &ip, &init_val, BPF_NOEXIST);
+
+   return 0;
+}
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/sampleip_user.c b/samples/bpf/sampleip_user.c
new file mode 100644
index ..260a6bdd6413
--- /dev/null
+++ b/samples/bpf/sampleip_user.c
@@ -0,0 +1,196 @@
+/*
+ * sampleip: sample instruction pointer and frequency count in a BPF map.
+ *
+ * Copyright 2016 Netflix, Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "libbpf.h"
+#include "bpf_load.h"
+
+#define DEFAULT_FREQ   99
+#define DEFAULT_SECS   5
+#define MAX_IPS		8192
+#define PAGE_OFFSET	0xffff880000000000
+
+static int nr_cpus;
+
+static void usage(void)
+{
+   printf("USAGE: sampleip [-F freq] [duration]\n");
+   printf("   -F freq# sample frequency (Hertz), default 99\n");
+   printf("   duration   # sampling duration (seconds), default 5\n");
+}
+
+static int sampling_start(int *pmu_fd, int freq)
+{
+   int i;
+
+   struct perf_event_attr pe_sample_attr = {
+   .type = PERF_TYPE_SOFTWARE,
+   .freq = 1,
+   .sample_period = freq,
+   .config = PERF_COUNT_SW_CPU_CLOCK,
+   .inherit = 1,
+   };
+
+   for (i = 0; i < nr_cpus; i++) {
pmu_fd[i] = perf_event_open(&pe_sample_attr, -1 /* pid */, i,
+   -1 /* group_fd */, 0 /* flags */);
+   if (pmu_fd[i] < 0) {
+   fprintf(stderr, "ERROR: Initializing perf sampling\n");
+   return 1;
+   }
+   assert(ioctl(pmu_fd[

[PATCH v3 net-next 3/6] bpf: perf_event progs should only use preallocated maps

2016-09-01 Thread Alexei Starovoitov
Make sure that BPF_PROG_TYPE_PERF_EVENT programs only use
preallocated hash maps, since doing memory allocation
in overflow_handler can crash depending on where the NMI got triggered.
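
As a rough user-space illustration of the requirement (the function below is
mine, not part of the patch): the hash map backing a perf_event program simply
has to be created without BPF_F_NO_PREALLOC, e.g. through the raw bpf(2)
syscall:

#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Sketch: create a hash map that a BPF_PROG_TYPE_PERF_EVENT program may use.
 * map_flags stays 0, i.e. all elements are preallocated at map creation time,
 * so no memory allocation ever happens from NMI context. A map created with
 * BPF_F_NO_PREALLOC would later make the verifier reject the program.
 */
static int create_prealloc_hash_map(int key_size, int value_size, int max_entries)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.map_type = BPF_MAP_TYPE_HASH;
	attr.key_size = key_size;
	attr.value_size = value_size;
	attr.max_entries = max_entries;
	attr.map_flags = 0;

	return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
}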

Signed-off-by: Alexei Starovoitov <a...@kernel.org>
Acked-by: Daniel Borkmann <dan...@iogearbox.net>
---
 kernel/bpf/verifier.c | 22 +-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index c1c9e441f0f5..48c2705db22c 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2511,6 +2511,20 @@ static int do_check(struct verifier_env *env)
return 0;
 }
 
+static int check_map_prog_compatibility(struct bpf_map *map,
+   struct bpf_prog *prog)
+
+{
+   if (prog->type == BPF_PROG_TYPE_PERF_EVENT &&
+   (map->map_type == BPF_MAP_TYPE_HASH ||
+map->map_type == BPF_MAP_TYPE_PERCPU_HASH) &&
+   (map->map_flags & BPF_F_NO_PREALLOC)) {
+   verbose("perf_event programs can only use preallocated hash 
map\n");
+   return -EINVAL;
+   }
+   return 0;
+}
+
 /* look for pseudo eBPF instructions that access map FDs and
  * replace them with actual map pointers
  */
@@ -2518,7 +2532,7 @@ static int replace_map_fd_with_map_ptr(struct 
verifier_env *env)
 {
struct bpf_insn *insn = env->prog->insnsi;
int insn_cnt = env->prog->len;
-   int i, j;
+   int i, j, err;
 
for (i = 0; i < insn_cnt; i++, insn++) {
if (BPF_CLASS(insn->code) == BPF_LDX &&
@@ -2562,6 +2576,12 @@ static int replace_map_fd_with_map_ptr(struct 
verifier_env *env)
return PTR_ERR(map);
}
 
+   err = check_map_prog_compatibility(map, env->prog);
+   if (err) {
+   fdput(f);
+   return err;
+   }
+
/* store map pointer inside BPF_LD_IMM64 instruction */
insn[0].imm = (u32) (unsigned long) map;
insn[1].imm = ((u64) (unsigned long) map) >> 32;
-- 
2.8.0



[PATCH v3 net-next 4/6] perf, bpf: add perf events core support for BPF_PROG_TYPE_PERF_EVENT programs

2016-09-01 Thread Alexei Starovoitov
Allow attaching BPF_PROG_TYPE_PERF_EVENT programs to sw and hw perf events
via the overflow_handler mechanism.
When a program is attached, the overflow_handlers become stacked.
The program acts as a filter:
returning zero from the program means that the normal perf_event_output handler
will not be called and the sampling event won't be stored in the ring buffer.
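
Purely as an illustration of that contract (this exact program is not part of
the series): a filter that only lets through samples whose instruction pointer
looks like an x86_64 kernel address could be written as:

#include <uapi/linux/bpf.h>
#include <uapi/linux/bpf_perf_event.h>
#include "bpf_helpers.h"

SEC("perf_event")
int sample_filter(struct bpf_perf_event_data *ctx)
{
	/* 0xffff880000000000 is used here as a rough x86_64 kernel/user
	 * address boundary; illustrative only.
	 */
	if (ctx->regs.ip >= 0xffff880000000000ull)
		return 1;	/* keep: normal perf output path runs */
	return 0;		/* drop: nothing goes to the ring buffer */
}
char _license[] SEC("license") = "GPL";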

The overflow_handler_context==NULL check is an additional safety check
to make sure programs are not attached to hw breakpoints or the watchdog,
in case other checks (that prevent that now anyway) get accidentally
relaxed in the future.

The program refcnt is incremented in case perf_events are inherited
when the target task is forked.
Similar to kprobe and tracepoint programs, there is no ioctl to
detach the program or swap an already attached program. User space
is expected to close(perf_event_fd) like it does right now for kprobe+bpf.
That restriction simplifies the code quite a bit.
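
For completeness, the user-space attach path is the existing
PERF_EVENT_IOC_SET_BPF ioctl that kprobe+bpf already uses. A hedged sketch
(error handling trimmed; prog_fd is assumed to be an already loaded
BPF_PROG_TYPE_PERF_EVENT program):

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Open a software cpu-clock sampling event on one cpu and attach a bpf
 * program to it. perf_event_open() has no libc wrapper, hence the raw
 * syscall. Closing the returned fd later detaches and releases the program.
 */
static int attach_prog_to_cpu_clock(int prog_fd, int cpu, int freq_hz)
{
	struct perf_event_attr attr = {
		.type = PERF_TYPE_SOFTWARE,
		.config = PERF_COUNT_SW_CPU_CLOCK,
		.freq = 1,
		.sample_period = freq_hz,
	};
	int pmu_fd;

	pmu_fd = syscall(__NR_perf_event_open, &attr, -1 /* pid */, cpu,
			 -1 /* group_fd */, 0 /* flags */);
	if (pmu_fd < 0)
		return -1;
	if (ioctl(pmu_fd, PERF_EVENT_IOC_SET_BPF, prog_fd) ||
	    ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0)) {
		close(pmu_fd);
		return -1;
	}
	return pmu_fd;
}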

The invocation of overflow_handler in __perf_event_overflow() is now
done via READ_ONCE, since that pointer can be replaced while the program
is being attached and the perf_event itself could already be active.
There is no need for similar treatment of event->prog, since it is
assigned only once before it is accessed.

Signed-off-by: Alexei Starovoitov <a...@kernel.org>
---
 include/linux/bpf.h|  4 +++
 include/linux/perf_event.h |  4 +++
 kernel/events/core.c   | 89 +-
 3 files changed, 96 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 11134238417d..9a904f63f8c1 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -297,6 +297,10 @@ static inline struct bpf_prog *bpf_prog_add(struct 
bpf_prog *prog, int i)
 static inline void bpf_prog_put(struct bpf_prog *prog)
 {
 }
+static inline struct bpf_prog *bpf_prog_inc(struct bpf_prog *prog)
+{
+   return ERR_PTR(-EOPNOTSUPP);
+}
 #endif /* CONFIG_BPF_SYSCALL */
 
 /* verifier prototypes for helper functions called from eBPF programs */
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 97bfe62f30d7..ccb73a58113d 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -679,6 +679,10 @@ struct perf_event {
u64 (*clock)(void);
perf_overflow_handler_t overflow_handler;
void*overflow_handler_context;
+#ifdef CONFIG_BPF_SYSCALL
+   perf_overflow_handler_t orig_overflow_handler;
+   struct bpf_prog *prog;
+#endif
 
 #ifdef CONFIG_EVENT_TRACING
struct trace_event_call *tp_event;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 3cfabdf7b942..85bf4c37911f 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7022,7 +7022,7 @@ static int __perf_event_overflow(struct perf_event *event,
irq_work_queue(&event->pending);
}
 
-   event->overflow_handler(event, data, regs);
+   READ_ONCE(event->overflow_handler)(event, data, regs);
 
if (*perf_event_fasync(event) && event->pending_kill) {
event->pending_wakeup = 1;
@@ -7637,11 +7637,83 @@ static void perf_event_free_filter(struct perf_event 
*event)
ftrace_profile_free_filter(event);
 }
 
+#ifdef CONFIG_BPF_SYSCALL
+static void bpf_overflow_handler(struct perf_event *event,
+struct perf_sample_data *data,
+struct pt_regs *regs)
+{
+   struct bpf_perf_event_data_kern ctx = {
+   .data = data,
+   .regs = regs,
+   };
+   int ret = 0;
+
+   preempt_disable();
+   if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1))
+   goto out;
+   rcu_read_lock();
+   ret = BPF_PROG_RUN(event->prog, (void *)&ctx);
+   rcu_read_unlock();
+out:
+   __this_cpu_dec(bpf_prog_active);
+   preempt_enable();
+   if (!ret)
+   return;
+
+   event->orig_overflow_handler(event, data, regs);
+}
+
+static int perf_event_set_bpf_handler(struct perf_event *event, u32 prog_fd)
+{
+   struct bpf_prog *prog;
+
+   if (event->overflow_handler_context)
+   /* hw breakpoint or kernel counter */
+   return -EINVAL;
+
+   if (event->prog)
+   return -EEXIST;
+
+   prog = bpf_prog_get_type(prog_fd, BPF_PROG_TYPE_PERF_EVENT);
+   if (IS_ERR(prog))
+   return PTR_ERR(prog);
+
+   event->prog = prog;
+   event->orig_overflow_handler = READ_ONCE(event->overflow_handler);
+   WRITE_ONCE(event->overflow_handler, bpf_overflow_handler);
+   return 0;
+}
+
+static void perf_event_free_bpf_handler(struct perf_event *event)
+{
+   struct bpf_prog *prog = event->prog;
+
+   if (!prog)
+   return;
+
+   WRITE_ONCE(event->overflow_handler, event->orig_overflow_handler);

[PATCH v3 net-next 0/6] perf, bpf: add support for bpf in sw/hw perf_events

2016-09-01 Thread Alexei Starovoitov
Hi Peter, Dave,

this patch set is a follow-up to the discussion:
https://lkml.kernel.org/r/20160804142853.GO6862%20()%20twins%20!%20programming%20!%20kicks-ass%20!%20net
It turned out to be simpler than what we discussed.

Patches 1-3 are bpf-side prep for the main patch 4,
which adds a bpf program as an overflow_handler to sw and hw perf_events.

Patches 5 and 6 are examples from myself and Brendan.

Peter,
to implement your suggestion to add ifdef CONFIG_BPF_SYSCALL
inside struct perf_event, I had to shuffle ifdefs in events/core.c.
Please double-check whether that is what you wanted to see.

v2->v3: fixed a few more minor issues
v1->v2: fixed issues spotted by Peter and Daniel.

Thanks!

Alexei Starovoitov (5):
  bpf: support 8-byte metafield access
  bpf: introduce BPF_PROG_TYPE_PERF_EVENT program type
  bpf: perf_event progs should only use preallocated maps
  perf, bpf: add perf events core support for BPF_PROG_TYPE_PERF_EVENT
programs
  samples/bpf: add perf_event+bpf example

Brendan Gregg (1):
  samples/bpf: add sampleip example

 include/linux/bpf.h |   4 +
 include/linux/perf_event.h  |   9 ++
 include/uapi/linux/Kbuild   |   1 +
 include/uapi/linux/bpf.h|   1 +
 include/uapi/linux/bpf_perf_event.h |  18 +++
 kernel/bpf/verifier.c   |  31 +-
 kernel/events/core.c|  89 ++-
 kernel/trace/bpf_trace.c|  61 +++
 samples/bpf/Makefile|   8 ++
 samples/bpf/bpf_helpers.h   |   2 +
 samples/bpf/bpf_load.c  |   7 +-
 samples/bpf/sampleip_kern.c |  38 +++
 samples/bpf/sampleip_user.c | 196 +
 samples/bpf/trace_event_kern.c  |  65 +++
 samples/bpf/trace_event_user.c  | 213 
 15 files changed, 737 insertions(+), 6 deletions(-)
 create mode 100644 include/uapi/linux/bpf_perf_event.h
 create mode 100644 samples/bpf/sampleip_kern.c
 create mode 100644 samples/bpf/sampleip_user.c
 create mode 100644 samples/bpf/trace_event_kern.c
 create mode 100644 samples/bpf/trace_event_user.c

-- 
2.8.0


