from:"Tycho Andersen"

Re: [PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing

2018-04-01 Thread Tycho Andersen

Hi Mickaël,

On Mon, Apr 02, 2018 at 12:04:36AM +0200, Mickaël Salaün wrote:
> >> vDSO is a code mapped for all processes. As you said, these processes
> >> may use it or not. What I was thinking about is to use the same concept,
> >> i.e. map a "shim" code into each processes pertaining to a particular
> >> hierarchy (the same way seccomp filters are inherited across processes).
> >> With a seccomp filter matching some syscall (e.g. mount, open), it is
> >> possible to jump back to the shim code thanks to SECCOMP_RET_TRAP. This
> >> shim code should then be able to emulate/patch what is needed, even
> >> faking a file opening by receiving a file descriptor through a UNIX
> >> socket. As did the Chrome sandbox, the seccomp filter may look at the
> >> calling address to allow the shim code to call syscalls without being
> >> catched, if needed. However, relying on SIGSYS may not fit with
> >> arbitrary code. Using a new SECCOMP_RET_EMULATE (?) may be used to jump
> >> to a specific process address, to emulate the syscall in an easier way
> >> than only relying on a {c,e}BPF program.
> >>
> > 
> > This could indeed be done, but I think that Tycho's approach is much
> > cleaner and probably faster.
> > 
> 
> I like it too but how does this handle file descriptors?

I think it could be done fairly simply, the most complicated part is
probably designing an API that doesn't suck. But the basic idea would
be:

struct seccomp_notif_resp {
__u64 id;
__s32 error;
__s64 val;
__s32 fd;
};

if the handler responds with fd >= 0, we grab the tracer's fd,
duplicate it, and install it somewhere in the tracee's fd table. Since
things like socket() will want to return the fd number as its
installed and the handler doesn't know that, we'll probably want some
way to indicate that the kernel should return this value. We could
either mandate that if fd >= 0, that's the value that will be returned
from the syscall, or add another flag that says "no, install the fd,
but really return what's in val instead).

I guess we can't mandate that we return fd, because e.g. netlink
sockets can sometimes return fds as part of the netlink messages, and
not as the return value from the syscall.

Tycho

Re: [PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing

2018-03-06 Thread Tycho Andersen

On Tue, Mar 06, 2018 at 10:33:17PM +, Andy Lutomirski wrote:
> >> Suppose I'm writing a container manager.  I want to run "mount" in the
> >> container, but I don't want to allow moun() in general and I want to
> >> emulate certain mount() actions.  I can write a filter that catches
> >> mount using seccomp and calls out to the container manager for help.
> >> This isn't theoretical -- Tycho wants *exactly* this use case to be
> >> supported.
> >
> > Well, I think this use case should be handled with something like
> > LD_PRELOAD and a helper library. FYI, I did something like this:
> > https://github.com/stemjail/stemshim
> 
> I doubt that will work for containers.  Containers that use user
> namespaces and, for example, setuid programs aren't going to honor
> LD_PRELOAD.

Or anything that calls syscalls directly, like go programs.

Tycho

Re: [net-next v3 1/2] bpf, seccomp: Add eBPF filter capabilities

2018-03-05 Thread Tycho Andersen

Hi Andy,

On Thu, Mar 01, 2018 at 10:05:47PM +, Andy Lutomirski wrote:
> But Tycho: would hooking user notifiers in right here work for you?
> As I see it, this would be the best justification for seccomp eBPF.

Sorry for the delay; Sargun had declared on irc that he was going to
implement it, so I didn't look into it. I think the basics will work,
but I haven't invested the time to look into it given the above.

Sargun, are you still planning to look at this? What's your timeline?

Cheers,

Tycho

Re: [net-next v3 0/2] eBPF seccomp filters

2018-02-26 Thread Tycho Andersen

On Mon, Feb 26, 2018 at 07:46:19PM -0800, Sargun Dhillon wrote:
> On Mon, Feb 26, 2018 at 5:01 PM, Tycho Andersen <ty...@tycho.ws> wrote:
> > On Mon, Feb 26, 2018 at 03:20:15PM -0800, Kees Cook wrote:
> >> On Mon, Feb 26, 2018 at 3:04 PM, Alexei Starovoitov
> >> <alexei.starovoi...@gmail.com> wrote:
> >> > On Mon, Feb 26, 2018 at 07:26:54AM +, Sargun Dhillon wrote:
> >> >> This patchset enables seccomp filters to be written in eBPF. Although, 
> >> >> this
> >> >> [...]
> >> > The main statement I want to hear from seccomp maintainers before
> >> > proceeding any further on this that enabling eBPF in seccomp won't lead
> >> > to seccomp folks arguing against changes in bpf core (like verifier)
> >> > just because it's used by seccomp.
> >> > It must be spelled out in the commit log with explicit Ack.
> >>
> >> The primary thing I'm concerned about with eBPF and seccomp is
> >> side-effects from eBPF programs running at syscall time. This is an
> >> extremely sensitive area, and I want to be sure there won't be
> >> feature-creep here that leads to seccomp getting into a bad state.
> >>
> >> As long as seccomp can continue have its own verifier,
> >
> > I guess these patches should introduce some additional restrictions in
> > kernel/seccomp.c then? Based on my reading now, it's whatever the eBPF
> > verifier allows.
> >
> Like what? The helpers allowed are listed in seccomp.c. You have the
> same restrictions as the traditional eBPF verifier (no unsafe memory
> access, jumps backwards, etc..). I'm not sure which built-in eBPF
> functionality presents risk.

I think that's the $64,000 question that Kees is trying to answer r.e.
maps, etc.

There's also the possibility that eBPF grows something new
that's unsafe for seccomp.

Cheers,

Tycho

Re: [net-next v3 1/2] bpf, seccomp: Add eBPF filter capabilities

2018-02-26 Thread Tycho Andersen

On Mon, Feb 26, 2018 at 07:49:48PM -0800, Sargun Dhillon wrote:
> On Mon, Feb 26, 2018 at 4:54 PM, Tycho Andersen <ty...@tycho.ws> wrote:
> > On Mon, Feb 26, 2018 at 07:27:05AM +, Sargun Dhillon wrote:
> >> +config SECCOMP_FILTER_EXTENDED
> >> + bool "Extended BPF seccomp filters"
> >> + depends on SECCOMP_FILTER && BPF_SYSCALL
> >> + depends on !CHECKPOINT_RESTORE
> >
> > Why not just give -EINVAL or something in case one of these is
> > requested, instead of making them incompatible at compile time?
> >
> > Tycho
> There's already code to return -EMEDIUMTYPE if it's a non-classic, or
> non-saved filter. Under the normal case, with CHECKPOINT_RESTORE
> enabled, you should never be able to get that. I think it makes sense
> to preserve this behaviour.

Oh, right. So can't we just drop this, and the existing code will
DTRT, i.e. give you -EMEDIUMTYPE because the new filters aren't
supported, until they are?

Tycho

Re: [net-next v3 0/2] eBPF seccomp filters

2018-02-26 Thread Tycho Andersen

On Mon, Feb 26, 2018 at 03:20:15PM -0800, Kees Cook wrote:
> On Mon, Feb 26, 2018 at 3:04 PM, Alexei Starovoitov
>  wrote:
> > On Mon, Feb 26, 2018 at 07:26:54AM +, Sargun Dhillon wrote:
> >> This patchset enables seccomp filters to be written in eBPF. Although, this
> >> [...]
> > The main statement I want to hear from seccomp maintainers before
> > proceeding any further on this that enabling eBPF in seccomp won't lead
> > to seccomp folks arguing against changes in bpf core (like verifier)
> > just because it's used by seccomp.
> > It must be spelled out in the commit log with explicit Ack.
> 
> The primary thing I'm concerned about with eBPF and seccomp is
> side-effects from eBPF programs running at syscall time. This is an
> extremely sensitive area, and I want to be sure there won't be
> feature-creep here that leads to seccomp getting into a bad state.
> 
> As long as seccomp can continue have its own verifier,

I guess these patches should introduce some additional restrictions in
kernel/seccomp.c then? Based on my reading now, it's whatever the eBPF
verifier allows.

> I *think* this will be fine, though, again I remain concerned about
> maps, etc. I'm still reviewing these patches and how they might
> provide overlap with Tycho's needs too, etc.

Yes, it's on my TODO list to take a look at how to do it as suggested
by Alexi on top of this set before posting a v2. Haven't had time
recently, though.

Cheers,

Tycho

Re: [net-next v3 1/2] bpf, seccomp: Add eBPF filter capabilities

2018-02-26 Thread Tycho Andersen

On Mon, Feb 26, 2018 at 07:27:05AM +, Sargun Dhillon wrote:
> +config SECCOMP_FILTER_EXTENDED
> + bool "Extended BPF seccomp filters"
> + depends on SECCOMP_FILTER && BPF_SYSCALL
> + depends on !CHECKPOINT_RESTORE

Why not just give -EINVAL or something in case one of these is
requested, instead of making them incompatible at compile time?

Tycho

Re: [PATCH net-next 0/3] eBPF Seccomp filters

2018-02-14 Thread Tycho Andersen

On Wed, Feb 14, 2018 at 05:25:00PM +, Andy Lutomirski wrote:
> On Tue, Feb 13, 2018 at 3:47 PM, Kees Cook  wrote:
> > On Tue, Feb 13, 2018 at 7:42 AM, Sargun Dhillon  wrote:
> >> This patchset enables seccomp filters to be written in eBPF. Although,
> >> this patchset doesn't introduce much of the functionality enabled by
> >> eBPF, it lays the ground work for it.
> >>
> >> It also introduces the capability to dump eBPF filters via the PTRACE
> >> API in order to make it so that CHECKPOINT_RESTORE will be satisifed.
> >> In the attached samples, there's an example of this. One can then use
> >> BPF_OBJ_GET_INFO_BY_FD in order to get the actual code of the program,
> >> and use that at reload time.
> >>
> >> The primary reason for not adding maps support in this patchset is
> >> to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS.
> >> If we have a map that the BPF program can read, it can potentially
> >> "change" privileges after running. It seems like doing writes only
> >> is safe, because it can be pure, and side effect free, and therefore
> >> not negatively effect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come
> >> to an agreement, this can be in a follow-up patchset.
> >
> > What's the reason for adding eBPF support? seccomp shouldn't need it,
> > and it only makes the code more complex. I'd rather stick with cBPF
> > until we have an overwhelmingly good reason to use eBPF as a "native"
> > seccomp filter language.
> >
> 
> I can think of two fairly strong use cases for eBPF's ability to call
> functions: logging and Tycho's user notifier thing.

Worth noting that there is one additional thing that I didn't
implement, but which would be nice and is probably not possible with
eBPF (at least, not without a bunch of additional infrastructure):
passing fds back to the tracee from the manager if you intercept
socket(), or accept() or something.

This could again be accomplished via other means, though it would be a
lot nicer to have a primitive for it.

That said, I think it's more important that something like this gets
in, vs. that it gets in with some approach like I've posted. So if we
go with eBPF and some wait functions and acknowledge that you have to
do some ptrace surgery, that is better than nothing.

Tycho

Re: [PATCH net-next 0/3] eBPF Seccomp filters

2018-02-13 Thread Tycho Andersen

On Tue, Feb 13, 2018 at 12:16:42PM -0800, Kees Cook wrote:
> If the needs Tycho outlined[1] could be addressed fully with eBPF, and
> we can very narrowly scope the use of the "extra" eBPF features, I
> might be more inclined to merge something like this, but I want to
> take it very carefully. Besides creating a dependency on the bpf()
> syscall, this would create side channels (via maps) that make me very
> uncomfortable when dealing with process isolation. (Though, in theory,
> this is already correctly constrained by no-new-privs...)
> 
> Tycho, could you get what you needed from eBPF?

We could get almost all the way there, I think. We could pass the
event via a bpf map, and then have a userspace daemon do:

while (1) {
bpf(BPF_MAP_LOOKUP_ELEM, , sizeof(attr));
if (!syscall_queued())
continue;

do_stuff();
set_done();
bpf(BPF_MAP_UPDATE_ELEM, , sizeof(attr));
}

but as you say,

> My impression would be that you'd still need a user notification
> mechanism to stop the process, as the decisions about how to rewrite
> arguments likely cannot be fully characterized by the internal eBPF
> filter.

...there's no way to stop the seccomp'd task until userspace is
finished with whatever thing it needs to do on behalf of the seccomp'd
task (at least, IIUC).

That's of course ignoring the ergonomics from userspace: bpf_map_fops
doesn't implement poll() or anything, so we really do have to use a
while(1), if we want to allow more than one syscall queuing at a time,
we need to poll multiple map elements.

One of the extensions I had been considering floating for v2 of my set
was allowing users to pass fds back across (again, to make userspace
ergonomics a little better), which would be impossible via ebpf.

Cheers,

Tycho

Re: [PATCH] ebpf: verify the output of the JIT

2017-04-04 Thread Tycho Andersen

Hi Kees,

On Tue, Apr 04, 2017 at 03:17:57PM -0700, Kees Cook wrote:
> On Tue, Apr 4, 2017 at 3:08 PM, Tycho Andersen <ty...@docker.com> wrote:
> > The goal of this patch is to protect the JIT against an attacker with a
> > write-in-memory primitive. The JIT allocates a buffer which will eventually
> > be marked +x, so we need to make sure that what was written to this buffer
> > is what was intended.
> >
> > We acheive this by building a hash of the instruction buffer as
> > instructions are emittted and then comparing that to a hash at the end of
> > the JIT compile after the buffer has been marked read-only.
> >
> > Signed-off-by: Tycho Andersen <ty...@docker.com>
> > CC: Daniel Borkmann <dan...@iogearbox.net>
> > CC: Alexei Starovoitov <a...@kernel.org>
> > CC: Kees Cook <keesc...@chromium.org>
> > CC: Mickaël Salaün <m...@digikod.net>
> 
> Cool! This closes the race condition on producing the JIT vs going
> read-only. I wonder if it might be possible to make this a more
> generic interface to the BPF which would be allocate the hash, provide
> the update callback during emit, and then do the hash check itself at
> the end of bpf_jit_binary_lock_ro()?

Yes, probably so. I can look into that for the next version.

Tycho

[PATCH] ebpf: verify the output of the JIT

2017-04-04 Thread Tycho Andersen

The goal of this patch is to protect the JIT against an attacker with a
write-in-memory primitive. The JIT allocates a buffer which will eventually
be marked +x, so we need to make sure that what was written to this buffer
is what was intended.

We acheive this by building a hash of the instruction buffer as
instructions are emittted and then comparing that to a hash at the end of
the JIT compile after the buffer has been marked read-only.

Signed-off-by: Tycho Andersen <ty...@docker.com>
CC: Daniel Borkmann <dan...@iogearbox.net>
CC: Alexei Starovoitov <a...@kernel.org>
CC: Kees Cook <keesc...@chromium.org>
CC: Mickaël Salaün <m...@digikod.net>
---
 arch/x86/Kconfig|  11 
 arch/x86/net/bpf_jit_comp.c | 147 
 2 files changed, 147 insertions(+), 11 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cc98d5a..7b2db2c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2789,6 +2789,17 @@ config X86_DMA_REMAP
 
 source "net/Kconfig"
 
+config EBPF_JIT_HASH_OUTPUT
+   def_bool y
+   depends on HAVE_EBPF_JIT
+   depends on BPF_JIT
+   select CRYPTO_SHA256
+   ---help---
+ Enables a double check of the JIT's output after it is marked 
read-only to
+ ensure that it matches what the JIT generated.
+
+ Note, only applies when /proc/sys/net/core/bpf_jit_harden > 0.
+
 source "drivers/Kconfig"
 
 source "drivers/firmware/Kconfig"
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 32322ce..be1271e 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -13,9 +13,15 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 int bpf_jit_enable __read_mostly;
 
+#ifdef CONFIG_EBPF_JIT_HASH_OUTPUT
+struct crypto_shash *tfm __read_mostly;
+#endif
+
 /*
  * assembly code in arch/x86/net/bpf_jit.S
  */
@@ -25,7 +31,8 @@ extern u8 sk_load_byte_positive_offset[];
 extern u8 sk_load_word_negative_offset[], sk_load_half_negative_offset[];
 extern u8 sk_load_byte_negative_offset[];
 
-static u8 *emit_code(u8 *ptr, u32 bytes, unsigned int len)
+static u8 *emit_code(u8 *ptr, u32 bytes, unsigned int len,
+struct shash_desc *hash)
 {
if (len == 1)
*ptr = bytes;
@@ -35,11 +42,15 @@ static u8 *emit_code(u8 *ptr, u32 bytes, unsigned int len)
*(u32 *)ptr = bytes;
barrier();
}
+
+   if (IS_ENABLED(CONFIG_EBPF_JIT_HASH_OUTPUT) && hash)
+   crypto_shash_update(hash, (u8 *) , len);
+
return ptr + len;
 }
 
 #define EMIT(bytes, len) \
-   do { prog = emit_code(prog, bytes, len); cnt += len; } while (0)
+   do { prog = emit_code(prog, bytes, len, hash); cnt += len; } while (0)
 
 #define EMIT1(b1)  EMIT(b1, 1)
 #define EMIT2(b1, b2)  EMIT((b1) + ((b2) << 8), 2)
@@ -206,7 +217,7 @@ struct jit_context {
 /* emit x64 prologue code for BPF program and check it's size.
  * bpf_tail_call helper will skip it while jumping into another program
  */
-static void emit_prologue(u8 **pprog)
+static void emit_prologue(u8 **pprog, struct shash_desc *hash)
 {
u8 *prog = *pprog;
int cnt = 0;
@@ -264,7 +275,7 @@ static void emit_prologue(u8 **pprog)
  *   goto *(prog->bpf_func + prologue_size);
  * out:
  */
-static void emit_bpf_tail_call(u8 **pprog)
+static void emit_bpf_tail_call(u8 **pprog, struct shash_desc *hash)
 {
u8 *prog = *pprog;
int label1, label2, label3;
@@ -328,7 +339,7 @@ static void emit_bpf_tail_call(u8 **pprog)
 }
 
 
-static void emit_load_skb_data_hlen(u8 **pprog)
+static void emit_load_skb_data_hlen(u8 **pprog, struct shash_desc *hash)
 {
u8 *prog = *pprog;
int cnt = 0;
@@ -348,7 +359,8 @@ static void emit_load_skb_data_hlen(u8 **pprog)
 }
 
 static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
- int oldproglen, struct jit_context *ctx)
+ int oldproglen, struct jit_context *ctx,
+ struct shash_desc *hash)
 {
struct bpf_insn *insn = bpf_prog->insnsi;
int insn_cnt = bpf_prog->len;
@@ -360,10 +372,10 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, 
u8 *image,
int proglen = 0;
u8 *prog = temp;
 
-   emit_prologue();
+   emit_prologue(, hash);
 
if (seen_ld_abs)
-   emit_load_skb_data_hlen();
+   emit_load_skb_data_hlen(, hash);
 
for (i = 0; i < insn_cnt; i++, insn++) {
const s32 imm32 = insn->imm;
@@ -875,7 +887,7 @@ xadd:   if (is_imm8(insn->off))
if (seen_ld_abs) {
if (reload_skb_data) {
EMIT1(0x5F); /* pop %rdi */
-   emit_load_skb_data_hlen();
+

[PATCH v3] openvswitch: allow management from inside user namespaces

2016-02-05 Thread Tycho Andersen

Operations with the GENL_ADMIN_PERM flag fail permissions checks because
this flag means we call netlink_capable, which uses the init user ns.

Instead, let's introduce a new flag, GENL_UNS_ADMIN_PERM for operations
which should be allowed inside a user namespace.

The motivation for this is to be able to run openvswitch in unprivileged
containers. I've tested this and it seems to work, but I really have no
idea about the security consequences of this patch, so thoughts would be
much appreciated.

v2: use the GENL_UNS_ADMIN_PERM flag instead of a check in each function
v3: use separate ifs for UNS_ADMIN_PERM and ADMIN_PERM, instead of one
massive one

Reported-by: James Page <james.p...@canonical.com>
Signed-off-by: Tycho Andersen <tycho.ander...@canonical.com>
CC: Eric Biederman <ebied...@xmission.com>
CC: Pravin Shelar <pshe...@ovn.org>
CC: Justin Pettit <jpet...@nicira.com>
CC: "David S. Miller" <da...@davemloft.net>
---
 include/uapi/linux/genetlink.h |  1 +
 net/netlink/genetlink.c|  4 
 net/openvswitch/datapath.c | 20 ++--
 3 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/include/uapi/linux/genetlink.h b/include/uapi/linux/genetlink.h
index c3363ba..5512c90 100644
--- a/include/uapi/linux/genetlink.h
+++ b/include/uapi/linux/genetlink.h
@@ -21,6 +21,7 @@ struct genlmsghdr {
 #define GENL_CMD_CAP_DO0x02
 #define GENL_CMD_CAP_DUMP  0x04
 #define GENL_CMD_CAP_HASPOL0x08
+#define GENL_UNS_ADMIN_PERM0x10
 
 /*
  * List of reserved static generic netlink identifiers:
diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
index f830326..0ffd721 100644
--- a/net/netlink/genetlink.c
+++ b/net/netlink/genetlink.c
@@ -580,6 +580,10 @@ static int genl_family_rcv_msg(struct genl_family *family,
!netlink_capable(skb, CAP_NET_ADMIN))
return -EPERM;
 
+   if ((ops->flags & GENL_UNS_ADMIN_PERM) &&
+   !netlink_ns_capable(skb, net->user_ns, CAP_NET_ADMIN))
+   return -EPERM;
+
if ((nlh->nlmsg_flags & NLM_F_DUMP) == NLM_F_DUMP) {
int rc;
 
diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index deadfda..d6f7fe9 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -654,7 +654,7 @@ static const struct nla_policy 
packet_policy[OVS_PACKET_ATTR_MAX + 1] = {
 
 static const struct genl_ops dp_packet_genl_ops[] = {
{ .cmd = OVS_PACKET_CMD_EXECUTE,
- .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
+ .flags = GENL_UNS_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
  .policy = packet_policy,
  .doit = ovs_packet_cmd_execute
}
@@ -1391,12 +1391,12 @@ static const struct nla_policy 
flow_policy[OVS_FLOW_ATTR_MAX + 1] = {
 
 static const struct genl_ops dp_flow_genl_ops[] = {
{ .cmd = OVS_FLOW_CMD_NEW,
- .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
+ .flags = GENL_UNS_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
  .policy = flow_policy,
  .doit = ovs_flow_cmd_new
},
{ .cmd = OVS_FLOW_CMD_DEL,
- .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
+ .flags = GENL_UNS_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
  .policy = flow_policy,
  .doit = ovs_flow_cmd_del
},
@@ -1407,7 +1407,7 @@ static const struct genl_ops dp_flow_genl_ops[] = {
  .dumpit = ovs_flow_cmd_dump
},
{ .cmd = OVS_FLOW_CMD_SET,
- .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
+ .flags = GENL_UNS_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
  .policy = flow_policy,
  .doit = ovs_flow_cmd_set,
},
@@ -1777,12 +1777,12 @@ static const struct nla_policy 
datapath_policy[OVS_DP_ATTR_MAX + 1] = {
 
 static const struct genl_ops dp_datapath_genl_ops[] = {
{ .cmd = OVS_DP_CMD_NEW,
- .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
+ .flags = GENL_UNS_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
  .policy = datapath_policy,
  .doit = ovs_dp_cmd_new
},
{ .cmd = OVS_DP_CMD_DEL,
- .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
+ .flags = GENL_UNS_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
  .policy = datapath_policy,
  .doit = ovs_dp_cmd_del
},
@@ -1793,7 +1793,7 @@ static const struct genl_ops dp_datapath_genl_ops[] = {
  .dumpit = ovs_dp_cmd_dump
},
{ .cmd = OVS_DP_CMD_SET,
- .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
+ .flags = GENL_UNS_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
  .policy = datapath_policy,
  .doit = ovs_dp_cmd_set,
},
@@ -2158,12 +2158,12

[PATCH v2] openvswitch: allow management from inside user namespaces

2016-02-01 Thread Tycho Andersen

Operations with the GENL_ADMIN_PERM flag fail permissions checks because
this flag means we call netlink_capable, which uses the init user ns.

Instead, let's introduce a new flag, GENL_UNS_ADMIN_PERM for operations
which should be allowed inside a user namespace.

The motivation for this is to be able to run openvswitch in unprivileged
containers. I've tested this and it seems to work, but I really have no
idea about the security consequences of this patch, so thoughts would be
much appreciated.

v2: use the GENL_UNS_ADMIN_PERM flag instead of a check in each function

Reported-by: James Page <james.p...@canonical.com>
Signed-off-by: Tycho Andersen <tycho.ander...@canonical.com>
CC: Eric Biederman <ebied...@xmission.com>
CC: Pravin Shelar <pshe...@ovn.org>
CC: Justin Pettit <jpet...@nicira.com>
CC: "David S. Miller" <da...@davemloft.net>
---
 include/uapi/linux/genetlink.h |  1 +
 net/netlink/genetlink.c|  6 --
 net/openvswitch/datapath.c | 20 ++--
 3 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/include/uapi/linux/genetlink.h b/include/uapi/linux/genetlink.h
index c3363ba..5512c90 100644
--- a/include/uapi/linux/genetlink.h
+++ b/include/uapi/linux/genetlink.h
@@ -21,6 +21,7 @@ struct genlmsghdr {
 #define GENL_CMD_CAP_DO0x02
 #define GENL_CMD_CAP_DUMP  0x04
 #define GENL_CMD_CAP_HASPOL0x08
+#define GENL_UNS_ADMIN_PERM0x10
 
 /*
  * List of reserved static generic netlink identifiers:
diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
index f830326..6bbb3eb 100644
--- a/net/netlink/genetlink.c
+++ b/net/netlink/genetlink.c
@@ -576,8 +576,10 @@ static int genl_family_rcv_msg(struct genl_family *family,
if (ops == NULL)
return -EOPNOTSUPP;
 
-   if ((ops->flags & GENL_ADMIN_PERM) &&
-   !netlink_capable(skb, CAP_NET_ADMIN))
+   if (((ops->flags & GENL_ADMIN_PERM) &&
+   !netlink_capable(skb, CAP_NET_ADMIN)) ||
+   ((ops->flags & GENL_UNS_ADMIN_PERM) &&
+   !netlink_ns_capable(skb, net->user_ns, CAP_NET_ADMIN)))
return -EPERM;
 
if ((nlh->nlmsg_flags & NLM_F_DUMP) == NLM_F_DUMP) {
diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index deadfda..d6f7fe9 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -654,7 +654,7 @@ static const struct nla_policy 
packet_policy[OVS_PACKET_ATTR_MAX + 1] = {
 
 static const struct genl_ops dp_packet_genl_ops[] = {
{ .cmd = OVS_PACKET_CMD_EXECUTE,
- .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
+ .flags = GENL_UNS_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
  .policy = packet_policy,
  .doit = ovs_packet_cmd_execute
}
@@ -1391,12 +1391,12 @@ static const struct nla_policy 
flow_policy[OVS_FLOW_ATTR_MAX + 1] = {
 
 static const struct genl_ops dp_flow_genl_ops[] = {
{ .cmd = OVS_FLOW_CMD_NEW,
- .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
+ .flags = GENL_UNS_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
  .policy = flow_policy,
  .doit = ovs_flow_cmd_new
},
{ .cmd = OVS_FLOW_CMD_DEL,
- .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
+ .flags = GENL_UNS_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
  .policy = flow_policy,
  .doit = ovs_flow_cmd_del
},
@@ -1407,7 +1407,7 @@ static const struct genl_ops dp_flow_genl_ops[] = {
  .dumpit = ovs_flow_cmd_dump
},
{ .cmd = OVS_FLOW_CMD_SET,
- .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
+ .flags = GENL_UNS_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
  .policy = flow_policy,
  .doit = ovs_flow_cmd_set,
},
@@ -1777,12 +1777,12 @@ static const struct nla_policy 
datapath_policy[OVS_DP_ATTR_MAX + 1] = {
 
 static const struct genl_ops dp_datapath_genl_ops[] = {
{ .cmd = OVS_DP_CMD_NEW,
- .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
+ .flags = GENL_UNS_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
  .policy = datapath_policy,
  .doit = ovs_dp_cmd_new
},
{ .cmd = OVS_DP_CMD_DEL,
- .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
+ .flags = GENL_UNS_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
  .policy = datapath_policy,
  .doit = ovs_dp_cmd_del
},
@@ -1793,7 +1793,7 @@ static const struct genl_ops dp_datapath_genl_ops[] = {
  .dumpit = ovs_dp_cmd_dump
},
{ .cmd = OVS_DP_CMD_SET,
- .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
+ .flags = GENL_UNS_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
  .

[PATCH] openvswitch: allow management from inside user namespaces

2016-01-29 Thread Tycho Andersen

Operations with the GENL_ADMIN_PERM flag fail permissions checks because
this flag means we call netlink_capable, which uses the init user ns.

Instead, let's do permissions checks in each function, but use the netlink
socket's user ns instead of the initial one, to allow management of
openvswitch resources from inside a user ns.

The motivation for this is to be able to run openvswitch in unprivileged
containers. I've tested this and it seems to work, but I really have no
idea about the security consequences of this patch, so thoughts would be
much appreciated.

Reported-by: James Page <james.p...@canonical.com>
Signed-off-by: Tycho Andersen <tycho.ander...@canonical.com>
CC: Eric Biederman <ebied...@xmission.com>
CC: Pravin Shelar <pshe...@ovn.org>
CC: Justin Pettit <jpet...@nicira.com>
CC: "David S. Miller" <da...@davemloft.net>
---
 net/openvswitch/datapath.c | 63 ++
 1 file changed, 53 insertions(+), 10 deletions(-)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index deadfda..aacfb11 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -557,6 +557,10 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, 
struct genl_info *info)
int err;
bool log = !a[OVS_PACKET_ATTR_PROBE];
 
+   err = -EPERM;
+   if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
+   goto err;
+
err = -EINVAL;
if (!a[OVS_PACKET_ATTR_PACKET] || !a[OVS_PACKET_ATTR_KEY] ||
!a[OVS_PACKET_ATTR_ACTIONS])
@@ -654,7 +658,7 @@ static const struct nla_policy 
packet_policy[OVS_PACKET_ATTR_MAX + 1] = {
 
 static const struct genl_ops dp_packet_genl_ops[] = {
{ .cmd = OVS_PACKET_CMD_EXECUTE,
- .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
+ .flags = 0,
  .policy = packet_policy,
  .doit = ovs_packet_cmd_execute
}
@@ -920,6 +924,10 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct 
genl_info *info)
int error;
bool log = !a[OVS_FLOW_ATTR_PROBE];
 
+   error = -EPERM;
+   if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
+   goto error;
+
/* Must have key and actions. */
error = -EINVAL;
if (!a[OVS_FLOW_ATTR_KEY]) {
@@ -1104,6 +1112,10 @@ static int ovs_flow_cmd_set(struct sk_buff *skb, struct 
genl_info *info)
bool log = !a[OVS_FLOW_ATTR_PROBE];
bool ufid_present;
 
+   error = -EPERM;
+   if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
+   goto error;
+
/* Extract key. */
error = -EINVAL;
if (!a[OVS_FLOW_ATTR_KEY]) {
@@ -1274,6 +1286,9 @@ static int ovs_flow_cmd_del(struct sk_buff *skb, struct 
genl_info *info)
bool log = !a[OVS_FLOW_ATTR_PROBE];
bool ufid_present;
 
+   if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
+   return -EPERM;
+
ufid_present = ovs_nla_get_ufid(, a[OVS_FLOW_ATTR_UFID], log);
if (a[OVS_FLOW_ATTR_KEY]) {
ovs_match_init(, , NULL);
@@ -1391,12 +1406,12 @@ static const struct nla_policy 
flow_policy[OVS_FLOW_ATTR_MAX + 1] = {
 
 static const struct genl_ops dp_flow_genl_ops[] = {
{ .cmd = OVS_FLOW_CMD_NEW,
- .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
+ .flags = 0,
  .policy = flow_policy,
  .doit = ovs_flow_cmd_new
},
{ .cmd = OVS_FLOW_CMD_DEL,
- .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
+ .flags = 0,
  .policy = flow_policy,
  .doit = ovs_flow_cmd_del
},
@@ -1407,7 +1422,7 @@ static const struct genl_ops dp_flow_genl_ops[] = {
  .dumpit = ovs_flow_cmd_dump
},
{ .cmd = OVS_FLOW_CMD_SET,
- .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
+ .flags = 0,
  .policy = flow_policy,
  .doit = ovs_flow_cmd_set,
},
@@ -1530,8 +1545,12 @@ static int ovs_dp_cmd_new(struct sk_buff *skb, struct 
genl_info *info)
struct datapath *dp;
struct vport *vport;
struct ovs_net *ovs_net;
+   struct net *net = sock_net(skb->sk);
int err, i;
 
+   if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
+   return -EPERM;
+
err = -EINVAL;
if (!a[OVS_DP_ATTR_NAME] || !a[OVS_DP_ATTR_UPCALL_PID])
goto err;
@@ -1655,8 +1674,12 @@ static int ovs_dp_cmd_del(struct sk_buff *skb, struct 
genl_info *info)
 {
struct sk_buff *reply;
struct datapath *dp;
+   struct net *net = sock_net(skb->sk);
int err;
 
+   if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
+   return -EPERM;
+
reply = ovs_dp_cmd_alloc_info(info);
if (!reply)
return -ENOMEM;
@@ -1688,8 +1711,12 @@ static int ovs_dp_cmd_set(struct sk_buff *skb, struct 
genl_inf

Re: [PATCH] openvswitch: allow management from inside user namespaces

2016-01-29 Thread Tycho Andersen

Hi Eric,

Thanks for the review.

On Fri, Jan 29, 2016 at 08:29:55AM -0600, Eric W. Biederman wrote:
> Tycho Andersen <tycho.ander...@canonical.com> writes:
> 
> > Operations with the GENL_ADMIN_PERM flag fail permissions checks because
> > this flag means we call netlink_capable, which uses the init user ns.
> >
> > Instead, let's do permissions checks in each function, but use the netlink
> > socket's user ns instead of the initial one, to allow management of
> > openvswitch resources from inside a user ns.
> >
> > The motivation for this is to be able to run openvswitch in unprivileged
> > containers. I've tested this and it seems to work, but I really have no
> > idea about the security consequences of this patch, so thoughts would be
> > much appreciated.
> 
> So at a quick look using ns_capable this way is probably buggy.
> 
> netlink is goofy (because historically we got this wrong), and I forget
> what the specific rules are.  The general rule is that you need to do
> your permission checks on open/create/connect and not inside send/write
> while processing data.  Otherwise there is a class of privileged
> applications where you can set their stdout to some precreated file
> descriptor and their output can be made to act as a command, bypassing
> your permission checks.

It's worth noting that this patch doesn't move the checks (i.e., they
are still done at write time currently in the kernel), it just relaxes
them to root in the user ns instead of real root. This means I can
currently exploit netlink this way as an unprivileged, just not from
within an unprivileged container.

An alternate version of this patch below might be more favorable, as
we may want to do something like this elsewhere in netlink. I think it
also clarifies the situation a bit, at the cost of adding another
flag.

A third option would be to move this check to connect time, but that
would force everything in the family (since that's the only thing you
connect /to/ in netlink) to have the same required permissions, which
might (probably?) break stuff; e.g. you can call OVS_FLOW_CMD_GET
without CAP_NET_ADMIN, but if we changed everything in the family to
require it, that would break.

Tycho
---
 include/uapi/linux/genetlink.h |  1 +
 net/netlink/genetlink.c|  6 --
 net/openvswitch/datapath.c | 20 ++--
 3 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/include/uapi/linux/genetlink.h b/include/uapi/linux/genetlink.h
index c3363ba..5512c90 100644
--- a/include/uapi/linux/genetlink.h
+++ b/include/uapi/linux/genetlink.h
@@ -21,6 +21,7 @@ struct genlmsghdr {
 #define GENL_CMD_CAP_DO0x02
 #define GENL_CMD_CAP_DUMP  0x04
 #define GENL_CMD_CAP_HASPOL0x08
+#define GENL_UNS_ADMIN_PERM0x10
 
 /*
  * List of reserved static generic netlink identifiers:
diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
index f830326..6bbb3eb 100644
--- a/net/netlink/genetlink.c
+++ b/net/netlink/genetlink.c
@@ -576,8 +576,10 @@ static int genl_family_rcv_msg(struct genl_family *family,
if (ops == NULL)
return -EOPNOTSUPP;
 
-   if ((ops->flags & GENL_ADMIN_PERM) &&
-   !netlink_capable(skb, CAP_NET_ADMIN))
+   if (((ops->flags & GENL_ADMIN_PERM) &&
+   !netlink_capable(skb, CAP_NET_ADMIN)) ||
+   ((ops->flags & GENL_UNS_ADMIN_PERM) &&
+   !netlink_ns_capable(skb, net->user_ns, CAP_NET_ADMIN)))
return -EPERM;
 
if ((nlh->nlmsg_flags & NLM_F_DUMP) == NLM_F_DUMP) {
diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index deadfda..d6f7fe9 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -654,7 +654,7 @@ static const struct nla_policy 
packet_policy[OVS_PACKET_ATTR_MAX + 1] = {
 
 static const struct genl_ops dp_packet_genl_ops[] = {
{ .cmd = OVS_PACKET_CMD_EXECUTE,
- .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
+ .flags = GENL_UNS_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
  .policy = packet_policy,
  .doit = ovs_packet_cmd_execute
}
@@ -1391,12 +1391,12 @@ static const struct nla_policy 
flow_policy[OVS_FLOW_ATTR_MAX + 1] = {
 
 static const struct genl_ops dp_flow_genl_ops[] = {
{ .cmd = OVS_FLOW_CMD_NEW,
- .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
+ .flags = GENL_UNS_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
  .policy = flow_policy,
  .doit = ovs_flow_cmd_new
},
{ .cmd = OVS_FLOW_CMD_DEL,
- .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
+ .flags = GENL_UNS_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
  .policy = flow_policy,
  .doit = ovs_flow_cmd_del

seccomp c/r patch

2015-10-26 Thread Tycho Andersen

Hi all,

Here is a patch that we'd like to go via net-next, as it depends on previous
changes in that tree.

Thanks,

Tycho

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v8] seccomp, ptrace: add support for dumping seccomp filters

2015-10-26 Thread Tycho Andersen

Hi David,

On Mon, Oct 26, 2015 at 04:07:01PM +0900, Kees Cook wrote:
> On Mon, Oct 26, 2015 at 3:46 PM, Kees Cook <keesc...@chromium.org> wrote:
> > Cool, thanks. I'll get this into my tree after kernel summit. Thanks
> > for suffering through all this Tycho!
> 
> Actually, since this depends on changes in net, could this get pulled
> in from that direction?
> 
> Acked-by: Kees Cook <keesc...@chromium.org>

Can we get the attached patch into net-next?

Thanks,

Tycho
>From 5d9be66e4f48e0882a5546376380147f2f711bec Mon Sep 17 00:00:00 2001
From: Tycho Andersen <tycho.ander...@canonical.com>
Date: Fri, 2 Oct 2015 18:49:43 -0600
Subject: [PATCH] seccomp, ptrace: add support for dumping seccomp filters

This patch adds support for dumping a process' (classic BPF) seccomp
filters via ptrace.

PTRACE_SECCOMP_GET_FILTER allows the tracer to dump the user's classic BPF
seccomp filters. addr should be an integer which represents the ith seccomp
filter (0 is the most recently installed filter). data should be a struct
sock_filter * with enough room for the ith filter, or NULL, in which case
the filter is not saved. The return value for this command is the number of
BPF instructions the program represents, or negative in the case of errors.
Command specific errors are ENOENT: which indicates that there is no ith
filter in this seccomp tree, and EMEDIUMTYPE, which indicates that the ith
filter was not installed as a classic BPF filter.

A caveat with this approach is that there is no way to get explicitly at
the heirarchy of seccomp filters, and users need to memcmp() filters to
decide which are inherited. This means that a task which installs two of
the same filter can potentially confuse users of this interface.

v2: * make save_orig const
* check that the orig_prog exists (not necessary right now, but when
   grows eBPF support it will be)
* s/n/filter_off and make it an unsigned long to match ptrace
* count "down" the tree instead of "up" when passing a filter offset

v3: * don't take the current task's lock for inspecting its seccomp mode
* use a 0x42** constant for the ptrace command value

v4: * don't copy to userspace while holding spinlocks

v5: * add another condition to WARN_ON

v6: * rebase on net-next

Signed-off-by: Tycho Andersen <tycho.ander...@canonical.com>
Acked-by: Kees Cook <keesc...@chromium.org>
CC: Will Drewry <w...@chromium.org>
Reviewed-by: Oleg Nesterov <o...@redhat.com>
CC: Andy Lutomirski <l...@amacapital.net>
CC: Pavel Emelyanov <xe...@parallels.com>
CC: Serge E. Hallyn <serge.hal...@ubuntu.com>
CC: Alexei Starovoitov <a...@kernel.org>
CC: Daniel Borkmann <dan...@iogearbox.net>
---
 include/linux/seccomp.h | 11 +++
 include/uapi/linux/ptrace.h |  2 ++
 kernel/ptrace.c |  5 +++
 kernel/seccomp.c| 76 -
 4 files changed, 93 insertions(+), 1 deletion(-)

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index f426503..2296e6b 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -95,4 +95,15 @@ static inline void get_seccomp_filter(struct task_struct *tsk)
 	return;
 }
 #endif /* CONFIG_SECCOMP_FILTER */
+
+#if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_CHECKPOINT_RESTORE)
+extern long seccomp_get_filter(struct task_struct *task,
+			   unsigned long filter_off, void __user *data);
+#else
+static inline long seccomp_get_filter(struct task_struct *task,
+  unsigned long n, void __user *data)
+{
+	return -EINVAL;
+}
+#endif /* CONFIG_SECCOMP_FILTER && CONFIG_CHECKPOINT_RESTORE */
 #endif /* _LINUX_SECCOMP_H */
diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
index a7a6979..fb81065 100644
--- a/include/uapi/linux/ptrace.h
+++ b/include/uapi/linux/ptrace.h
@@ -64,6 +64,8 @@ struct ptrace_peeksiginfo_args {
 #define PTRACE_GETSIGMASK	0x420a
 #define PTRACE_SETSIGMASK	0x420b
 
+#define PTRACE_SECCOMP_GET_FILTER	0x420c
+
 /* Read signals from a shared (process wide) queue */
 #define PTRACE_PEEKSIGINFO_SHARED	(1 << 0)
 
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 787320d..b760bae 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -1016,6 +1016,11 @@ int ptrace_request(struct task_struct *child, long request,
 		break;
 	}
 #endif
+
+	case PTRACE_SECCOMP_GET_FILTER:
+		ret = seccomp_get_filter(child, addr, datavp);
+		break;
+
 	default:
 		break;
 	}
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 06858a7..580ac2d 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -347,6 +347,7 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
 {
 	struct seccomp_filter *sfilter;
 	int ret;
+	const bool save_orig = config_enabled(CONFIG_CHECKPOINT_RESTORE);
 
 	if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
 		return ERR_PTR(-EINVAL);
@@ -37

v5 of seccomp filter c/r patches

2015-10-02 Thread Tycho Andersen

Hi all,

Here's v5 of the seccomp filter c/r set. The individual patch notes have
changes, but two highlights are:

* This series is now based on http://patchwork.ozlabs.org/patch/525492/ and
  will need to be built with that patch applied. This gets rid of two incorrect
  patches in the previous series and is a nicer API.

* I couldn't figure out a nice way to have SECCOMP_GET_FILTER_FD return the
  same struct file across calls, so we still need a kcmp command. I've narrowed
  the scope of the one being added to only compare seccomp fds.

Thoughts welcome,

Tycho

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v5 1/3] seccomp: add the concept of a seccomp filter FD

2015-10-02 Thread Tycho Andersen

This patch introduces the concept of a seccomp fd, with a similar interface
and usage to ebpf fds. Initially, one is allowed to create, install, and
dump these fds. Any manipulation of seccomp fds requires users to be root
in their own user namespace, matching the checks done for
SECCOMP_SET_MODE_FILTER.

v2: Force users to specify the parent (as another fd) during fd creation,
and don't allow them to install filters when a previously installed
filter is not the parent of the to be installed filter. This avoids
"re-parenting" scenarios, which can be racy or perhaps insecure.

Signed-off-by: Tycho Andersen <tycho.ander...@canonical.com>
CC: Kees Cook <keesc...@chromium.org>
CC: Will Drewry <w...@chromium.org>
CC: Oleg Nesterov <o...@redhat.com>
CC: Andy Lutomirski <l...@amacapital.net>
CC: Pavel Emelyanov <xe...@parallels.com>
CC: Serge E. Hallyn <serge.hal...@ubuntu.com>
CC: Alexei Starovoitov <a...@kernel.org>
CC: Daniel Borkmann <dan...@iogearbox.net>
---
 include/linux/seccomp.h  |   5 ++
 include/uapi/linux/seccomp.h |  27 ++
 kernel/seccomp.c | 203 ++-
 3 files changed, 232 insertions(+), 3 deletions(-)

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index f426503..4253579 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -85,6 +85,7 @@ static inline int seccomp_mode(struct seccomp *s)
 #ifdef CONFIG_SECCOMP_FILTER
 extern void put_seccomp_filter(struct task_struct *tsk);
 extern void get_seccomp_filter(struct task_struct *tsk);
+extern struct seccomp_filter *seccomp_filter_from_file(struct file *f);
 #else  /* CONFIG_SECCOMP_FILTER */
 static inline void put_seccomp_filter(struct task_struct *tsk)
 {
@@ -94,5 +95,9 @@ static inline void get_seccomp_filter(struct task_struct *tsk)
 {
return;
 }
+static inline struct seccomp_filter *seccomp_filter_from_file(struct file *f)
+{
+   return ERR_PTR(-EINVAL);
+}
 #endif /* CONFIG_SECCOMP_FILTER */
 #endif /* _LINUX_SECCOMP_H */
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 0f238a4..e9f3660 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -13,10 +13,16 @@
 /* Valid operations for seccomp syscall. */
 #define SECCOMP_SET_MODE_STRICT0
 #define SECCOMP_SET_MODE_FILTER1
+#define SECCOMP_FILTER_FD  2
 
 /* Valid flags for SECCOMP_SET_MODE_FILTER */
 #define SECCOMP_FILTER_FLAG_TSYNC  1
 
+/* Valid commands for SECCOMP_FILTER_FD */
+#define SECCOMP_FD_NEW 0
+#define SECCOMP_FD_INSTALL 1
+#define SECCOMP_FD_DUMP2
+
 /*
  * All BPF programs must return a 32-bit value.
  * The bottom 16-bits are for optional return data.
@@ -51,4 +57,25 @@ struct seccomp_data {
__u64 args[6];
 };
 
+struct seccomp_fd {
+   __u32   size;
+
+   union {
+   /* SECCOMP_FD_NEW */
+   struct {
+   struct sock_fprog __user*new_prog;
+   int new_parent;
+   };
+
+   /* SECCOMP_FD_INSTALL */
+   int install_fd;
+
+   /* SECCOMP_FD_DUMP */
+   struct {
+   int dump_fd;
+   struct sock_filter __user   *insns;
+   };
+   };
+};
+
 #endif /* _UAPI_LINUX_SECCOMP_H */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 06858a7..ea3337d 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -26,6 +26,8 @@
 #endif
 
 #ifdef CONFIG_SECCOMP_FILTER
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -474,10 +476,8 @@ static inline void seccomp_filter_free(struct 
seccomp_filter *filter)
}
 }
 
-/* put_seccomp_filter - decrements the ref count of tsk->seccomp.filter */
-void put_seccomp_filter(struct task_struct *tsk)
+static void seccomp_filter_decref(struct seccomp_filter *orig)
 {
-   struct seccomp_filter *orig = tsk->seccomp.filter;
/* Clean up single-reference branches iteratively. */
while (orig && atomic_dec_and_test(>usage)) {
struct seccomp_filter *freeme = orig;
@@ -486,6 +486,12 @@ void put_seccomp_filter(struct task_struct *tsk)
}
 }
 
+/* put_seccomp_filter - decrements the ref count of tsk->seccomp.filter */
+void put_seccomp_filter(struct task_struct *tsk)
+{
+   seccomp_filter_decref(tsk->seccomp.filter);
+}
+
 /**
  * seccomp_send_sigsys - signals the task to allow in-process syscall emulation
  * @syscall: syscall number to send to userland
@@ -804,12 +810,201 @@ out_free:
seccomp_filter_free(prepared);
return ret;
 }
+
+int seccomp_fd_release(struct inode *ino, struct file *f)
+{
+   seccomp_filter_decref(f->private_data);
+   return 0;
+}
+
+static const struct file_opera

[PATCH v5 3/3] kcmp: add KCMP_SECCOMP_FD

2015-10-02 Thread Tycho Andersen

This command allows for comparing the filters pointed to by two seccomp
fds. This is useful e.g. to find out if a seccomp filter is inherited,
since struct seccomp_filter are unique across tasks and are the
private_data seccomp fds.

v2: switch to KCMP_SECCOMP_FD instead of KCMP_FILE_PRIVATE_DATA

Signed-off-by: Tycho Andersen <tycho.ander...@canonical.com>
CC: Kees Cook <keesc...@chromium.org>
CC: Will Drewry <w...@chromium.org>
CC: Oleg Nesterov <o...@redhat.com>
CC: Andy Lutomirski <l...@amacapital.net>
CC: Pavel Emelyanov <xe...@parallels.com>
CC: Serge E. Hallyn <serge.hal...@ubuntu.com>
CC: Alexei Starovoitov <a...@kernel.org>
CC: Daniel Borkmann <dan...@iogearbox.net>
---
 include/uapi/linux/kcmp.h |  1 +
 kernel/kcmp.c | 27 +++
 2 files changed, 28 insertions(+)

diff --git a/include/uapi/linux/kcmp.h b/include/uapi/linux/kcmp.h
index 84df14b..cd7870b 100644
--- a/include/uapi/linux/kcmp.h
+++ b/include/uapi/linux/kcmp.h
@@ -10,6 +10,7 @@ enum kcmp_type {
KCMP_SIGHAND,
KCMP_IO,
KCMP_SYSVSEM,
+   KCMP_SECCOMP_FD,
 
KCMP_TYPES,
 };
diff --git a/kernel/kcmp.c b/kernel/kcmp.c
index 0aa69ea..d53db53 100644
--- a/kernel/kcmp.c
+++ b/kernel/kcmp.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -165,6 +166,32 @@ SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type,
ret = -EOPNOTSUPP;
 #endif
break;
+   case KCMP_SECCOMP_FD: {
+   struct file *filp1, *filp2;
+
+   filp1 = get_file_raw_ptr(task1, idx1);
+   filp2 = get_file_raw_ptr(task2, idx2);
+
+   if (filp1 && filp2) {
+   struct seccomp_filter *filter1, *filter2;
+
+   filter1 = seccomp_filter_from_file(filp1);
+   if (IS_ERR(filter1)) {
+   ret = PTR_ERR(filter1);
+   break;
+   }
+
+   filter2 = seccomp_filter_from_file(filp2);
+   if (IS_ERR(filter2)) {
+   ret = PTR_ERR(filter2);
+   break;
+   }
+
+   ret = kcmp_ptr(filter1, filter2, KCMP_SECCOMP_FD);
+   } else
+   ret = -EBADF;
+   break;
+   }
default:
ret = -EINVAL;
break;
-- 
2.5.0

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v5 2/3] seccomp: add a ptrace command to get seccomp filter fds

2015-10-02 Thread Tycho Andersen

I just picked 40 for the constant out of thin air, but there may be a more
appropriate value for this. Also, we return EINVAL when there is no filter
for the index the user requested, but ptrace also returns EINVAL for
invalid commands, making it slightly awkward to test whether or not the
kernel supports this feature. It can still be done via,

if (is_in_mode_filter(pid)) {
int fd;

fd = ptrace(PTRACE_SECCOMP_GET_FILTER_FD, pid, NULL, 0);
if (fd < 0 && errno == -EINVAL)
/* not supported */

...
}

since being in SECCOMP_MODE_FILTER implies that there is at least one
filter. If there is a more appropriate errno (ESRCH collides as well with
ptrace) to give here that may be better.

v2: use new bpf interface save_orig to save the original filter when
necessary

Signed-off-by: Tycho Andersen <tycho.ander...@canonical.com>
CC: Kees Cook <keesc...@chromium.org>
CC: Will Drewry <w...@chromium.org>
CC: Oleg Nesterov <o...@redhat.com>
CC: Andy Lutomirski <l...@amacapital.net>
CC: Pavel Emelyanov <xe...@parallels.com>
CC: Serge E. Hallyn <serge.hal...@ubuntu.com>
CC: Alexei Starovoitov <a...@kernel.org>
CC: Daniel Borkmann <dan...@iogearbox.net>
---
 include/linux/seccomp.h |  9 +
 include/uapi/linux/ptrace.h |  2 ++
 kernel/ptrace.c |  4 
 kernel/seccomp.c| 31 ++-
 4 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 4253579..b0b1a52 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -100,4 +100,13 @@ static inline struct seccomp_filter 
*seccomp_filter_from_file(struct file *f)
return ERR_PTR(-EINVAL);
 }
 #endif /* CONFIG_SECCOMP_FILTER */
+
+#if defined(CONFIG_CHECKPOINT_RESTORE) && defined(CONFIG_SECCOMP_FILTER)
+extern long seccomp_get_filter_fd(struct task_struct *task, long data);
+#else
+static inline long seccomp_get_filter_fd(struct task_struct *task, long data)
+{
+   return -EINVAL;
+}
+#endif /* CONFIG_CHECKPOINT_RESTORE && CONFIG_SECCOMP_FILTER */
 #endif /* _LINUX_SECCOMP_H */
diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
index a7a6979..3271f5a 100644
--- a/include/uapi/linux/ptrace.h
+++ b/include/uapi/linux/ptrace.h
@@ -23,6 +23,8 @@
 
 #define PTRACE_SYSCALL   24
 
+#define PTRACE_SECCOMP_GET_FILTER_FD 40
+
 /* 0x4200-0x4300 are reserved for architecture-independent additions.  */
 #define PTRACE_SETOPTIONS  0x4200
 #define PTRACE_GETEVENTMSG 0x4201
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 787320d..aede440 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -1016,6 +1016,10 @@ int ptrace_request(struct task_struct *child, long 
request,
break;
}
 #endif
+
+   case PTRACE_SECCOMP_GET_FILTER_FD:
+   return seccomp_get_filter_fd(child, data);
+
default:
break;
}
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index ea3337d..4d2b8f1 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -349,6 +349,7 @@ static struct seccomp_filter *seccomp_prepare_filter(struct 
sock_fprog *fprog)
 {
struct seccomp_filter *sfilter;
int ret;
+   bool save_orig = config_enabled(CONFIG_CHECKPOINT_RESTORE);
 
if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
return ERR_PTR(-EINVAL);
@@ -372,7 +373,7 @@ static struct seccomp_filter *seccomp_prepare_filter(struct 
sock_fprog *fprog)
return ERR_PTR(-ENOMEM);
 
ret = bpf_prog_create_from_user(>prog, fprog,
-   seccomp_check_filter, false);
+   seccomp_check_filter, save_orig);
if (ret < 0) {
kfree(sfilter);
return ERR_PTR(ret);
@@ -1064,3 +1065,31 @@ long prctl_set_seccomp(unsigned long seccomp_mode, char 
__user *filter)
/* prctl interface doesn't have flags, so they are always zero. */
return do_seccomp(op, 0, uargs);
 }
+
+#if defined(CONFIG_CHECKPOINT_RESTORE) && defined(CONFIG_SECCOMP_FILTER)
+long seccomp_get_filter_fd(struct task_struct *task, long n)
+{
+   struct seccomp_filter *filter;
+   long fd;
+
+   if (task->seccomp.mode != SECCOMP_MODE_FILTER)
+   return -EINVAL;
+
+   filter = task->seccomp.filter;
+   while (n > 0 && filter) {
+   filter = filter->prev;
+   n--;
+   }
+
+   if (!filter)
+   return -EINVAL;
+
+   atomic_inc(>usage);
+   fd = anon_inode_getfd("seccomp", _fops, filter,
+ O_RDONLY | O_CLOEXEC);
+   if (fd < 0)
+   seccomp_filter_decref(filter);
+
+   return fd;
+}
+#endif
-- 
2.5.0

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net-next] bpf, seccomp: prepare for upcoming criu support

2015-10-02 Thread Tycho Andersen

Hi Daniel,

On Fri, Oct 02, 2015 at 03:17:33PM +0200, Daniel Borkmann wrote:
> The current ongoing effort to dump existing cBPF seccomp filters back
> to user space requires to hold the pre-transformed instructions like
> we do in case of socket filters from sk_attach_filter() side, so they
> can be reloaded in original form at a later point in time by utilities
> such as criu.
> 
> To prepare for this, simply extend the bpf_prog_create_from_user()
> API to hold a flag that tells whether we should store the original
> or not. Also, fanout filters could make use of that in future for
> things like diag. While fanout filters already use bpf_prog_destroy(),
> move seccomp over to them as well to handle original programs when
> present.
> 
> Signed-off-by: Daniel Borkmann <dan...@iogearbox.net>
> Cc: Tycho Andersen <tycho.ander...@canonical.com>

Thanks for this patch. When I rebase my tree on top of it it works
fine,

Tested-by: Tycho Andersen <tycho.ander...@canonical.com>

> Cc: Pavel Emelyanov <xe...@parallels.com>
> Cc: Kees Cook <keesc...@chromium.org>
> Cc: Andy Lutomirski <l...@amacapital.net>
> Cc: Alexei Starovoitov <a...@plumgrid.com>
> ---
>  This is in realtion to Tycho's latest patch set under [1]. The BPF
>  handling is unfortunately not correct (triggering a crash on kernels
>  that can set pages as ro).
> 
>  This patch here provides a minimal, simple interface from BPF API side
>  as a possible step forward, so that the focus can then be on seccomp
>  side wrt criu. F.e., dumping could happen similarly as in Pavel's
>  sk_get_filter().
> 
>  I have tested/based this against net-next, but have no issues whether
>  Kees wants to take it, or whether it should go through both trees to
>  reduce merge issues as once the case with 0fc174dea545 ("ebpf: make
>  internal bpf API independent of CONFIG_BPF_SYSCALL ifdefs").

I'll send out a revised version of my set with Andy's comments later
today and not include this patch. Let me know if I should do something
differently.

Thanks,

Tycho
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: v5 of seccomp filter c/r patches

2015-10-02 Thread Tycho Andersen

On Fri, Oct 02, 2015 at 03:52:03PM -0700, Andy Lutomirski wrote:
> On Fri, Oct 2, 2015 at 3:44 PM, Tycho Andersen
> <tycho.ander...@canonical.com> wrote:
> > On Fri, Oct 02, 2015 at 02:10:24PM -0700, Kees Cook wrote:
> >> On Fri, Oct 2, 2015 at 9:27 AM, Tycho Andersen
> >> <tycho.ander...@canonical.com> wrote:
> >> > Hi all,
> >> >
> >> > Here's v5 of the seccomp filter c/r set. The individual patch notes have
> >> > changes, but two highlights are:
> >> >
> >> > * This series is now based on http://patchwork.ozlabs.org/patch/525492/ 
> >> > and
> >> >   will need to be built with that patch applied. This gets rid of two 
> >> > incorrect
> >> >   patches in the previous series and is a nicer API.
> >> >
> >> > * I couldn't figure out a nice way to have SECCOMP_GET_FILTER_FD return 
> >> > the
> >> >   same struct file across calls, so we still need a kcmp command. I've 
> >> > narrowed
> >> >   the scope of the one being added to only compare seccomp fds.
> >> >
> >> > Thoughts welcome,
> >>
> >> Hi, sorry I've been slow/busy. I'm finally reading through these threads.
> >>
> >> Happy bit:
> >> - avoiding eBPF and just saving the original filters makes things much 
> >> easier.
> >>
> >> Sad bit:
> >> - inventing a new interface for seccompfds feels like massive overkill to 
> >> me.
> >>
> >> While Andy has big dreams, we're not presently doing seccompfd
> >> monitoring, etc. There's no driving user for that kind of interface,
> >> and accepting the maintenance burden of it only for CRIU seems unwise.
> >>
> >> So, I'll go back to what I originally proposed at LSS (which it looks
> >> like we're half way there now):
> >>
> >> - save the original filter (done!)
> >> - extract filters through a single special-purpose interface (looks
> >> like ptrace is the way to go: root-only, stopped process, etc)
> >> - compare filter content and issue TSYNCs to merge detected sibling
> >> threads, since merging things that weren't merged before creates no
> >> problems.
> >>
> >> This means the parenting logic is heuristic, but it's entirely in
> >> userspace, so the complexity burden doesn't live in seccomp which we,
> >> by design, want to keep as simple as possible.
> >
> > Ok, how about,
> >
> > struct sock_filter insns[BPF_MAXINSNS];
> > insn_cnt = ptrace(PTRACE_SECCOMP_GET_FILTER, pid, insns, i);
> >
> > when asking for the ith filter? It returns either the number of
> > instructions, -EINVAL if something was wrong (i, pid,
> > CONFIG_CHECKPOINT_RESTORE isn't enabled). While it would always
> > succeed now, if/when the underlying filter was not created from a bpf
> > classic filter, we can return -EMEDIUMTYPE? (Suggestions welcome, I
> > picked this mostly based on what sounds nice.)
> >
> 
> Are we still requiring global permissions or that the caller isn't
> seccomped at all?  I've not lost track of how we're resolving the case
> where the caller and the tracee have exactly the same seccomp state
> (or the tracee is derived from the caller's state or they're totally
> unrelated states).

At least for now, I think requiring real root and no seccomp is fine,
so I'll do that.

Tycho
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: v5 of seccomp filter c/r patches

2015-10-02 Thread Tycho Andersen

On Sat, Oct 03, 2015 at 12:57:49AM +0200, Daniel Borkmann wrote:
> On 10/03/2015 12:44 AM, Tycho Andersen wrote:
> >On Fri, Oct 02, 2015 at 02:10:24PM -0700, Kees Cook wrote:
> ...
> >Ok, how about,
> >
> >struct sock_filter insns[BPF_MAXINSNS];
> >insn_cnt = ptrace(PTRACE_SECCOMP_GET_FILTER, pid, insns, i);
> 
> Would also be good that when the storage buffer (insns) is NULL,
> it just returns you the number of sock_filter insns (or 0 when
> nothing attached).
> 
> That would be consistent with classic socket filters (see
> sk_get_filter()), and user space could allocate a specific
> size instead of always passing in max insns.

Yep, the current set does this with SECCOMP_FD_DUMP and I agree that
it's nice behavior, so I'll plan on preserving it.

Thanks,

Tycho
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: v5 of seccomp filter c/r patches

2015-10-02 Thread Tycho Andersen

On Fri, Oct 02, 2015 at 02:10:24PM -0700, Kees Cook wrote:
> On Fri, Oct 2, 2015 at 9:27 AM, Tycho Andersen
> <tycho.ander...@canonical.com> wrote:
> > Hi all,
> >
> > Here's v5 of the seccomp filter c/r set. The individual patch notes have
> > changes, but two highlights are:
> >
> > * This series is now based on http://patchwork.ozlabs.org/patch/525492/ and
> >   will need to be built with that patch applied. This gets rid of two 
> > incorrect
> >   patches in the previous series and is a nicer API.
> >
> > * I couldn't figure out a nice way to have SECCOMP_GET_FILTER_FD return the
> >   same struct file across calls, so we still need a kcmp command. I've 
> > narrowed
> >   the scope of the one being added to only compare seccomp fds.
> >
> > Thoughts welcome,
> 
> Hi, sorry I've been slow/busy. I'm finally reading through these threads.
> 
> Happy bit:
> - avoiding eBPF and just saving the original filters makes things much easier.
> 
> Sad bit:
> - inventing a new interface for seccompfds feels like massive overkill to me.
> 
> While Andy has big dreams, we're not presently doing seccompfd
> monitoring, etc. There's no driving user for that kind of interface,
> and accepting the maintenance burden of it only for CRIU seems unwise.
> 
> So, I'll go back to what I originally proposed at LSS (which it looks
> like we're half way there now):
> 
> - save the original filter (done!)
> - extract filters through a single special-purpose interface (looks
> like ptrace is the way to go: root-only, stopped process, etc)
> - compare filter content and issue TSYNCs to merge detected sibling
> threads, since merging things that weren't merged before creates no
> problems.
> 
> This means the parenting logic is heuristic, but it's entirely in
> userspace, so the complexity burden doesn't live in seccomp which we,
> by design, want to keep as simple as possible.

Ok, how about,

struct sock_filter insns[BPF_MAXINSNS];
insn_cnt = ptrace(PTRACE_SECCOMP_GET_FILTER, pid, insns, i);

when asking for the ith filter? It returns either the number of
instructions, -EINVAL if something was wrong (i, pid,
CONFIG_CHECKPOINT_RESTORE isn't enabled). While it would always
succeed now, if/when the underlying filter was not created from a bpf
classic filter, we can return -EMEDIUMTYPE? (Suggestions welcome, I
picked this mostly based on what sounds nice.)

Tycho
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 4/5] kcmp: add KCMP_FILE_PRIVATE_DATA

2015-10-01 Thread Tycho Andersen

On Wed, Sep 30, 2015 at 02:48:47PM -0700, Andy Lutomirski wrote:
> On Wed, Sep 30, 2015 at 2:39 PM, Tycho Andersen
> <tycho.ander...@canonical.com> wrote:
> > On Wed, Sep 30, 2015 at 11:56:25AM -0700, Andy Lutomirski wrote:
> >> On Wed, Sep 30, 2015 at 11:55 AM, Tycho Andersen
> >> <tycho.ander...@canonical.com> wrote:
> >> > On Wed, Sep 30, 2015 at 11:47:05AM -0700, Andy Lutomirski wrote:
> >> >> On Wed, Sep 30, 2015 at 11:41 AM, Tycho Andersen
> >> >> <tycho.ander...@canonical.com> wrote:
> >> >> > On Wed, Sep 30, 2015 at 11:25:41AM -0700, Andy Lutomirski wrote:
> >> >> >> On Wed, Sep 30, 2015 at 11:13 AM, Tycho Andersen
> >> >> >> <tycho.ander...@canonical.com> wrote:
> >> >> >> > This command allows comparing the underling private data of two 
> >> >> >> > fds. This
> >> >> >> > is useful e.g. to find out if a seccomp filter is inherited, since 
> >> >> >> > struct
> >> >> >> > seccomp_filter are unique across tasks and are the private_data 
> >> >> >> > seccomp
> >> >> >> > fds.
> >> >> >>
> >> >> >> This is very implementation-specific and may have nasty ABI
> >> >> >> consequences far outside seccomp.  Let's do something specific to
> >> >> >> seccomp and/or eBPF.
> >> >> >
> >> >> > We could change the name to a less generic KCMP_SECCOMP_FD or
> >> >> > something, but without some sort of GUID on each struct
> >> >> > seccomp_filter, the implementation would be effectively the same as it
> >> >> > is today. Is that enough, or do we need a GUID?
> >> >> >
> >> >>
> >> >> I don't care about the GUID.  I think we should name it
> >> >> KCMP_SECCOMP_FD and make it only work on seccomp fds.
> >> >
> >> > Ok, I can do that.
> >> >
> >> >> Alternatively, we could figure out why KCMP_FILE doesn't do the trick
> >> >> and consider fixing it.  IMO it's really too bad that struct file is
> >> >> so heavyweight that we can't really just embed one in all kinds of
> >> >> structures.
> >> >
> >> > The problem is that KCMP_FILE compares the file objects themselves,
> >> > instead of the underlying data. If I ask for a seccomp fd for filter 0
> >> > twice, I'll have two different file objects and they won't be equal. I
> >> > suppose we could add some special logic inside KCMP_FILE to compare
> >> > the underlying data in special cases (seccomp, ebpf, others?), but it
> >> > seems cleaner to have a separate command as you described above.
> >> >
> >>
> >> What I meant was that maybe we could get the two requests to actually
> >> produce the same struct file.  But that could get very messy
> >> memory-wise.
> >
> > I see. The attached patch seems to work with KCMP_FILE and doesn't
> > look too bad if you don't mind the circular references. What do you
> > think?
> >
> 
> Could be reasonable.  I'm not that well versed on what fd_release
> does.  Are we guaranteed that it's called when the last fd goes away
> even if the struct file is pinned by the struct seccomp filter but is
> otherwise unreferenced?

After thinking about the patch a bit more, I think it's not safe.
There is a race where the last referenced is closed at the same time
someone else asks for the strcut file via ptrace(). I thought the lock
would fix this, but it doesn't. I'll see if I can find another way.

Tycho
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 2/5] seccomp: add the concept of a seccomp filter FD

2015-09-30 Thread Tycho Andersen

On Wed, Sep 30, 2015 at 11:27:34AM -0700, Andy Lutomirski wrote:
> On Wed, Sep 30, 2015 at 11:13 AM, Tycho Andersen
> <tycho.ander...@canonical.com> wrote:
> > This patch introduces the concept of a seccomp fd, with a similar interface
> > and usage to ebpf fds. Initially, one is allowed to create, install, and
> > dump these fds. Any manipulation of seccomp fds requires users to be root
> > in their own user namespace, matching the checks done for
> > SECCOMP_SET_MODE_FILTER.
> >
> > Installing a filterfd has some gotchas, though. Andy mentioned previously
> > that we should restrict installation to filter fds whose parent is already
> > in the filter tree. This doesn't quite work in the case of created seccomp
> > fds, since once you install a filter fd, you can't install any other filter
> > fd since it has no parent and there is no way to "pre-chain" filters before
> > installing them.
> 
> ISTM, if we like the seccomp fd approach, we should have them be
> created with a parent already set.  IOW the default should be that
> their parent is the creator's seccomp fd and, if needed, creators
> could specify a different parent.

Allowing people doing SECCOMP_FD_NEW to specify a parent fd would
work. Then we can disallow installing a seccomp fd if its parent is
not the current filter, and get rid of the whole mess with prev
locking and all that.

Tycho
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

checkpoint/restore of seccomp filters v3

2015-09-30 Thread Tycho Andersen

Hi all,

Here's a re-worked set for c/r of seccomp filters which keeps around the
original bpf program passed to the kernel instead of trying to dump the
ebpf version. There are various comments/questions in the individual patch
notes.

I'm not sure this needs to go via net-next any more, as the impact in net/
is fairly minimal, and it seems more seccomp heavy. As such, this set is
based on seccomp/tip.

Thoughts welcome,

Tycho

P.S. Man page patches to come once we agree on the API :)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 1/5] seccomp: save the original filter

2015-09-30 Thread Tycho Andersen

In order to implement checkpoint of seccomp filters, we need to keep track
of the original filter as the user gave it to us. Since we're doing this,
we need to also use bpf_prog_destroy to free the struct bpf_brogs so we
don't leak this memory.

Signed-off-by: Tycho Andersen <tycho.ander...@canonical.com>
CC: Kees Cook <keesc...@chromium.org>
CC: Will Drewry <w...@chromium.org>
CC: Oleg Nesterov <o...@redhat.com>
CC: Andy Lutomirski <l...@amacapital.net>
CC: Pavel Emelyanov <xe...@parallels.com>
CC: Serge E. Hallyn <serge.hal...@ubuntu.com>
CC: Alexei Starovoitov <a...@kernel.org>
CC: Daniel Borkmann <dan...@iogearbox.net>
---
 include/linux/filter.h |  2 ++
 kernel/seccomp.c   | 24 
 net/core/filter.c  |  4 ++--
 3 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index fa2cab9..6c045ba 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -410,6 +410,8 @@ int bpf_prog_create(struct bpf_prog **pfp, struct 
sock_fprog_kern *fprog);
 int bpf_prog_create_from_user(struct bpf_prog **pfp, struct sock_fprog *fprog,
  bpf_aux_classic_check_t trans);
 void bpf_prog_destroy(struct bpf_prog *fp);
+int bpf_prog_store_orig_filter(struct bpf_prog *fp,
+  const struct sock_fprog *fprog);
 
 int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk);
 int sk_attach_bpf(u32 ufd, struct sock *sk);
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 5bd4779..09f3769 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -337,6 +337,14 @@ static inline void seccomp_sync_threads(void)
}
 }
 
+static inline void seccomp_filter_free(struct seccomp_filter *filter)
+{
+   if (filter) {
+   bpf_prog_destroy(filter->prog);
+   kfree(filter);
+   }
+}
+
 /**
  * seccomp_prepare_filter: Prepares a seccomp filter for use.
  * @fprog: BPF program to install
@@ -376,6 +384,14 @@ static struct seccomp_filter 
*seccomp_prepare_filter(struct sock_fprog *fprog)
return ERR_PTR(ret);
}
 
+   if (config_enabled(CONFIG_CHECKPOINT_RESTORE)) {
+   ret = bpf_prog_store_orig_filter(sfilter->prog, fprog);
+   if (ret < 0) {
+   seccomp_filter_free(sfilter);
+   return ERR_PTR(ret);
+   }
+   }
+
atomic_set(>usage, 1);
 
return sfilter;
@@ -466,14 +482,6 @@ void get_seccomp_filter(struct task_struct *tsk)
atomic_inc(>usage);
 }
 
-static inline void seccomp_filter_free(struct seccomp_filter *filter)
-{
-   if (filter) {
-   bpf_prog_free(filter->prog);
-   kfree(filter);
-   }
-}
-
 /* put_seccomp_filter - decrements the ref count of tsk->seccomp.filter */
 void put_seccomp_filter(struct task_struct *tsk)
 {
diff --git a/net/core/filter.c b/net/core/filter.c
index 13079f0..70995dd 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -832,8 +832,8 @@ static int bpf_check_classic(const struct sock_filter 
*filter,
return -EINVAL;
 }
 
-static int bpf_prog_store_orig_filter(struct bpf_prog *fp,
- const struct sock_fprog *fprog)
+int bpf_prog_store_orig_filter(struct bpf_prog *fp,
+  const struct sock_fprog *fprog)
 {
unsigned int fsize = bpf_classic_proglen(fprog);
struct sock_fprog_kern *fkprog;
-- 
2.5.0

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 2/5] seccomp: add the concept of a seccomp filter FD

2015-09-30 Thread Tycho Andersen

This patch introduces the concept of a seccomp fd, with a similar interface
and usage to ebpf fds. Initially, one is allowed to create, install, and
dump these fds. Any manipulation of seccomp fds requires users to be root
in their own user namespace, matching the checks done for
SECCOMP_SET_MODE_FILTER.

Installing a filterfd has some gotchas, though. Andy mentioned previously
that we should restrict installation to filter fds whose parent is already
in the filter tree. This doesn't quite work in the case of created seccomp
fds, since once you install a filter fd, you can't install any other filter
fd since it has no parent and there is no way to "pre-chain" filters before
installing them. To work around this, we allow installing filters who have
no parent. If the filter has a parent, we require the current filter try to
be an ancestor of it.

I'm not quite sure that the ancestor restriction is correct, since it can
still allow for "re-parenting" of filters, potentially introducing new
filters to a task. However, since these operations are limited to root in
the user ns, perhaps it is ok. There is also some potentially racy
behavior where a task re-parents a filter that another task has installed.

One option to work around this is to keep a bit on struct seccomp_filter to
allow each filter to have its parent set exactly once. (This would still
allow you to install a filter multiple times, as long as the parent was the
same in each case.)

Signed-off-by: Tycho Andersen <tycho.ander...@canonical.com>
CC: Kees Cook <keesc...@chromium.org>
CC: Will Drewry <w...@chromium.org>
CC: Oleg Nesterov <o...@redhat.com>
CC: Andy Lutomirski <l...@amacapital.net>
CC: Pavel Emelyanov <xe...@parallels.com>
CC: Serge E. Hallyn <serge.hal...@ubuntu.com>
CC: Alexei Starovoitov <a...@kernel.org>
CC: Daniel Borkmann <dan...@iogearbox.net>
---
 include/uapi/linux/seccomp.h |  24 ++
 kernel/seccomp.c | 189 ++-
 2 files changed, 210 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 0f238a4..4ee8770 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -13,10 +13,16 @@
 /* Valid operations for seccomp syscall. */
 #define SECCOMP_SET_MODE_STRICT0
 #define SECCOMP_SET_MODE_FILTER1
+#define SECCOMP_FILTER_FD  2
 
 /* Valid flags for SECCOMP_SET_MODE_FILTER */
 #define SECCOMP_FILTER_FLAG_TSYNC  1
 
+/* Valid commands for SECCOMP_FILTER_FD */
+#define SECCOMP_FD_NEW 0
+#define SECCOMP_FD_INSTALL 1
+#define SECCOMP_FD_DUMP2
+
 /*
  * All BPF programs must return a 32-bit value.
  * The bottom 16-bits are for optional return data.
@@ -51,4 +57,22 @@ struct seccomp_data {
__u64 args[6];
 };
 
+struct seccomp_fd {
+   __u32   size;
+
+   union {
+   /* SECCOMP_FD_NEW */
+   struct sock_fprog __user*new_prog;
+
+   /* SECCOMP_FD_INSTALL */
+   int install_fd;
+
+   /* SECCOMP_FD_DUMP */
+   struct {
+   int dump_fd;
+   struct sock_filter __user   *insns;
+   };
+   };
+};
+
 #endif /* _UAPI_LINUX_SECCOMP_H */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 09f3769..6f0465c 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -26,6 +26,8 @@
 #endif
 
 #ifdef CONFIG_SECCOMP_FILTER
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -58,6 +60,7 @@ struct seccomp_filter {
atomic_t usage;
struct seccomp_filter *prev;
struct bpf_prog *prog;
+   spinlock_t prev_lock;
 };
 
 /* Limit any path through the tree to 256KB worth of instructions. */
@@ -393,6 +396,7 @@ static struct seccomp_filter *seccomp_prepare_filter(struct 
sock_fprog *fprog)
}
 
atomic_set(>usage, 1);
+   sfilter->prev_lock = __SPIN_LOCK_UNLOCKED(>prev_lock);
 
return sfilter;
 }
@@ -441,6 +445,7 @@ static long seccomp_attach_filter(unsigned int flags,
struct seccomp_filter *walker;
 
assert_spin_locked(>sighand->siglock);
+   assert_spin_locked(>prev_lock);
 
/* Validate resulting filter length. */
total_insns = filter->prog->len;
@@ -482,10 +487,8 @@ void get_seccomp_filter(struct task_struct *tsk)
atomic_inc(>usage);
 }
 
-/* put_seccomp_filter - decrements the ref count of tsk->seccomp.filter */
-void put_seccomp_filter(struct task_struct *tsk)
+static void seccomp_filter_decref(struct seccomp_filter *orig)
 {
-   struct seccomp_filter *orig = tsk->seccomp.filter;
/* Clean up single-reference branches iteratively. */
while (orig && atomic_dec_and_test(>usage)) {
struct seccomp_filter *freeme = orig;
@@ -494,6 +497,1

[PATCH v3 4/5] kcmp: add KCMP_FILE_PRIVATE_DATA

2015-09-30 Thread Tycho Andersen

This command allows comparing the underling private data of two fds. This
is useful e.g. to find out if a seccomp filter is inherited, since struct
seccomp_filter are unique across tasks and are the private_data seccomp
fds.

Signed-off-by: Tycho Andersen <tycho.ander...@canonical.com>
CC: Kees Cook <keesc...@chromium.org>
CC: Will Drewry <w...@chromium.org>
CC: Oleg Nesterov <o...@redhat.com>
CC: Andy Lutomirski <l...@amacapital.net>
CC: Pavel Emelyanov <xe...@parallels.com>
CC: Serge E. Hallyn <serge.hal...@ubuntu.com>
CC: Alexei Starovoitov <a...@kernel.org>
CC: Daniel Borkmann <dan...@iogearbox.net>
---
 include/uapi/linux/kcmp.h |  1 +
 kernel/kcmp.c | 14 ++
 2 files changed, 15 insertions(+)

diff --git a/include/uapi/linux/kcmp.h b/include/uapi/linux/kcmp.h
index 84df14b..ed389d2 100644
--- a/include/uapi/linux/kcmp.h
+++ b/include/uapi/linux/kcmp.h
@@ -10,6 +10,7 @@ enum kcmp_type {
KCMP_SIGHAND,
KCMP_IO,
KCMP_SYSVSEM,
+   KCMP_FILE_PRIVATE_DATA,
 
KCMP_TYPES,
 };
diff --git a/kernel/kcmp.c b/kernel/kcmp.c
index 0aa69ea..9ae673b 100644
--- a/kernel/kcmp.c
+++ b/kernel/kcmp.c
@@ -165,6 +165,20 @@ SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type,
ret = -EOPNOTSUPP;
 #endif
break;
+   case KCMP_FILE_PRIVATE_DATA: {
+   struct file *filp1, *filp2;
+
+   filp1 = get_file_raw_ptr(task1, idx1);
+   filp2 = get_file_raw_ptr(task2, idx2);
+
+   if (filp1 && filp2)
+   ret = kcmp_ptr(filp1->private_data,
+  filp2->private_data,
+  KCMP_FILE_PRIVATE_DATA);
+   else
+   ret = -EBADF;
+   break;
+   }
default:
ret = -EINVAL;
break;
-- 
2.5.0

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 4/5] kcmp: add KCMP_FILE_PRIVATE_DATA

2015-09-30 Thread Tycho Andersen

On Wed, Sep 30, 2015 at 11:25:41AM -0700, Andy Lutomirski wrote:
> On Wed, Sep 30, 2015 at 11:13 AM, Tycho Andersen
> <tycho.ander...@canonical.com> wrote:
> > This command allows comparing the underling private data of two fds. This
> > is useful e.g. to find out if a seccomp filter is inherited, since struct
> > seccomp_filter are unique across tasks and are the private_data seccomp
> > fds.
> 
> This is very implementation-specific and may have nasty ABI
> consequences far outside seccomp.  Let's do something specific to
> seccomp and/or eBPF.

We could change the name to a less generic KCMP_SECCOMP_FD or
something, but without some sort of GUID on each struct
seccomp_filter, the implementation would be effectively the same as it
is today. Is that enough, or do we need a GUID?

Tycho
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 4/5] kcmp: add KCMP_FILE_PRIVATE_DATA

2015-09-30 Thread Tycho Andersen

On Wed, Sep 30, 2015 at 11:47:05AM -0700, Andy Lutomirski wrote:
> On Wed, Sep 30, 2015 at 11:41 AM, Tycho Andersen
> <tycho.ander...@canonical.com> wrote:
> > On Wed, Sep 30, 2015 at 11:25:41AM -0700, Andy Lutomirski wrote:
> >> On Wed, Sep 30, 2015 at 11:13 AM, Tycho Andersen
> >> <tycho.ander...@canonical.com> wrote:
> >> > This command allows comparing the underling private data of two fds. This
> >> > is useful e.g. to find out if a seccomp filter is inherited, since struct
> >> > seccomp_filter are unique across tasks and are the private_data seccomp
> >> > fds.
> >>
> >> This is very implementation-specific and may have nasty ABI
> >> consequences far outside seccomp.  Let's do something specific to
> >> seccomp and/or eBPF.
> >
> > We could change the name to a less generic KCMP_SECCOMP_FD or
> > something, but without some sort of GUID on each struct
> > seccomp_filter, the implementation would be effectively the same as it
> > is today. Is that enough, or do we need a GUID?
> >
> 
> I don't care about the GUID.  I think we should name it
> KCMP_SECCOMP_FD and make it only work on seccomp fds.

Ok, I can do that.

> Alternatively, we could figure out why KCMP_FILE doesn't do the trick
> and consider fixing it.  IMO it's really too bad that struct file is
> so heavyweight that we can't really just embed one in all kinds of
> structures.

The problem is that KCMP_FILE compares the file objects themselves,
instead of the underlying data. If I ask for a seccomp fd for filter 0
twice, I'll have two different file objects and they won't be equal. I
suppose we could add some special logic inside KCMP_FILE to compare
the underlying data in special cases (seccomp, ebpf, others?), but it
seems cleaner to have a separate command as you described above.

Tycho
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 5/5] bpf: save the program the user actually supplied

2015-09-30 Thread Tycho Andersen

In some cases (e.g. seccomp) the program result might be translated from
the original program the user supplied. If we're saving the result for
checkpoint/restore, we should save exactly the program the user initially
supplied.

This causes problems when the translations seccomp makes are not allowed by
bpf_check_classic.

Signed-off-by: Tycho Andersen <tycho.ander...@canonical.com>
CC: Kees Cook <keesc...@chromium.org>
CC: Will Drewry <w...@chromium.org>
CC: Oleg Nesterov <o...@redhat.com>
CC: Andy Lutomirski <l...@amacapital.net>
CC: Pavel Emelyanov <xe...@parallels.com>
CC: Serge E. Hallyn <serge.hal...@ubuntu.com>
CC: Alexei Starovoitov <a...@kernel.org>
CC: Daniel Borkmann <dan...@iogearbox.net>
---
 net/core/filter.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 70995dd..5a4596b 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -845,8 +845,7 @@ int bpf_prog_store_orig_filter(struct bpf_prog *fp,
fkprog = fp->orig_prog;
fkprog->len = fprog->len;
 
-   fkprog->filter = kmemdup(fp->insns, fsize,
-GFP_KERNEL | __GFP_NOWARN);
+   fkprog->filter = memdup_user(fprog->filter, fsize);
if (!fkprog->filter) {
kfree(fp->orig_prog);
return -ENOMEM;
-- 
2.5.0

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 3/5] seccomp: add a ptrace command to get seccomp filter fds

2015-09-30 Thread Tycho Andersen

I just picked 40 for the constant out of thin air, but there may be a more
appropriate value for this. Also, we return EINVAL when there is no filter
for the index the user requested, but ptrace also returns EINVAL for
invalid commands, making it slightly awkward to test whether or not the
kernel supports this feature. It can still be done via,

if (is_in_mode_filter(pid)) {
int fd;

fd = ptrace(PTRACE_SECCOMP_GET_FILTER_FD, pid, NULL, 0);
if (fd < 0 && errno == -EINVAL)
/* not supported */

...
}

since being in SECCOMP_MODE_FILTER implies that there is at least one
filter. If there is a more appropriate errno (ESRCH collides as well with
ptrace) to give here that may be better.

Signed-off-by: Tycho Andersen <tycho.ander...@canonical.com>
CC: Kees Cook <keesc...@chromium.org>
CC: Will Drewry <w...@chromium.org>
CC: Oleg Nesterov <o...@redhat.com>
CC: Andy Lutomirski <l...@amacapital.net>
CC: Pavel Emelyanov <xe...@parallels.com>
CC: Serge E. Hallyn <serge.hal...@ubuntu.com>
CC: Alexei Starovoitov <a...@kernel.org>
CC: Daniel Borkmann <dan...@iogearbox.net>
---
 include/linux/seccomp.h |  9 +
 include/uapi/linux/ptrace.h |  2 ++
 kernel/ptrace.c |  4 
 kernel/seccomp.c| 28 
 4 files changed, 43 insertions(+)

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index f426503..637d91f 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -95,4 +95,13 @@ static inline void get_seccomp_filter(struct task_struct 
*tsk)
return;
 }
 #endif /* CONFIG_SECCOMP_FILTER */
+
+#if defined(CONFIG_CHECKPOINT_RESTORE) && defined(CONFIG_SECCOMP_FILTER)
+extern long seccomp_get_filter_fd(struct task_struct *task, long data);
+#else
+static inline long seccomp_get_filter_fd(struct task_struct *task, long data)
+{
+   return -EINVAL;
+}
+#endif /* CONFIG_CHECKPOINT_RESTORE && CONFIG_SECCOMP_FILTER */
 #endif /* _LINUX_SECCOMP_H */
diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
index a7a6979..3271f5a 100644
--- a/include/uapi/linux/ptrace.h
+++ b/include/uapi/linux/ptrace.h
@@ -23,6 +23,8 @@
 
 #define PTRACE_SYSCALL   24
 
+#define PTRACE_SECCOMP_GET_FILTER_FD 40
+
 /* 0x4200-0x4300 are reserved for architecture-independent additions.  */
 #define PTRACE_SETOPTIONS  0x4200
 #define PTRACE_GETEVENTMSG 0x4201
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 787320d..aede440 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -1016,6 +1016,10 @@ int ptrace_request(struct task_struct *child, long 
request,
break;
}
 #endif
+
+   case PTRACE_SECCOMP_GET_FILTER_FD:
+   return seccomp_get_filter_fd(child, data);
+
default:
break;
}
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 6f0465c..7275ce0 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1058,3 +1058,31 @@ long prctl_set_seccomp(unsigned long seccomp_mode, char 
__user *filter)
/* prctl interface doesn't have flags, so they are always zero. */
return do_seccomp(op, 0, uargs);
 }
+
+#if defined(CONFIG_CHECKPOINT_RESTORE) && defined(CONFIG_SECCOMP_FILTER)
+long seccomp_get_filter_fd(struct task_struct *task, long n)
+{
+   struct seccomp_filter *filter;
+   long fd;
+
+   if (task->seccomp.mode != SECCOMP_MODE_FILTER)
+   return -EINVAL;
+
+   filter = task->seccomp.filter;
+   while (n > 0 && filter) {
+   filter = filter->prev;
+   n--;
+   }
+
+   if (!filter)
+   return -EINVAL;
+
+   atomic_inc(>usage);
+   fd = anon_inode_getfd("seccomp", _fops, filter,
+ O_RDONLY | O_CLOEXEC);
+   if (fd < 0)
+   seccomp_filter_decref(filter);
+
+   return fd;
+}
+#endif
-- 
2.5.0

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 4/5] kcmp: add KCMP_FILE_PRIVATE_DATA

2015-09-30 Thread Tycho Andersen

On Wed, Sep 30, 2015 at 11:56:25AM -0700, Andy Lutomirski wrote:
> On Wed, Sep 30, 2015 at 11:55 AM, Tycho Andersen
> <tycho.ander...@canonical.com> wrote:
> > On Wed, Sep 30, 2015 at 11:47:05AM -0700, Andy Lutomirski wrote:
> >> On Wed, Sep 30, 2015 at 11:41 AM, Tycho Andersen
> >> <tycho.ander...@canonical.com> wrote:
> >> > On Wed, Sep 30, 2015 at 11:25:41AM -0700, Andy Lutomirski wrote:
> >> >> On Wed, Sep 30, 2015 at 11:13 AM, Tycho Andersen
> >> >> <tycho.ander...@canonical.com> wrote:
> >> >> > This command allows comparing the underling private data of two fds. 
> >> >> > This
> >> >> > is useful e.g. to find out if a seccomp filter is inherited, since 
> >> >> > struct
> >> >> > seccomp_filter are unique across tasks and are the private_data 
> >> >> > seccomp
> >> >> > fds.
> >> >>
> >> >> This is very implementation-specific and may have nasty ABI
> >> >> consequences far outside seccomp.  Let's do something specific to
> >> >> seccomp and/or eBPF.
> >> >
> >> > We could change the name to a less generic KCMP_SECCOMP_FD or
> >> > something, but without some sort of GUID on each struct
> >> > seccomp_filter, the implementation would be effectively the same as it
> >> > is today. Is that enough, or do we need a GUID?
> >> >
> >>
> >> I don't care about the GUID.  I think we should name it
> >> KCMP_SECCOMP_FD and make it only work on seccomp fds.
> >
> > Ok, I can do that.
> >
> >> Alternatively, we could figure out why KCMP_FILE doesn't do the trick
> >> and consider fixing it.  IMO it's really too bad that struct file is
> >> so heavyweight that we can't really just embed one in all kinds of
> >> structures.
> >
> > The problem is that KCMP_FILE compares the file objects themselves,
> > instead of the underlying data. If I ask for a seccomp fd for filter 0
> > twice, I'll have two different file objects and they won't be equal. I
> > suppose we could add some special logic inside KCMP_FILE to compare
> > the underlying data in special cases (seccomp, ebpf, others?), but it
> > seems cleaner to have a separate command as you described above.
> >
> 
> What I meant was that maybe we could get the two requests to actually
> produce the same struct file.  But that could get very messy
> memory-wise.

I see. The attached patch seems to work with KCMP_FILE and doesn't
look too bad if you don't mind the circular references. What do you
think?

Tycho
>From 5c410df2df219dc9a68074afe5458b5563b89940 Mon Sep 17 00:00:00 2001
From: Tycho Andersen <tycho.ander...@canonical.com>
Date: Wed, 30 Sep 2015 14:53:48 -0600
Subject: [PATCH] use exactly one seccomp file per filter object

Signed-off-by: Tycho Andersen <tycho.ander...@canonical.com>
---
 kernel/seccomp.c | 37 -
 1 file changed, 32 insertions(+), 5 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index af58c49..ff3b1bd 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -60,6 +60,13 @@ struct seccomp_filter {
 	atomic_t usage;
 	struct seccomp_filter *prev;
 	struct bpf_prog *prog;
+
+	/* The file representing this seccomp_filter, if there is one. A 1:1
+	 * file:seccomp_filter mapping allows us to compare seccomp_filters via
+	 * kcmp(KCMP_FILE, ...).
+	 */
+	struct file *seccomp_file;
+	struct mutex file_lock;
 };
 
 /* Limit any path through the tree to 256KB worth of instructions. */
@@ -395,6 +402,7 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
 	}
 
 	atomic_set(>usage, 1);
+	mutex_init(>file_lock);
 
 	return sfilter;
 }
@@ -821,7 +829,14 @@ out_free:
 
 int seccomp_fd_release(struct inode *ino, struct file *f)
 {
-	seccomp_filter_decref(f->private_data);
+	struct seccomp_filter *filter = f->private_data;
+
+	mutex_lock(>file_lock);
+	filter->seccomp_file = NULL;
+	mutex_unlock(>file_lock);
+
+	seccomp_filter_decref(filter);
+
 	return 0;
 }
 
@@ -1073,7 +1088,9 @@ long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
 long seccomp_get_filter_fd(struct task_struct *task, long n)
 {
 	struct seccomp_filter *filter;
+	struct file *file;
 	long fd;
+	int flags = O_RDONLY | O_CLOEXEC;
 
 	if (task->seccomp.mode != SECCOMP_MODE_FILTER)
 		return -EINVAL;
@@ -1087,11 +1104,21 @@ long seccomp_get_filter_fd(struct task_struct *task, long n)
 	if (!filter)
 		return -EINVAL;
 
-	atomic_inc(>usage);
-	fd = anon_inode_getfd("seccomp", _fops, filter,
-			  O_RDONLY | O_CLOEXEC);
+	fd = get_unused_fd_flags(flags);
 	if (fd < 0)
-		seccomp_filter_decref(filter);
+		return fd;
+
+	mutex_lock(>file_lock);
+	file = filter->seccomp_file;
+	if (!file) {
+		atomic_inc(>usage);
+		file = anon_inode_getfile("seccomp", _fops, filter,
+	  flags);
+		filter->seccomp_file = file;
+	}
+	mutex_unlock(>file_lock);
+
+	fd_install(fd, file);
 
 	return fd;
 }
-- 
2.5.0

Re: [PATCH v3 4/5] kcmp: add KCMP_FILE_PRIVATE_DATA

2015-09-30 Thread Tycho Andersen

On Wed, Sep 30, 2015 at 02:48:47PM -0700, Andy Lutomirski wrote:
> On Wed, Sep 30, 2015 at 2:39 PM, Tycho Andersen
> <tycho.ander...@canonical.com> wrote:
> > On Wed, Sep 30, 2015 at 11:56:25AM -0700, Andy Lutomirski wrote:
> >> On Wed, Sep 30, 2015 at 11:55 AM, Tycho Andersen
> >> <tycho.ander...@canonical.com> wrote:
> >> > On Wed, Sep 30, 2015 at 11:47:05AM -0700, Andy Lutomirski wrote:
> >> >> On Wed, Sep 30, 2015 at 11:41 AM, Tycho Andersen
> >> >> <tycho.ander...@canonical.com> wrote:
> >> >> > On Wed, Sep 30, 2015 at 11:25:41AM -0700, Andy Lutomirski wrote:
> >> >> >> On Wed, Sep 30, 2015 at 11:13 AM, Tycho Andersen
> >> >> >> <tycho.ander...@canonical.com> wrote:
> >> >> >> > This command allows comparing the underling private data of two 
> >> >> >> > fds. This
> >> >> >> > is useful e.g. to find out if a seccomp filter is inherited, since 
> >> >> >> > struct
> >> >> >> > seccomp_filter are unique across tasks and are the private_data 
> >> >> >> > seccomp
> >> >> >> > fds.
> >> >> >>
> >> >> >> This is very implementation-specific and may have nasty ABI
> >> >> >> consequences far outside seccomp.  Let's do something specific to
> >> >> >> seccomp and/or eBPF.
> >> >> >
> >> >> > We could change the name to a less generic KCMP_SECCOMP_FD or
> >> >> > something, but without some sort of GUID on each struct
> >> >> > seccomp_filter, the implementation would be effectively the same as it
> >> >> > is today. Is that enough, or do we need a GUID?
> >> >> >
> >> >>
> >> >> I don't care about the GUID.  I think we should name it
> >> >> KCMP_SECCOMP_FD and make it only work on seccomp fds.
> >> >
> >> > Ok, I can do that.
> >> >
> >> >> Alternatively, we could figure out why KCMP_FILE doesn't do the trick
> >> >> and consider fixing it.  IMO it's really too bad that struct file is
> >> >> so heavyweight that we can't really just embed one in all kinds of
> >> >> structures.
> >> >
> >> > The problem is that KCMP_FILE compares the file objects themselves,
> >> > instead of the underlying data. If I ask for a seccomp fd for filter 0
> >> > twice, I'll have two different file objects and they won't be equal. I
> >> > suppose we could add some special logic inside KCMP_FILE to compare
> >> > the underlying data in special cases (seccomp, ebpf, others?), but it
> >> > seems cleaner to have a separate command as you described above.
> >> >
> >>
> >> What I meant was that maybe we could get the two requests to actually
> >> produce the same struct file.  But that could get very messy
> >> memory-wise.
> >
> > I see. The attached patch seems to work with KCMP_FILE and doesn't
> > look too bad if you don't mind the circular references. What do you
> > think?
> >
> 
> Could be reasonable.  I'm not that well versed on what fd_release
> does.  Are we guaranteed that it's called when the last fd goes away
> even if the struct file is pinned by the struct seccomp filter but is
> otherwise unreferenced?

I think so; my understanding is that the vfs doesn't know anything
about the seccomp_filter reference, so when the last fd referring to
the struct file is closed that's when release is called.

> Kees?
> 
> How many of you will be at KS?

I will not.

Tycho
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: v2 of seccomp filter c/r patches

2015-09-15 Thread Tycho Andersen

Hi Andy,

On Tue, Sep 15, 2015 at 11:13:51AM -0700, Andy Lutomirski wrote:
> On Tue, Sep 15, 2015 at 9:07 AM, Tycho Andersen
> <tycho.ander...@canonical.com> wrote:
> > Hi Andy,
> >
> > On Mon, Sep 14, 2015 at 10:52:46AM -0700, Andy Lutomirski wrote:
> >>
> >> I'm not sure I entirely like this solution...
> >
> > Ok. Since we also aren't going to do all the eBPF stuff now, how about
> > something that looks like this:
> >
> > struct seccomp_layer {
> >   unsigned int size;
> >   unsigned int type; /* SECCOMP_BPF_CLASSIC or SECCOMP_EBPF or ... */
> >   bool inherited;
> >   union {
> > unsigned int insn_cnt;
> > struct bpf_insn *insns;
> >   };
> > };
> >
> > with a ptrace command:
> >
> > ptrace(PTRACE_SECCOMP_DUMP_LAYER, pid, i, );
> >
> > If we save a pointer to the current seccomp filter on fork (if there
> > is one), then I think the inherited flag is just,
> >
> > inherited = is_ancestor(child->seccomp.filter, 
> > child->seccomp.inherited_filter)
> >
> 
> I'm lost.  What is the inherited flag for?

We need some way to expose the seccomp hierarchy, specifically which
filters are inherited, so that we can correctly restore the filter
tree for tasks that may use TSYNC in the future. You've mentioned that
you don't like kcmp, so this is an alternative to that.

Tycho
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: v2 of seccomp filter c/r patches

2015-09-15 Thread Tycho Andersen

Hi Andy,

On Mon, Sep 14, 2015 at 10:52:46AM -0700, Andy Lutomirski wrote:
>
> I'm not sure I entirely like this solution...

Ok. Since we also aren't going to do all the eBPF stuff now, how about
something that looks like this:

struct seccomp_layer {
  unsigned int size;
  unsigned int type; /* SECCOMP_BPF_CLASSIC or SECCOMP_EBPF or ... */
  bool inherited;
  union {
unsigned int insn_cnt;
struct bpf_insn *insns;
  };
};

with a ptrace command:

ptrace(PTRACE_SECCOMP_DUMP_LAYER, pid, i, );

If we save a pointer to the current seccomp filter on fork (if there
is one), then I think the inherited flag is just,

inherited = is_ancestor(child->seccomp.filter, child->seccomp.inherited_filter)

In order to restore this (so it can be checkpointed again), we need a
command that looks like:

seccomp(SECCOMP_INHERIT_FILTER);

which sets the current and inherited filter to that of the parent
process. (Optionally we could have seccomp(SECCOMP_INHERIT_FILTER, i)
to inherit the ith filter from the parent, but we can coordinate this
via userpace so it's not strictly necessary.) So the whole c/r process
looks something like:

--- dump ---

for (i = 0; true; i++) {
  ret = ptrace(PTRACE_SECCOMP_DUMP_FILTER, pid, i, );
  if (ret == -ESRCH)
break;
  if (ret < 0)
/* real error */

  /* save the filter if it's not inherited, if it is, mark the filter
   * to be inherited from ppid; note that this index is walking up the
   * tree following filter->prev, and the index we want to reason
   * about on restore is walking down, so we should reverse the whole
   * array.
   */
}

--- restore ---

if (have_inherited_filters) {
  wait_for_ppid_seccomp_restore(n_inherited);
  seccomp(SECCOMP_INHERIT_FILTER);
  signal_done_inheriting();
}

for (i = 0; i < n_filters; i++) {
  seccomp(SECCOMP_SET_MODE_FILTER, ...);
  if (child_inherited_filter(i))
signal_children_filter(i);
}

I played around with an implementation of SECCOMP_INHERIT_FILTER last
night and I think I have one that might work. Thoughts?

Tycho
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: v2 of seccomp filter c/r patches

2015-09-15 Thread Tycho Andersen

Hi Andy,

On Tue, Sep 15, 2015 at 01:01:23PM -0700, Andy Lutomirski wrote:
> On Tue, Sep 15, 2015 at 11:26 AM, Tycho Andersen
> <tycho.ander...@canonical.com> wrote:
> > Hi Andy,
> >
> > On Tue, Sep 15, 2015 at 11:13:51AM -0700, Andy Lutomirski wrote:
> >> On Tue, Sep 15, 2015 at 9:07 AM, Tycho Andersen
> >> <tycho.ander...@canonical.com> wrote:
> >> > Hi Andy,
> >> >
> >> > On Mon, Sep 14, 2015 at 10:52:46AM -0700, Andy Lutomirski wrote:
> >> >>
> >> >> I'm not sure I entirely like this solution...
> >> >
> >> > Ok. Since we also aren't going to do all the eBPF stuff now, how about
> >> > something that looks like this:
> >> >
> >> > struct seccomp_layer {
> >> >   unsigned int size;
> >> >   unsigned int type; /* SECCOMP_BPF_CLASSIC or SECCOMP_EBPF or ... */
> >> >   bool inherited;
> >> >   union {
> >> > unsigned int insn_cnt;
> >> > struct bpf_insn *insns;
> >> >   };
> >> > };
> >> >
> >> > with a ptrace command:
> >> >
> >> > ptrace(PTRACE_SECCOMP_DUMP_LAYER, pid, i, );
> >> >
> >> > If we save a pointer to the current seccomp filter on fork (if there
> >> > is one), then I think the inherited flag is just,
> >> >
> >> > inherited = is_ancestor(child->seccomp.filter, 
> >> > child->seccomp.inherited_filter)
> >> >
> >>
> >> I'm lost.  What is the inherited flag for?
> >
> > We need some way to expose the seccomp hierarchy, specifically which
> > filters are inherited, so that we can correctly restore the filter
> > tree for tasks that may use TSYNC in the future. You've mentioned that
> > you don't like kcmp, so this is an alternative to that.
> >
> 
> My only objection to kcmp is that IMO it's a suboptimal interface and
> could be better.  I have no problem with the general principle of
> asking to compare two objects.

Ok, in that case I think we can get rid of all the inherited stuff,
and use kcmp to figure it out.

> The thing I really don't have a good handle on is whether the seccomp
> filter hierarchy should look more like A:
> 
> struct seccomp_filter {
> ...;
> struct seccomp_filter *prev;
> };
> 
> with the seccomp_filter being the user-visible object
> 
> Or B:
> 
> struct seccomp_layer {
>...;  /* BPF program, etc. */
> }
> 
> struct seccomp_filter {
>struct seccomp_layer *layer;
>struct seccomp_filter *prev;
> };  /* or equivalent */
> 
> with seccomp_layer being the user-visible object.
> 
> A is simpler to implement in a memory-efficient way, but it's less
> flexible.  I haven't come up with a compelling use case for B where A
> doesn't work, with the caveat that, if an fd points to a
> seccomp_filter in model A, you can't attach it unless your current
> state matches its "prev" state (or an ancestor thereof), which might
> be a little bit awkward.

Perhaps, although I don't think it would be an issue for c/r.

> Am I making more sense now?

Yes, thanks for the clarifications. I guess personally I'd probably
choose option A. If this (using kcmp and one of A/B) sounds good to
you, I'll start working on a set to do c/r that way.

Tycho
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2 2/5] seccomp: make underlying bpf ref counted as well

2015-09-14 Thread Tycho Andersen

Hi Daniel,

On Fri, Sep 11, 2015 at 08:28:19PM +0200, Daniel Borkmann wrote:
> I think due to the given insns restrictions on classic seccomp, this
> could work for "most cases" (see below) for the time being until pointer
> sanitation is resolved and that seccomp-only restriction from the dump
> could be removed,

Ok, thanks.

> BUT there's one more stone in the road which you still
> need to take care of with this whole 'giving classic seccomp-BPF -> eBPF
> transforms an fd, dumping and restoring that via bpf(2)' approach:
> 
> If you have JIT enabled on ARM32, and add a classic seccomp-BPF filter,
> and dump that via your bpf(2) interface based on the current patches, what
> you'll get is not eBPF opcodes but classic (!) BPF opcodes as ARM32 classic
> JIT supports compilation of seccomp, since commit 24e737c1ebac ("ARM: net:
> add JIT support for loads from struct seccomp_data.").
> 
> So in that case, bpf_prepare_filter() will not call into bpf_migrate_filter()
> as there's simply no need for it, because the classic code could already
> be JITed there. I guess other archs where JIT support for eBPF in not yet
> within near sight might sooner or later support this insn for their classic
> JITs, too ...

Thanks for pointing this out.

What if we legislate that the output of bpf(BPF_PROG_DUMP, ...) is
always eBPF? As near as I can tell there is no way to determine if a
struct bpf_prog is classic or eBPF, so we'd need to add a bit to
indicate whether or not the prog has been converted so that
BPF_PROG_DUMP knows when to convert it.

Tycho
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2 2/5] seccomp: make underlying bpf ref counted as well

2015-09-14 Thread Tycho Andersen

On Mon, Sep 14, 2015 at 06:48:43PM +0200, Daniel Borkmann wrote:
> On 09/14/2015 06:00 PM, Tycho Andersen wrote:
> >On Fri, Sep 11, 2015 at 08:28:19PM +0200, Daniel Borkmann wrote:
> >>I think due to the given insns restrictions on classic seccomp, this
> >>could work for "most cases" (see below) for the time being until pointer
> >>sanitation is resolved and that seccomp-only restriction from the dump
> >>could be removed,
> >
> >Ok, thanks.
> >
> >>BUT there's one more stone in the road which you still
> >>need to take care of with this whole 'giving classic seccomp-BPF -> eBPF
> >>transforms an fd, dumping and restoring that via bpf(2)' approach:
> >>
> >>If you have JIT enabled on ARM32, and add a classic seccomp-BPF filter,
> >>and dump that via your bpf(2) interface based on the current patches, what
> >>you'll get is not eBPF opcodes but classic (!) BPF opcodes as ARM32 classic
> >>JIT supports compilation of seccomp, since commit 24e737c1ebac ("ARM: net:
> >>add JIT support for loads from struct seccomp_data.").
> >>
> >>So in that case, bpf_prepare_filter() will not call into 
> >>bpf_migrate_filter()
> >>as there's simply no need for it, because the classic code could already
> >>be JITed there. I guess other archs where JIT support for eBPF in not yet
> >>within near sight might sooner or later support this insn for their classic
> >>JITs, too ...
> >
> >Thanks for pointing this out.
> >
> >What if we legislate that the output of bpf(BPF_PROG_DUMP, ...) is
> >always eBPF? As near as I can tell there is no way to determine if a
> >struct bpf_prog is classic or eBPF, so we'd need to add a bit to
> >indicate whether or not the prog has been converted so that
> >BPF_PROG_DUMP knows when to convert it.
> 
> As I said, you have bpf_prog_was_classic() function to determine exactly
> this (so without your type re-assignment you have a way to distinguish it).

I don't think this is the same thing, though. IIUC, when the classic
jit succeeds, bpf_prog_was_classic() will still return true even
though prog->insnsi points to classic instructions instead of eBPF
ones, and (I think) this situation is impossible to distinguish.
Anyway, it sounds like this doesn't matter, as we have...

> Wouldn't it be much easier to rip this set apart into multiple ones, solving
> one individual thing at a time, f.e. starting out simple and 1) only add
> native eBPF support to seccomp, after that 2) add a method to dump native-only
> eBPF programs for criu, then 3) think about a right interface for classic
> BPF seccomp dumping, etc, etc? Currently, it tries to solve everything at
> once, and with some early assumptions that have non-trivial side-effects.

The primary motivation for this set is your bullet 3, c/r of programs
with classic bpf programs (i.e. what seccomp supports now). Initially,
I thought it was best to try and dump the eBPFs directly, but it seems
there are a lot of complications I wasn't aware of. Perhaps I'll look
at a bpf_prog_store_orig_filter() style approach.

Thanks,

Tycho
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2 2/5] seccomp: make underlying bpf ref counted as well

2015-09-11 Thread Tycho Andersen

On Fri, Sep 11, 2015 at 06:03:59PM +0200, Daniel Borkmann wrote:
> On 09/11/2015 04:44 PM, Tycho Andersen wrote:
> >On Fri, Sep 11, 2015 at 03:02:36PM +0200, Daniel Borkmann wrote:
> >>On 09/11/2015 02:20 AM, Tycho Andersen wrote:
> >>>In the next patch, we're going to add a way to access the underlying
> >>>filters via bpf fds. This means that we need to ref-count both the
> >>>struct seccomp_filter objects and the struct bpf_prog objects separately,
> >>>in case a process dies but a filter is still referred to by another
> >>>process.
> >>>
> >>>Additionally, we mark classic converted seccomp filters as seccomp eBPF
> >>>programs, since they are a subset of what is supported in seccomp eBPF.
> >>>
> >>>Signed-off-by: Tycho Andersen <tycho.ander...@canonical.com>
> >>>CC: Kees Cook <keesc...@chromium.org>
> >>>CC: Will Drewry <w...@chromium.org>
> >>>CC: Oleg Nesterov <o...@redhat.com>
> >>>CC: Andy Lutomirski <l...@amacapital.net>
> >>>CC: Pavel Emelyanov <xe...@parallels.com>
> >>>CC: Serge E. Hallyn <serge.hal...@ubuntu.com>
> >>>CC: Alexei Starovoitov <a...@kernel.org>
> >>>CC: Daniel Borkmann <dan...@iogearbox.net>
> >>>---
> >>>  kernel/seccomp.c | 4 +++-
> >>>  1 file changed, 3 insertions(+), 1 deletion(-)
> >>>
> >>>diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> >>>index 245df6b..afaeddf 100644
> >>>--- a/kernel/seccomp.c
> >>>+++ b/kernel/seccomp.c
> >>>@@ -378,6 +378,8 @@ static struct seccomp_filter 
> >>>*seccomp_prepare_filter(struct sock_fprog *fprog)
> >>>   }
> >>>
> >>>   atomic_set(>usage, 1);
> >>>+  atomic_set(>prog->aux->refcnt, 1);
> >>>+  sfilter->prog->type = BPF_PROG_TYPE_SECCOMP;
> >>
> >>So, if you do this, then this breaks the assumption of eBPF JITs
> >>that, currently, all classic converted BPF programs always have a
> >>prog->type of BPF_PROG_TYPE_UNSPEC (see: bpf_prog_was_classic()).
> >>
> >>Currently, JITs make use of this information to determine whether
> >>A and X mappings for such programs should or should not be cleared
> >>in the prologue (s390 currently).
> >>
> >>In the seccomp_prepare_filter() stage, we're already past that, so
> >>it will not cause an issue, but we certainly would need to be very
> >>careful in future, if bpf_prog_was_classic() is then used at a later
> >>stage when we already have a generated bpf_prog somewhere, as then
> >>this assumption will break.
> >
> >The only reason we need to do this is to allow BPF_DUMP_PROG to work,
> >since we were restricting it to only allow dumping of seccomp
> >programs, since those don't have maps. Instead, perhaps we could allow
> >dumping of BPF_PROG_TYPE_SECCOMP and BPF_PROG_TYPE_UNSPEC?
> 
> There are possibilities that BPF_PROG_TYPE_UNSPEC is calling helpers
> already today, at least in networking case, not seccomp. So, since
> you want to export [classic -> eBPF] only for seccomp, put fds on them
> and dump these via bpf(2), you could allow that (with a big comment
> stating why it's safe), but mid-term we really need to sanitize all
> this stuff properly as this is needed for other types, too.

Sorry, just to be clear, you're suggesting that the patch is ok modulo
a comment describing the jit issues?

Tycho
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: v2 of seccomp filter c/r patches

2015-09-11 Thread Tycho Andersen

On Fri, Sep 11, 2015 at 10:00:22AM -0700, Andy Lutomirski wrote:
> On Fri, Sep 11, 2015 at 9:30 AM, Andy Lutomirski <l...@amacapital.net> wrote:
> > On Sep 10, 2015 5:22 PM, "Tycho Andersen" <tycho.ander...@canonical.com> 
> > wrote:
> >>
> >> Hi all,
> >>
> >> Here is v2 of the seccomp filter c/r set. The patch notes have individual
> >> changes from the last series, but there are two points not noted:
> >>
> >> * The series still does not allow us to correctly restore state for 
> >> programs
> >>   that will use SECCOMP_FILTER_FLAG_TSYNC in the future. Given that we 
> >> want to
> >>   keep seccomp_filter's identity, I think something along the lines of 
> >> another
> >>   seccomp command like SECCOMP_INHERIT_PARENT is needed (although I'm not 
> >> sure
> >>   if this can even be done yet). In addition, we'll need a kcmp command for
> >>   figuring out if filters are the same, although this too needs to compare
> >>   seccomp_filter objects, so it's a little screwy. Any thoughts on how to 
> >> do
> >>   this nicely are welcome.
> >
> > Let's add a concept of a seccompfd.
> >
> > For background of what I want to add: I want to be able to create a
> > seccomp monitor.  A seccomp monitor will be, logically, a pair of a
> > struct file that represents the monitor and a seccomp_filter that is
> > controlled by the monitor.  Depending on flags, whoever holds the
> > monitor fd could change the active filter, intercept syscalls, and
> > issue syscalls on behalf of a process that is trapped in an
> > intercepted syscall.
> >
> > Seccomp filters would nest properly.
> >
> > The interface would probably be (extremely pseudocoded):
> >
> > monitor_fd, filter_fd = seccomp(CREATE_MONITOR, flags, ...);
> >
> > Then, later:
> >
> > seccomp(ATTACH_TO_FILTER, filter_fd);  /* now filtered */
> >
> > read(monitor_fd, buf, size); /* returns an intercepted syscall */
> > write(monitor_fd, buf, size); /* issues a syscall or releases the
> > trapped task */
> >
> > This can't be implemented on x86 without either going insane or
> > finishing the massive set of pending cleanups to the x86 entry code.
> > I favor the latter.
> >
> > We could, however, add part of it right now: we could have a way to
> > create a filterfd, we could add kcmp support for it, and we could add
> > the ATTACH_TO_FILTER thing.  I think that would solve your problem.
> >
> > One major open question: does a filter_fd know what its parent is and,
> > if so, will it just refuse to attach if the caller's parent is wrong?
> > Or will a filter_fd attach anywhere.
> >
> 
> Let me add one more thought:
> 
> Currently, struct seccomp_filter encodes a strict tree hierarchy: it
> knows what its parent is.  This only matters as an implementation
> detail and because TSYNC checks for seccomp_filter equality.
> 
> We could change this without user-visible effects.  We could say that,
> for TSYNC purposes, two filter states match if they contain exactly
> the same layers in the same order where a layer does *not* encode a
> concept of parent.  We could then say that attaching a classic bpf
> filter creates a branch new layer that is not equal to any other layer
> that's been created.
> 
> This has no effect whatsoever.  The difference would be that we could
> declare that attaching the same ebpf program twice creates the *same*
> layer so that, if you fork and both children attach the same ebpf
> program, then they match for TSYNC purposes.

Would you keep struct seccomp_filter identity here (meaning that you'd
reach over and grab the seccomp_filter from a sibling thread if it
existed)? Would it only work for the last filter attached to siblings,
or for all the filters? This does make my life easier, but I like the
idea of just using seccompfd directly below as it seems somewhat
easier (for me at least) to understand,

> Similarly, attaching the
> same hypothetical filterfd would create the same layer.

If we change the api of my current set to have the ptrace commands
iterate over seccomp fds, it looks something like:

seccompfd = ptrace(GET_FILTER_FD, pid);
while (ptrace(NEXT_FD, pid, seccompfd) == 0) {
if (seccomp(CHECK_INHERITED, seccompfd))
break;

bpffd = seccomp(GET_BPF_FD, seccompfd);
err = buf(BPF_PROG_DUMP, bpffd, );
/* save the bpf prog */
}

then restore can look like:

while (have_noninherited_filters()) {
filter = load_filter();
bpffd = bpf(BPF_PROG_LOAD, filter);
seccompfd = seccomp(SECCOMP_FD_CR

Re: [PATCH v2 4/5] seccomp: add a way to access filters via bpf fds

2015-09-11 Thread Tycho Andersen

On Fri, Sep 11, 2015 at 09:20:55AM -0700, Andy Lutomirski wrote:
> On Sep 10, 2015 5:22 PM, "Tycho Andersen" <tycho.ander...@canonical.com> 
> wrote:
> >
> > This patch adds a way for a process that is "real root" to access the
> > seccomp filters of another process. The process first does a
> > PTRACE_SECCOMP_GET_FILTER_FD to get an fd with that process' seccomp filter
> > attached, and then iterates on this with PTRACE_SECCOMP_NEXT_FILTER using
> > bpf(BPF_PROG_DUMP) to dump the actual program at each step.
> >
> 
> > +
> > +   fd = bpf_new_fd(filter->prog, O_RDONLY);
> > +   if (fd > 0)
> > +   atomic_inc(>prog->aux->refcnt);
> 
> Why isn't this folded into bpf_new_fd?

No reason it can't be as far as I can see. I'll make the change for
the next version.

> > +
> > +   return fd;
> > +}
> > +
> > +long seccomp_next_filter(struct task_struct *child, u32 fd)
> > +{
> > +   struct seccomp_filter *cur;
> > +   struct bpf_prog *prog;
> > +   long ret = -ESRCH;
> > +
> > +   if (!capable(CAP_SYS_ADMIN))
> > +   return -EACCES;
> > +
> > +   if (child->seccomp.mode != SECCOMP_MODE_FILTER)
> > +   return -EINVAL;
> > +
> > +   prog = bpf_prog_get(fd);
> > +   if (IS_ERR(prog)) {
> > +   ret = PTR_ERR(prog);
> > +   goto out;
> > +   }
> > +
> > +   for (cur = child->seccomp.filter; cur; cur = cur->prev) {
> > +   if (cur->prog == prog) {
> > +   if (!cur->prev)
> > +   ret = -ENOENT;
> > +   else
> > +   ret = bpf_prog_set(fd, cur->prev->prog);
> 
> This lets you take an fd pointing to one prog and point it elsewhere.
> I'm not sure that's a good idea.

That's how the interface was designed (calling ptrace(NEXT_FILTER, fd) and
then doing bpf(DUMP, fd)). I suppose we could have NEXT_FILTER return
a new fd instead if that seems better to you.

Tycho
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2 5/5] seccomp: add a way to attach a filter via eBPF fd

2015-09-11 Thread Tycho Andersen

On Fri, Sep 11, 2015 at 02:37:59PM +0200, Daniel Borkmann wrote:
> On 09/11/2015 02:21 AM, Tycho Andersen wrote:
> >This is the final bit needed to support seccomp filters created via the bpf
> >syscall. The patch adds a new seccomp operation SECCOMP_MODE_FILTER_EBPF,
> >which takes exactly one command (presumably to be expanded upon later when
> >seccomp EBPFs support more interesting things) and an argument struct
> >similar to that of bpf(), although the size is explicit in the struct to
> >avoid changing the signature of seccomp().
> >
> >v2: Don't abuse seccomp's third argument; use a separate command and a
> > pointer to a structure instead.
> 
> Comments below ...
> 
> >Signed-off-by: Tycho Andersen <tycho.ander...@canonical.com>
> >CC: Kees Cook <keesc...@chromium.org>
> >CC: Will Drewry <w...@chromium.org>
> >CC: Oleg Nesterov <o...@redhat.com>
> >CC: Andy Lutomirski <l...@amacapital.net>
> >CC: Pavel Emelyanov <xe...@parallels.com>
> >CC: Serge E. Hallyn <serge.hal...@ubuntu.com>
> >CC: Alexei Starovoitov <a...@kernel.org>
> >CC: Daniel Borkmann <dan...@iogearbox.net>
> >---
> >  include/uapi/linux/seccomp.h |  16 +
> >  kernel/seccomp.c | 135 
> > ++-
> >  2 files changed, 138 insertions(+), 13 deletions(-)
> >
> >diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> >index 0f238a4..a8694e2 100644
> >--- a/include/uapi/linux/seccomp.h
> >+++ b/include/uapi/linux/seccomp.h
> >@@ -13,10 +13,14 @@
> >  /* Valid operations for seccomp syscall. */
> >  #define SECCOMP_SET_MODE_STRICT0
> >  #define SECCOMP_SET_MODE_FILTER1
> >+#define SECCOMP_MODE_FILTER_EBPF2
> 
> Should this be SECCOMP_SET_MODE_FILTER_EBPF or just SECCOMP_SET_MODE_EBPF?

I just stole the name Kees gave it in the previous thread, but I think
that perhaps there are other plans for manipulating seccomp ebpfs (?).
The command is SECCOMP_EBPF_ADD_FD, so it seems like we could add a
command like SECCOMP_EBPF_SOMETHING in the future.

> >  /* Valid flags for SECCOMP_SET_MODE_FILTER */
> >  #define SECCOMP_FILTER_FLAG_TSYNC  1
> >
> >+/* Valid cmds for SECCOMP_MODE_FILTER_EBPF */
> >+#define SECCOMP_EBPF_ADD_FD 0
> >+
> >  /*
> >   * All BPF programs must return a 32-bit value.
> >   * The bottom 16-bits are for optional return data.
> >@@ -51,4 +55,16 @@ struct seccomp_data {
> > __u64 args[6];
> >  };
> >
> >+struct seccomp_ebpf {
> >+unsigned int size;
> >+
> >+union {
> >+/* SECCOMP_EBPF_ADD_FD */
> >+struct {
> >+unsigned intadd_flags;
> >+__u32   add_fd;
> >+};
> >+};
> >+};
> >+
> >  #endif /* _UAPI_LINUX_SECCOMP_H */
> >diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> >index 1856f69..e78175a 100644
> >--- a/kernel/seccomp.c
> >+++ b/kernel/seccomp.c
> >@@ -65,6 +65,9 @@ struct seccomp_filter {
> >  /* Limit any path through the tree to 256KB worth of instructions. */
> >  #define MAX_INSNS_PER_PATH ((1 << 18) / sizeof(struct sock_filter))
> >
> >+static long seccomp_install_filter(unsigned int flags,
> >+   struct seccomp_filter *prepared);
> >+
> >  /*
> >   * Endianness is explicitly ignored and left for BPF program authors to 
> > manage
> >   * as per the specific architecture.
> >@@ -356,17 +359,6 @@ static struct seccomp_filter 
> >*seccomp_prepare_filter(struct sock_fprog *fprog)
> >
> > BUG_ON(INT_MAX / fprog->len < sizeof(struct sock_filter));
> >
> >-/*
> >- * Installing a seccomp filter requires that the task has
> >- * CAP_SYS_ADMIN in its namespace or be running with no_new_privs.
> >- * This avoids scenarios where unprivileged tasks can affect the
> >- * behavior of privileged children.
> >- */
> >-if (!task_no_new_privs(current) &&
> >-security_capable_noaudit(current_cred(), current_user_ns(),
> >- CAP_SYS_ADMIN) != 0)
> >-return ERR_PTR(-EACCES);
> >-
> > /* Allocate a new seccomp_filter */
> > sfilter = kzalloc(sizeof(*sfilter), GFP_KERNEL | __GFP_NOWARN);
> > if (!sfilter)
> >@@ -510,8 +502,105 @@ static void seccomp_send_sigsys(int syscall, int 
> >reason)
> > info.si_syscall = syscall;
> > force_sig_info(SIGSYS, , curre

Re: [PATCH v2 2/5] seccomp: make underlying bpf ref counted as well

2015-09-11 Thread Tycho Andersen

On Fri, Sep 11, 2015 at 03:02:36PM +0200, Daniel Borkmann wrote:
> On 09/11/2015 02:20 AM, Tycho Andersen wrote:
> >In the next patch, we're going to add a way to access the underlying
> >filters via bpf fds. This means that we need to ref-count both the
> >struct seccomp_filter objects and the struct bpf_prog objects separately,
> >in case a process dies but a filter is still referred to by another
> >process.
> >
> >Additionally, we mark classic converted seccomp filters as seccomp eBPF
> >programs, since they are a subset of what is supported in seccomp eBPF.
> >
> >Signed-off-by: Tycho Andersen <tycho.ander...@canonical.com>
> >CC: Kees Cook <keesc...@chromium.org>
> >CC: Will Drewry <w...@chromium.org>
> >CC: Oleg Nesterov <o...@redhat.com>
> >CC: Andy Lutomirski <l...@amacapital.net>
> >CC: Pavel Emelyanov <xe...@parallels.com>
> >CC: Serge E. Hallyn <serge.hal...@ubuntu.com>
> >CC: Alexei Starovoitov <a...@kernel.org>
> >CC: Daniel Borkmann <dan...@iogearbox.net>
> >---
> >  kernel/seccomp.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> >diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> >index 245df6b..afaeddf 100644
> >--- a/kernel/seccomp.c
> >+++ b/kernel/seccomp.c
> >@@ -378,6 +378,8 @@ static struct seccomp_filter 
> >*seccomp_prepare_filter(struct sock_fprog *fprog)
> > }
> >
> > atomic_set(>usage, 1);
> >+atomic_set(>prog->aux->refcnt, 1);
> >+sfilter->prog->type = BPF_PROG_TYPE_SECCOMP;
> 
> So, if you do this, then this breaks the assumption of eBPF JITs
> that, currently, all classic converted BPF programs always have a
> prog->type of BPF_PROG_TYPE_UNSPEC (see: bpf_prog_was_classic()).
> 
> Currently, JITs make use of this information to determine whether
> A and X mappings for such programs should or should not be cleared
> in the prologue (s390 currently).
> 
> In the seccomp_prepare_filter() stage, we're already past that, so
> it will not cause an issue, but we certainly would need to be very
> careful in future, if bpf_prog_was_classic() is then used at a later
> stage when we already have a generated bpf_prog somewhere, as then
> this assumption will break.

The only reason we need to do this is to allow BPF_DUMP_PROG to work,
since we were restricting it to only allow dumping of seccomp
programs, since those don't have maps. Instead, perhaps we could allow
dumping of BPF_PROG_TYPE_SECCOMP and BPF_PROG_TYPE_UNSPEC?

Tycho
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2 3/5] ebpf: add a way to dump an eBPF program

2015-09-11 Thread Tycho Andersen

On Fri, Sep 11, 2015 at 03:39:14PM +0200, Daniel Borkmann wrote:
> On 09/11/2015 02:21 AM, Tycho Andersen wrote:
> >This commit adds a way to dump eBPF programs. The initial implementation
> >doesn't support maps, and therefore only allows dumping seccomp ebpf
> >programs which themselves don't currently support maps.
> >
> >v2: don't export a prog_id for the filter
> >
> >Signed-off-by: Tycho Andersen <tycho.ander...@canonical.com>
> >CC: Kees Cook <keesc...@chromium.org>
> >CC: Will Drewry <w...@chromium.org>
> >CC: Oleg Nesterov <o...@redhat.com>
> >CC: Andy Lutomirski <l...@amacapital.net>
> >CC: Pavel Emelyanov <xe...@parallels.com>
> >CC: Serge E. Hallyn <serge.hal...@ubuntu.com>
> >CC: Alexei Starovoitov <a...@kernel.org>
> >CC: Daniel Borkmann <dan...@iogearbox.net>
> [...]
> >diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> >index dc9b464..58ae9f4 100644
> >--- a/kernel/bpf/syscall.c
> >+++ b/kernel/bpf/syscall.c
> >@@ -586,6 +586,44 @@ free_prog:
> > return err;
> >  }
> >
> >+static int bpf_prog_dump(union bpf_attr *attr, union bpf_attr __user *uattr)
> >+{
> >+int ufd = attr->prog_fd;
> >+struct fd f = fdget(ufd);
> >+struct bpf_prog *prog;
> >+int ret = -EINVAL;
> >+
> >+prog = get_prog(f);
> >+if (IS_ERR(prog))
> >+return PTR_ERR(prog);
> >+
> >+/* For now, let's refuse to dump anything that isn't a seccomp program.
> >+ * Other program types have support for maps, which our current dump
> >+ * code doesn't support.
> >+ */
> >+if (prog->type != BPF_PROG_TYPE_SECCOMP)
> >+goto out;
> 
> Yep, also when you start adding helper calls (next to map objects) you'd
> need to undo kernel pointers that the verifier sets here.

Good point, I'll add that to the comment as well.

Tycho
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2 4/5] seccomp: add a way to access filters via bpf fds

2015-09-11 Thread Tycho Andersen

Hi Michael,

On Fri, Sep 11, 2015 at 02:08:50PM +0200, Michael Kerrisk (man-pages) wrote:
> HI Tycho
> 
> On 11 September 2015 at 02:21, Tycho Andersen
> <tycho.ander...@canonical.com> wrote:
> > This patch adds a way for a process that is "real root" to access the
> > seccomp filters of another process. The process first does a
> > PTRACE_SECCOMP_GET_FILTER_FD to get an fd with that process' seccomp filter
> > attached, and then iterates on this with PTRACE_SECCOMP_NEXT_FILTER using
> > bpf(BPF_PROG_DUMP) to dump the actual program at each step.
> 
> Do you have a man- page patch for this change?

Not yet (r.e. all the man page reqs), I can draft them asap, though.
Hopefully the API is mostly stable at this point :).

Thanks,

Tycho

> Cheers,
> 
> Michael
> 
> > Signed-off-by: Tycho Andersen <tycho.ander...@canonical.com>
> > CC: Kees Cook <keesc...@chromium.org>
> > CC: Will Drewry <w...@chromium.org>
> > CC: Oleg Nesterov <o...@redhat.com>
> > CC: Andy Lutomirski <l...@amacapital.net>
> > CC: Pavel Emelyanov <xe...@parallels.com>
> > CC: Serge E. Hallyn <serge.hal...@ubuntu.com>
> > CC: Alexei Starovoitov <a...@kernel.org>
> > CC: Daniel Borkmann <dan...@iogearbox.net>
> > ---
> >  include/linux/bpf.h | 12 ++
> >  include/linux/seccomp.h | 14 +++
> >  include/uapi/linux/ptrace.h |  3 +++
> >  kernel/bpf/syscall.c| 26 -
> >  kernel/ptrace.c |  7 ++
> >  kernel/seccomp.c| 57 
> > +
> >  6 files changed, 118 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index f57d7fe..bfd9cab 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -162,6 +162,8 @@ void bpf_register_prog_type(struct bpf_prog_type_list 
> > *tl);
> >  void bpf_register_map_type(struct bpf_map_type_list *tl);
> >
> >  struct bpf_prog *bpf_prog_get(u32 ufd);
> > +int bpf_prog_set(u32 ufd, struct bpf_prog *new);
> > +int bpf_new_fd(struct bpf_prog *prog, int flags);
> >  void bpf_prog_put(struct bpf_prog *prog);
> >  void bpf_prog_put_rcu(struct bpf_prog *prog);
> >
> > @@ -180,6 +182,16 @@ static inline struct bpf_prog *bpf_prog_get(u32 ufd)
> > return ERR_PTR(-EOPNOTSUPP);
> >  }
> >
> > +static inline int bpf_prog_set(u32 ufd, struct bpf_prog *new)
> > +{
> > +   return -EINVAL;
> > +}
> > +
> > +static inline int bpf_new_fd(struct bpf_prog *prog, int flags)
> > +{
> > +   return -EINVAL;
> > +}
> > +
> >  static inline void bpf_prog_put(struct bpf_prog *prog)
> >  {
> >  }
> > diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> > index a19ddac..41b083c 100644
> > --- a/include/linux/seccomp.h
> > +++ b/include/linux/seccomp.h
> > @@ -95,4 +95,18 @@ static inline void get_seccomp_filter(struct task_struct 
> > *tsk)
> > return;
> >  }
> >  #endif /* CONFIG_SECCOMP_FILTER */
> > +
> > +#if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_CHECKPOINT_RESTORE)
> > +extern long seccomp_get_filter_fd(struct task_struct *child);
> > +extern long seccomp_next_filter(struct task_struct *child, u32 fd);
> > +#else
> > +static inline long seccomp_get_filter_fd(struct task_struct *child)
> > +{
> > +   return -EINVAL;
> > +}
> > +static inline long seccomp_next_filter(struct task_struct *child, u32 fd)
> > +{
> > +   return -EINVAL;
> > +}
> > +#endif /* CONFIG_SECCOMP_FILTER && CONFIG_CHECKPOINT_RESTORE */
> >  #endif /* _LINUX_SECCOMP_H */
> > diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
> > index cf1019e..041c3c3 100644
> > --- a/include/uapi/linux/ptrace.h
> > +++ b/include/uapi/linux/ptrace.h
> > @@ -23,6 +23,9 @@
> >
> >  #define PTRACE_SYSCALL   24
> >
> > +#define PTRACE_SECCOMP_GET_FILTER_FD   40
> > +#define PTRACE_SECCOMP_NEXT_FILTER 41
> > +
> >  /* 0x4200-0x4300 are reserved for architecture-independent additions.  */
> >  #define PTRACE_SETOPTIONS  0x4200
> >  #define PTRACE_GETEVENTMSG 0x4201
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 58ae9f4..ac3ed1c 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -506,6 +506,30 @@ struct bpf_prog *bpf_prog_get(u32 ufd)
> >  }
> >  EXPORT_SYMBOL_GPL(bpf_prog_get

Re: [PATCH v2 3/5] ebpf: add a way to dump an eBPF program

2015-09-11 Thread Tycho Andersen

On Thu, Sep 10, 2015 at 07:29:42PM -0700, Alexei Starovoitov wrote:
> On Thu, Sep 10, 2015 at 06:21:00PM -0600, Tycho Andersen wrote:
> > +static int bpf_prog_dump(union bpf_attr *attr, union bpf_attr __user 
> > *uattr)
> > +{
> > +   int ufd = attr->prog_fd;
> > +   struct fd f = fdget(ufd);
> > +   struct bpf_prog *prog;
> > +   int ret = -EINVAL;
> > +
> > +   prog = get_prog(f);
> > +   if (IS_ERR(prog))
> > +   return PTR_ERR(prog);
> > +
> > +   /* For now, let's refuse to dump anything that isn't a seccomp program.
> > +* Other program types have support for maps, which our current dump
> > +* code doesn't support.
> > +*/
> > +   if (prog->type != BPF_PROG_TYPE_SECCOMP)
> > +   goto out;
> > +
> > +   ret = -EFAULT;
> > +   if (put_user(prog->len, >dump_insn_cnt))
> > +   goto out;
> > +
> > +   if (put_user((u8) prog->gpl_compatible, >gpl_compatible))
> > +   goto out;
> > +
> > +   if (attr->dump_insns) {
> > +   u32 len = prog->len * sizeof(struct bpf_insn);
> > +
> > +   if (copy_to_user(u64_to_ptr(attr->dump_insns),
> > +prog->insns, len) != 0)
> > +   goto out;
> > +   }
> > +
> > +   ret = 0;
> > +out:
> > +   return ret;
> 
> fdput() is missing in all error paths.

So it is, thanks!

Tycho
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2 4/5] seccomp: add a way to access filters via bpf fds

2015-09-11 Thread Tycho Andersen

On Fri, Sep 11, 2015 at 01:47:38PM +0200, Daniel Borkmann wrote:
> On 09/11/2015 02:21 AM, Tycho Andersen wrote:
> >This patch adds a way for a process that is "real root" to access the
> >seccomp filters of another process. The process first does a
> >PTRACE_SECCOMP_GET_FILTER_FD to get an fd with that process' seccomp filter
> >attached, and then iterates on this with PTRACE_SECCOMP_NEXT_FILTER using
> >bpf(BPF_PROG_DUMP) to dump the actual program at each step.
> >
> >Signed-off-by: Tycho Andersen <tycho.ander...@canonical.com>
> >CC: Kees Cook <keesc...@chromium.org>
> >CC: Will Drewry <w...@chromium.org>
> >CC: Oleg Nesterov <o...@redhat.com>
> >CC: Andy Lutomirski <l...@amacapital.net>
> >CC: Pavel Emelyanov <xe...@parallels.com>
> >CC: Serge E. Hallyn <serge.hal...@ubuntu.com>
> >CC: Alexei Starovoitov <a...@kernel.org>
> >CC: Daniel Borkmann <dan...@iogearbox.net>
> [...]
> >diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> >index 58ae9f4..ac3ed1c 100644
> >--- a/kernel/bpf/syscall.c
> >+++ b/kernel/bpf/syscall.c
> >@@ -506,6 +506,30 @@ struct bpf_prog *bpf_prog_get(u32 ufd)
> >  }
> >  EXPORT_SYMBOL_GPL(bpf_prog_get);
> >
> >+int bpf_prog_set(u32 ufd, struct bpf_prog *new)
> >+{
> >+struct fd f;
> >+struct bpf_prog *prog;
> >+
> >+f = fdget(ufd);
> >+
> >+prog = get_prog(f);
> >+if (!IS_ERR(prog) && prog)
> >+bpf_prog_put(prog);
> >+
> >+atomic_inc(>aux->refcnt);
> >+f.file->private_data = new;
> >+fdput(f);
> >+return 0;
> 
> So in case get_prog() fails, and for example f.file is infact NULL,
> you assign the bpf prog then to ERR_PTR(-EBADF)'s private_data? :(

Thanks, I will fix for the next version.

> >+}
> >+EXPORT_SYMBOL_GPL(bpf_prog_set);
> >+
> >+int bpf_new_fd(struct bpf_prog *prog, int flags)
> >+{
> >+return anon_inode_getfd("bpf-prog", _prog_fops, prog, flags);
> >+}
> >+EXPORT_SYMBOL_GPL(bpf_new_fd);
> 
> Any reason why these two need to be exported for modules? Which
> modules are using them?
> 
> I think modules should probably not mess with this.

No reason, I suppose. I was just exporting because bpf_prog_get is;
I'll drop it for the next version.

> If you already name it generic, it would also be good if bpf_new_fd()
> is used in case of maps that call anon_inode_getfd(), too.

I needed to call bpf_new_fd from kernel/seccomp.c, which it seems
shouldn't be able to reference bpf_prog_fops, which is why I added the
little "proxy". If we change the api to something like,

bpf_new_fd("bpf-map", _map_fops, map);
bpf_new_fd("bpf-prog", _prog_fops, prog);

I'd need access to bpf_prog_fops again. What about changing the name
to bpf_new_prog_fd?

Tycho
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] ebpf: emit correct src_reg for conditional jumps

2015-09-11 Thread Tycho Andersen

On Fri, Sep 11, 2015 at 08:40:43AM -0700, Alexei Starovoitov wrote:
> On Fri, Sep 11, 2015 at 11:28:24AM +0200, Daniel Borkmann wrote:
> > [off topic for this patch]
> > 
> > ... this requirement also breaks down for cases where you have a single
> > classic BPF instruction that maps into 2 or more eBPF instructions, hitting
> > BPF_MAXINSNS early at the time when you try to call into bpf(2) again with
> > the dumped result. If I recall correctly, Chrome seems to use up quite a lot
> > of insns space on (classic) seccomp-BPF, so this could potentially be a real
> > issue, next to artificially crafted examples that would fail.
> > 
> > With regards to the latter, also classic programs that could have holes of
> > dead code where you jump over them (see lib/test_bpf.c for examples) are
> > unfortunately allowed on the ABI side, so while the classic -> eBPF 
> > converter
> > may translate this dead region 1:1, it will be rejected by the verifier when
> > you dump and try to reload that. ;) Anyway, it's perhaps a silly example, 
> > but
> > it shows that this use-case can only work on a 'best-effort' basis, and not
> > cover everything of the classic BPF ABI, as opposed to having an interface
> > that can dump/restore classic BPF instructions directly.
> > 
> > Do you need this for checkpoint/restore? Wouldn't this therefore be better
> > done as dump/restore classic<->classic and eBPF<->eBPF directly? In socket
> > filters we do this by keeping orig_prog around, I guess it's better to bite
> > the bullet of additional memory overhead in case of classic seccomp-BPF, 
> > too.
> 
> I don't think so.
> When I played with libseccomp and chrome I saw that browser installed
> several bpf programs for every new tab. The longest program was 275
> classic insns which translated to slightly more eBPF insns
> (because in eBPF we don't have < and <= instructions, so converter
> needs to emit extra jump insns)
> Also they don't produce unreachable code.
> So getting over 4k limit and unreachable are rather hypothetical
> problems. I wouldn't want to have two interfaces to criu seccomp.
> eBPF is going to be used by seccomp as well, so having two is extra
> burden for user space criu.

I think the burden is not so huge once we have eBPF c/r in place (we
could just check the classic flag first, then dump the eBPF version or
something). However, it doesn't seem ideal to have to support two
interfaces.

> If we start hitting 4k limit for eBPF, we can easily bump it
> and/or add < and <= insns to eBPF (which was on my todo list anyway).

I was hoping you might say this (assuming you mean add < BPF_MAXINSNS
to the converter).

The dead code is certainly a problem, but perhaps we can wait on this
until there become some critical application that has this issue. I
was hoping we could get away without extra memory usage on the kernel
side (indeed, this patchset is mostly to try and avoid that).

Tycho
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v2 4/5] seccomp: add a way to access filters via bpf fds

2015-09-10 Thread Tycho Andersen

This patch adds a way for a process that is "real root" to access the
seccomp filters of another process. The process first does a
PTRACE_SECCOMP_GET_FILTER_FD to get an fd with that process' seccomp filter
attached, and then iterates on this with PTRACE_SECCOMP_NEXT_FILTER using
bpf(BPF_PROG_DUMP) to dump the actual program at each step.

Signed-off-by: Tycho Andersen <tycho.ander...@canonical.com>
CC: Kees Cook <keesc...@chromium.org>
CC: Will Drewry <w...@chromium.org>
CC: Oleg Nesterov <o...@redhat.com>
CC: Andy Lutomirski <l...@amacapital.net>
CC: Pavel Emelyanov <xe...@parallels.com>
CC: Serge E. Hallyn <serge.hal...@ubuntu.com>
CC: Alexei Starovoitov <a...@kernel.org>
CC: Daniel Borkmann <dan...@iogearbox.net>
---
 include/linux/bpf.h | 12 ++
 include/linux/seccomp.h | 14 +++
 include/uapi/linux/ptrace.h |  3 +++
 kernel/bpf/syscall.c| 26 -
 kernel/ptrace.c |  7 ++
 kernel/seccomp.c| 57 +
 6 files changed, 118 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f57d7fe..bfd9cab 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -162,6 +162,8 @@ void bpf_register_prog_type(struct bpf_prog_type_list *tl);
 void bpf_register_map_type(struct bpf_map_type_list *tl);
 
 struct bpf_prog *bpf_prog_get(u32 ufd);
+int bpf_prog_set(u32 ufd, struct bpf_prog *new);
+int bpf_new_fd(struct bpf_prog *prog, int flags);
 void bpf_prog_put(struct bpf_prog *prog);
 void bpf_prog_put_rcu(struct bpf_prog *prog);
 
@@ -180,6 +182,16 @@ static inline struct bpf_prog *bpf_prog_get(u32 ufd)
return ERR_PTR(-EOPNOTSUPP);
 }
 
+static inline int bpf_prog_set(u32 ufd, struct bpf_prog *new)
+{
+   return -EINVAL;
+}
+
+static inline int bpf_new_fd(struct bpf_prog *prog, int flags)
+{
+   return -EINVAL;
+}
+
 static inline void bpf_prog_put(struct bpf_prog *prog)
 {
 }
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index a19ddac..41b083c 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -95,4 +95,18 @@ static inline void get_seccomp_filter(struct task_struct 
*tsk)
return;
 }
 #endif /* CONFIG_SECCOMP_FILTER */
+
+#if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_CHECKPOINT_RESTORE)
+extern long seccomp_get_filter_fd(struct task_struct *child);
+extern long seccomp_next_filter(struct task_struct *child, u32 fd);
+#else
+static inline long seccomp_get_filter_fd(struct task_struct *child)
+{
+   return -EINVAL;
+}
+static inline long seccomp_next_filter(struct task_struct *child, u32 fd)
+{
+   return -EINVAL;
+}
+#endif /* CONFIG_SECCOMP_FILTER && CONFIG_CHECKPOINT_RESTORE */
 #endif /* _LINUX_SECCOMP_H */
diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
index cf1019e..041c3c3 100644
--- a/include/uapi/linux/ptrace.h
+++ b/include/uapi/linux/ptrace.h
@@ -23,6 +23,9 @@
 
 #define PTRACE_SYSCALL   24
 
+#define PTRACE_SECCOMP_GET_FILTER_FD   40
+#define PTRACE_SECCOMP_NEXT_FILTER 41
+
 /* 0x4200-0x4300 are reserved for architecture-independent additions.  */
 #define PTRACE_SETOPTIONS  0x4200
 #define PTRACE_GETEVENTMSG 0x4201
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 58ae9f4..ac3ed1c 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -506,6 +506,30 @@ struct bpf_prog *bpf_prog_get(u32 ufd)
 }
 EXPORT_SYMBOL_GPL(bpf_prog_get);
 
+int bpf_prog_set(u32 ufd, struct bpf_prog *new)
+{
+   struct fd f;
+   struct bpf_prog *prog;
+
+   f = fdget(ufd);
+
+   prog = get_prog(f);
+   if (!IS_ERR(prog) && prog)
+   bpf_prog_put(prog);
+
+   atomic_inc(>aux->refcnt);
+   f.file->private_data = new;
+   fdput(f);
+   return 0;
+}
+EXPORT_SYMBOL_GPL(bpf_prog_set);
+
+int bpf_new_fd(struct bpf_prog *prog, int flags)
+{
+   return anon_inode_getfd("bpf-prog", _prog_fops, prog, flags);
+}
+EXPORT_SYMBOL_GPL(bpf_new_fd);
+
 /* last field in 'union bpf_attr' used by this command */
 #defineBPF_PROG_LOAD_LAST_FIELD kern_version
 
@@ -572,7 +596,7 @@ static int bpf_prog_load(union bpf_attr *attr)
if (err < 0)
goto free_used_maps;
 
-   err = anon_inode_getfd("bpf-prog", _prog_fops, prog, O_RDWR | 
O_CLOEXEC);
+   err = bpf_new_fd(prog, O_RDWR | O_CLOEXEC);
if (err < 0)
/* failed to allocate fd */
goto free_used_maps;
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index c8e0e05..a151c35 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -1003,6 +1003,13 @@ int ptrace_request(struct task_struct *child, long 
request,
break;
}
 #endif
+
+   case PTRACE_SECCOMP_GET_FILTER_FD:
+   return seccomp_get_filter_fd(child);
+
+   case PT

[PATCH v2 5/5] seccomp: add a way to attach a filter via eBPF fd

2015-09-10 Thread Tycho Andersen

This is the final bit needed to support seccomp filters created via the bpf
syscall. The patch adds a new seccomp operation SECCOMP_MODE_FILTER_EBPF,
which takes exactly one command (presumably to be expanded upon later when
seccomp EBPFs support more interesting things) and an argument struct
similar to that of bpf(), although the size is explicit in the struct to
avoid changing the signature of seccomp().

v2: Don't abuse seccomp's third argument; use a separate command and a
pointer to a structure instead.

Signed-off-by: Tycho Andersen <tycho.ander...@canonical.com>
CC: Kees Cook <keesc...@chromium.org>
CC: Will Drewry <w...@chromium.org>
CC: Oleg Nesterov <o...@redhat.com>
CC: Andy Lutomirski <l...@amacapital.net>
CC: Pavel Emelyanov <xe...@parallels.com>
CC: Serge E. Hallyn <serge.hal...@ubuntu.com>
CC: Alexei Starovoitov <a...@kernel.org>
CC: Daniel Borkmann <dan...@iogearbox.net>
---
 include/uapi/linux/seccomp.h |  16 +
 kernel/seccomp.c | 135 ++-
 2 files changed, 138 insertions(+), 13 deletions(-)

diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 0f238a4..a8694e2 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -13,10 +13,14 @@
 /* Valid operations for seccomp syscall. */
 #define SECCOMP_SET_MODE_STRICT0
 #define SECCOMP_SET_MODE_FILTER1
+#define SECCOMP_MODE_FILTER_EBPF   2
 
 /* Valid flags for SECCOMP_SET_MODE_FILTER */
 #define SECCOMP_FILTER_FLAG_TSYNC  1
 
+/* Valid cmds for SECCOMP_MODE_FILTER_EBPF */
+#define SECCOMP_EBPF_ADD_FD0
+
 /*
  * All BPF programs must return a 32-bit value.
  * The bottom 16-bits are for optional return data.
@@ -51,4 +55,16 @@ struct seccomp_data {
__u64 args[6];
 };
 
+struct seccomp_ebpf {
+   unsigned int size;
+
+   union {
+   /* SECCOMP_EBPF_ADD_FD */
+   struct {
+   unsigned intadd_flags;
+   __u32   add_fd;
+   };
+   };
+};
+
 #endif /* _UAPI_LINUX_SECCOMP_H */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 1856f69..e78175a 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -65,6 +65,9 @@ struct seccomp_filter {
 /* Limit any path through the tree to 256KB worth of instructions. */
 #define MAX_INSNS_PER_PATH ((1 << 18) / sizeof(struct sock_filter))
 
+static long seccomp_install_filter(unsigned int flags,
+  struct seccomp_filter *prepared);
+
 /*
  * Endianness is explicitly ignored and left for BPF program authors to manage
  * as per the specific architecture.
@@ -356,17 +359,6 @@ static struct seccomp_filter 
*seccomp_prepare_filter(struct sock_fprog *fprog)
 
BUG_ON(INT_MAX / fprog->len < sizeof(struct sock_filter));
 
-   /*
-* Installing a seccomp filter requires that the task has
-* CAP_SYS_ADMIN in its namespace or be running with no_new_privs.
-* This avoids scenarios where unprivileged tasks can affect the
-* behavior of privileged children.
-*/
-   if (!task_no_new_privs(current) &&
-   security_capable_noaudit(current_cred(), current_user_ns(),
-CAP_SYS_ADMIN) != 0)
-   return ERR_PTR(-EACCES);
-
/* Allocate a new seccomp_filter */
sfilter = kzalloc(sizeof(*sfilter), GFP_KERNEL | __GFP_NOWARN);
if (!sfilter)
@@ -510,8 +502,105 @@ static void seccomp_send_sigsys(int syscall, int reason)
info.si_syscall = syscall;
force_sig_info(SIGSYS, , current);
 }
+
 #endif /* CONFIG_SECCOMP_FILTER */
 
+#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_SECCOMP_FILTER)
+static struct seccomp_filter *seccomp_prepare_ebpf(int fd)
+{
+   struct seccomp_filter *ret;
+   struct bpf_prog *prog;
+
+   prog = bpf_prog_get(fd);
+   if (IS_ERR(prog))
+   return (struct seccomp_filter *) prog;
+
+   if (prog->type != BPF_PROG_TYPE_SECCOMP) {
+   bpf_prog_put(prog);
+   return ERR_PTR(-EINVAL);
+   }
+
+   ret = kzalloc(sizeof(*ret), GFP_KERNEL | __GFP_NOWARN);
+   if (!ret) {
+   bpf_prog_put(prog);
+   return ERR_PTR(-ENOMEM);
+   }
+
+   ret->prog = prog;
+   atomic_set(>usage, 1);
+
+   /* Intentionally don't bpf_prog_put() here, because the underlying prog
+* is refcounted too and we're holding a reference from the struct
+* seccomp_filter object.
+*/
+   return ret;
+}
+
+static long seccomp_ebpf_add_fd(struct seccomp_ebpf *ebpf)
+{
+   struct seccomp_filter *prepared;
+
+   prepared = seccomp_prepare_ebpf(ebpf->add_fd);
+   if (IS_ERR(prepared))
+   return PTR_ERR(prepared);
+
+   return seccomp_install_filter(ebpf->add_flags, prepared);
+}
+
+static l

[PATCH v2 3/5] ebpf: add a way to dump an eBPF program

2015-09-10 Thread Tycho Andersen

This commit adds a way to dump eBPF programs. The initial implementation
doesn't support maps, and therefore only allows dumping seccomp ebpf
programs which themselves don't currently support maps.

v2: don't export a prog_id for the filter

Signed-off-by: Tycho Andersen <tycho.ander...@canonical.com>
CC: Kees Cook <keesc...@chromium.org>
CC: Will Drewry <w...@chromium.org>
CC: Oleg Nesterov <o...@redhat.com>
CC: Andy Lutomirski <l...@amacapital.net>
CC: Pavel Emelyanov <xe...@parallels.com>
CC: Serge E. Hallyn <serge.hal...@ubuntu.com>
CC: Alexei Starovoitov <a...@kernel.org>
CC: Daniel Borkmann <dan...@iogearbox.net>
---
 include/uapi/linux/bpf.h | 14 ++
 kernel/bpf/syscall.c | 41 +
 2 files changed, 55 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 631cdee..e037a76 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -107,6 +107,13 @@ enum bpf_cmd {
 * returns fd or negative error
 */
BPF_PROG_LOAD,
+
+   /* dump an existing bpf
+* err = bpf(BPF_PROG_DUMP, union bpf_attr *attr, u32 size)
+* Using attr->prog_fd, attr->dump_insn_cnt, attr->dump_insns
+* returns zero or negative error
+*/
+   BPF_PROG_DUMP,
 };
 
 enum bpf_map_type {
@@ -161,6 +168,13 @@ union bpf_attr {
__aligned_u64   log_buf;/* user supplied buffer */
__u32   kern_version;   /* checked when 
prog_type=kprobe */
};
+
+   struct { /* anonymous struct used by BPF_PROG_DUMP command */
+   __u32   prog_fd;
+   __u32   dump_insn_cnt;
+   __aligned_u64   dump_insns; /* user supplied buffer */
+   __u8gpl_compatible;
+   };
 } __attribute__((aligned(8)));
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index dc9b464..58ae9f4 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -586,6 +586,44 @@ free_prog:
return err;
 }
 
+static int bpf_prog_dump(union bpf_attr *attr, union bpf_attr __user *uattr)
+{
+   int ufd = attr->prog_fd;
+   struct fd f = fdget(ufd);
+   struct bpf_prog *prog;
+   int ret = -EINVAL;
+
+   prog = get_prog(f);
+   if (IS_ERR(prog))
+   return PTR_ERR(prog);
+
+   /* For now, let's refuse to dump anything that isn't a seccomp program.
+* Other program types have support for maps, which our current dump
+* code doesn't support.
+*/
+   if (prog->type != BPF_PROG_TYPE_SECCOMP)
+   goto out;
+
+   ret = -EFAULT;
+   if (put_user(prog->len, >dump_insn_cnt))
+   goto out;
+
+   if (put_user((u8) prog->gpl_compatible, >gpl_compatible))
+   goto out;
+
+   if (attr->dump_insns) {
+   u32 len = prog->len * sizeof(struct bpf_insn);
+
+   if (copy_to_user(u64_to_ptr(attr->dump_insns),
+prog->insns, len) != 0)
+   goto out;
+   }
+
+   ret = 0;
+out:
+   return ret;
+}
+
 SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, 
size)
 {
union bpf_attr attr = {};
@@ -650,6 +688,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, 
uattr, unsigned int, siz
case BPF_PROG_LOAD:
err = bpf_prog_load();
break;
+   case BPF_PROG_DUMP:
+   err = bpf_prog_dump(, uattr);
+   break;
default:
err = -EINVAL;
break;
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v2 1/5] ebpf: add a seccomp program type

2015-09-10 Thread Tycho Andersen

seccomp uses eBPF as its underlying storage and execution format, and eBPF
has features that seccomp would like to make use of in the future. This
patch adds a formal seccomp type to the eBPF verifier.

The current implementation of the seccomp eBPF type is very limited, and
doesn't support some interesting features (notably, maps) of eBPF. However,
the primary motivation for this patchset is to enable checkpoint/restore
for seccomp filters later in the series, to this limited feature set is ok
for now.

v2: * don't allow seccomp eBPF programs to call any functions
* get rid of superfluous seccomp_convert_ctx_access

Signed-off-by: Tycho Andersen <tycho.ander...@canonical.com>
CC: Kees Cook <keesc...@chromium.org>
CC: Will Drewry <w...@chromium.org>
CC: Oleg Nesterov <o...@redhat.com>
CC: Andy Lutomirski <l...@amacapital.net>
CC: Pavel Emelyanov <xe...@parallels.com>
CC: Serge E. Hallyn <serge.hal...@ubuntu.com>
CC: Alexei Starovoitov <a...@kernel.org>
CC: Daniel Borkmann <dan...@iogearbox.net>
---
 include/uapi/linux/bpf.h |  1 +
 net/core/filter.c| 31 +++
 2 files changed, 32 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 92a48e2..631cdee 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -123,6 +123,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_KPROBE,
BPF_PROG_TYPE_SCHED_CLS,
BPF_PROG_TYPE_SCHED_ACT,
+   BPF_PROG_TYPE_SECCOMP,
 };
 
 #define BPF_PSEUDO_MAP_FD  1
diff --git a/net/core/filter.c b/net/core/filter.c
index 13079f0..faaae67 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1612,6 +1612,15 @@ tc_cls_act_func_proto(enum bpf_func_id func_id)
}
 }
 
+static const struct bpf_func_proto *
+seccomp_func_proto(enum bpf_func_id func_id)
+{
+   /* At some point in the future seccomp filters may grow support for
+* eBPF functions. For now, these are disabled.
+*/
+   return NULL;
+}
+
 static bool __is_valid_access(int off, int size, enum bpf_access_type type)
 {
/* check bounds */
@@ -1662,6 +1671,17 @@ static bool tc_cls_act_is_valid_access(int off, int size,
return __is_valid_access(off, size, type);
 }
 
+static bool seccomp_is_valid_access(int off, int size,
+   enum bpf_access_type type)
+{
+   if (type == BPF_WRITE)
+   return false;
+
+   if (off < 0 || off >= sizeof(struct seccomp_data) || off & 3)
+   return false;
+
+   return true;
+}
 static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
  int src_reg, int ctx_off,
  struct bpf_insn *insn_buf)
@@ -1795,6 +1815,11 @@ static const struct bpf_verifier_ops tc_cls_act_ops = {
.convert_ctx_access = bpf_net_convert_ctx_access,
 };
 
+static const struct bpf_verifier_ops seccomp_ops = {
+   .get_func_proto = seccomp_func_proto,
+   .is_valid_access = seccomp_is_valid_access,
+};
+
 static struct bpf_prog_type_list sk_filter_type __read_mostly = {
.ops = _filter_ops,
.type = BPF_PROG_TYPE_SOCKET_FILTER,
@@ -1810,11 +1835,17 @@ static struct bpf_prog_type_list sched_act_type 
__read_mostly = {
.type = BPF_PROG_TYPE_SCHED_ACT,
 };
 
+static struct bpf_prog_type_list seccomp_type __read_mostly = {
+   .ops = _ops,
+   .type = BPF_PROG_TYPE_SECCOMP,
+};
+
 static int __init register_sk_filter_ops(void)
 {
bpf_register_prog_type(_filter_type);
bpf_register_prog_type(_cls_type);
bpf_register_prog_type(_act_type);
+   bpf_register_prog_type(_type);
 
return 0;
 }
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v2 2/5] seccomp: make underlying bpf ref counted as well

2015-09-10 Thread Tycho Andersen

In the next patch, we're going to add a way to access the underlying
filters via bpf fds. This means that we need to ref-count both the
struct seccomp_filter objects and the struct bpf_prog objects separately,
in case a process dies but a filter is still referred to by another
process.

Additionally, we mark classic converted seccomp filters as seccomp eBPF
programs, since they are a subset of what is supported in seccomp eBPF.

Signed-off-by: Tycho Andersen <tycho.ander...@canonical.com>
CC: Kees Cook <keesc...@chromium.org>
CC: Will Drewry <w...@chromium.org>
CC: Oleg Nesterov <o...@redhat.com>
CC: Andy Lutomirski <l...@amacapital.net>
CC: Pavel Emelyanov <xe...@parallels.com>
CC: Serge E. Hallyn <serge.hal...@ubuntu.com>
CC: Alexei Starovoitov <a...@kernel.org>
CC: Daniel Borkmann <dan...@iogearbox.net>
---
 kernel/seccomp.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 245df6b..afaeddf 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -378,6 +378,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct 
sock_fprog *fprog)
}
 
atomic_set(>usage, 1);
+   atomic_set(>prog->aux->refcnt, 1);
+   sfilter->prog->type = BPF_PROG_TYPE_SECCOMP;
 
return sfilter;
 }
@@ -470,7 +472,7 @@ void get_seccomp_filter(struct task_struct *tsk)
 static inline void seccomp_filter_free(struct seccomp_filter *filter)
 {
if (filter) {
-   bpf_prog_free(filter->prog);
+   bpf_prog_put(filter->prog);
kfree(filter);
}
 }
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] ebpf: emit correct src_reg for conditional jumps

2015-09-10 Thread Tycho Andersen

Instead of always emitting BPF_REG_X, let's emit BPF_REG_X only when the
source actually is BPF_X. This causes programs generated by the classic
converter to not be importable via bpf(), as the eBPF verifier checks that
the src_reg is correct or 0. While not a problem yet, this will be a
problem when BPF_PROG_DUMP lands, and we can potentially dump and re-import
programs generated by the converter.

Signed-off-by: Tycho Andersen <tycho.ander...@canonical.com>
CC: Alexei Starovoitov <a...@kernel.org>
CC: Daniel Borkmann <dan...@iogearbox.net>
---
 net/core/filter.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 13079f0..05a04ea 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -478,9 +478,9 @@ do_pass:
bpf_src = BPF_X;
} else {
insn->dst_reg = BPF_REG_A;
-   insn->src_reg = BPF_REG_X;
insn->imm = fp->k;
bpf_src = BPF_SRC(fp->code);
+   insn->src_reg = bpf_src == BPF_X ? BPF_REG_X : 
0;
}
 
/* Common case where 'jump_false' is next insn. */
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

v2 of seccomp filter c/r patches

2015-09-10 Thread Tycho Andersen

Hi all,

Here is v2 of the seccomp filter c/r set. The patch notes have individual
changes from the last series, but there are two points not noted:

* The series still does not allow us to correctly restore state for programs
  that will use SECCOMP_FILTER_FLAG_TSYNC in the future. Given that we want to
  keep seccomp_filter's identity, I think something along the lines of another
  seccomp command like SECCOMP_INHERIT_PARENT is needed (although I'm not sure
  if this can even be done yet). In addition, we'll need a kcmp command for
  figuring out if filters are the same, although this too needs to compare
  seccomp_filter objects, so it's a little screwy. Any thoughts on how to do
  this nicely are welcome.

* I've dropped the bpf converter bug from the set and will submit it
  separately.

Alexei mentioned that this should go via net-next to minimize cross-tree
conflicts. Does that make sense here?

Thanks,

Tycho

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 5/6] seccomp: add a way to attach a filter via eBPF fd

2015-09-09 Thread Tycho Andersen

On Wed, Sep 09, 2015 at 08:14:04AM -0700, Alexei Starovoitov wrote:
> On Wed, Sep 09, 2015 at 08:47:24AM -0600, Tycho Andersen wrote:
> > On Tue, Sep 08, 2015 at 05:07:03PM -0700, Kees Cook wrote:
> > >
> > > Yeah, bpf's union looks good. Let's add a "command" flag, though:
> > > 
> > > seccomp(SECCOMP_MODE_FILTER_EBPF, int cmd, union, size);
> > > 
> > > And this cmd could be ADD_FD or something?
> > > 
> > > How's that look?
> > 
> > I think we can drop the size (using the same strategy as bpf() and
> > checking for zeroes at the end), and keep the same signature for
> > seccomp(); so:
> > 
> > seccomp(SECCOMP_MODE_FILTER_EBPF, SECCOMP_ADD_BPF_FD, )
> > 
> > Yes, I'll use this in the next version.
> 
> actually bpf() has size as the last argument:
> SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, 
> size)
> perf_event_open() doesn't and size is embedded as one of the fields.
> Both approaches are equivally powerfull from extensitiblity
> point of view. My preference was to keep size as an explicit
> argument.

Yep, sorry that was poorly written. I meant keeping the size as a
member of the struct as Michael originally suggested, mostly to avoid
having to change the signature of seccomp().

Tycho
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/6] ebpf: add a seccomp program type

2015-09-09 Thread Tycho Andersen

On Fri, Sep 04, 2015 at 02:08:37PM -0700, Kees Cook wrote:
> On Fri, Sep 4, 2015 at 2:06 PM, Tycho Andersen
> <tycho.ander...@canonical.com> wrote:
> > On Fri, Sep 04, 2015 at 01:34:12PM -0700, Kees Cook wrote:
> >> On Fri, Sep 4, 2015 at 9:04 AM, Tycho Andersen
> >> <tycho.ander...@canonical.com> wrote:
> >> > +static const struct bpf_func_proto *
> >> > +seccomp_func_proto(enum bpf_func_id func_id)
> >> > +{
> >> > +   /* Right now seccomp eBPF loading doesn't support maps; seccomp 
> >> > filters
> >> > +* are considered to be read-only after they're installed, so 
> >> > map fds
> >> > +* probably need to be invalidated when a seccomp filter with 
> >> > maps is
> >> > +* installed.
> >> > +*
> >> > +* The rest of these might be reasonable to call from seccomp, 
> >> > so we
> >> > +* export them.
> >> > +*/
> >> > +   switch (func_id) {
> >> > +   case BPF_FUNC_ktime_get_ns:
> >> > +   return _ktime_get_ns_proto;
> >> > +   case BPF_FUNC_trace_printk:
> >> > +   return bpf_get_trace_printk_proto();
> >> > +   case BPF_FUNC_get_prandom_u32:
> >> > +   return _get_prandom_u32_proto;
> >> > +   case BPF_FUNC_get_smp_processor_id:
> >> > +   return _get_smp_processor_id_proto;
> >> > +   case BPF_FUNC_tail_call:
> >> > +   return _tail_call_proto;
> >> > +   case BPF_FUNC_get_current_pid_tgid:
> >> > +   return _get_current_pid_tgid_proto;
> >> > +   case BPF_FUNC_get_current_uid_gid:
> >> > +   return _get_current_uid_gid_proto;
> >> > +   case BPF_FUNC_get_current_comm:
> >> > +   return _get_current_comm_proto;
> >> > +   default:
> >> > +   return NULL;
> >> > +   }
> >> > +}
> >>
> >> While this list is probably fine, I don't want to mix the addition of
> >> eBPF functions to the seccomp ABI with the CRIU changes. No function
> >> calls are currently possible and it should stay that way.
> >
> > Ok, I can remove them.
> >
> >> I was expecting to see a validator, similar to the existing BPF
> >> validator that is called when creating seccomp filters currently. Can
> >> we add a similar validator for new BPF_PROG_TYPE_SECCOMP?
> >
> > That's effectively what this patch does; when the eBPF is loaded via
> > bpf(), you tell bpf() you want a BPF_PROG_TYPE_SECCOMP, and it invokes
> > this validation/translation code, i.e. it uses
> > seccomp_is_valid_access() to check and make sure access are aligned
> > and inside struct seccomp_data.
> 
> What about limiting the possible instructions?

I totally overlooked this. A quick glance through the eBPF verifier
makes me think that we can just add another function to struct
bpf_verifier_ops called valid_instruction, which shouldn't be too
hard. Perhaps a more interesting question is what to allow:

BPF_LD(X) and BPF_ST(X): it looks like all types of stores are
  allowed, and only BPF_MEM and BPF_IMM loads are allowed; I think
  these can stay the same. BPF_XADD is new in eBPF, and I don't think
  we need it for seccomp (yet), since we don't have any shared memory
  via maps.

BPF_ALU: It looks like we're also not allowing regular BPF_ALU
  instruction BPF_MOD; eBPF adds a few ones: BPF_MOV (register move),
  BPF_ARSH (sign extended right shift), and BPF_END (endianness
  conversion), wich I think should all be safe. In particular, we need
  to allow BPF_MOV at least, since that's how the converter implements
  BPF_MISC | BPF_TAX from classic.

BPF_ALU64: I think we can safely allow all these as above, since
  they're just the 64-bit versions.

BPF_JMP: eBPF adds BPF_JNE, BPF_JSGT, BPF_JSGE, BPF_CALL, and
  BPF_EXIT, which I think all should be safe (except maybe BPF_CALL
  since we're not allowing functions really). Again we have to allow
  one of the new eBPF codes, as the converter implements BPF_RET as
  BPF_JMP | BPF_EXIT.

Thoughts?

Tycho
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 5/6] seccomp: add a way to attach a filter via eBPF fd

2015-09-09 Thread Tycho Andersen

On Tue, Sep 08, 2015 at 05:07:03PM -0700, Kees Cook wrote:
>
> Yeah, bpf's union looks good. Let's add a "command" flag, though:
> 
> seccomp(SECCOMP_MODE_FILTER_EBPF, int cmd, union, size);
> 
> And this cmd could be ADD_FD or something?
> 
> How's that look?

I think we can drop the size (using the same strategy as bpf() and
checking for zeroes at the end), and keep the same signature for
seccomp(); so:

seccomp(SECCOMP_MODE_FILTER_EBPF, SECCOMP_ADD_BPF_FD, )

Yes, I'll use this in the next version.

Tycho
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/6] ebpf: add a way to dump an eBPF program

2015-09-09 Thread Tycho Andersen

On Fri, Sep 04, 2015 at 06:27:27PM -0600, Tycho Andersen wrote:
> On Fri, Sep 04, 2015 at 04:08:53PM -0700, Andy Lutomirski wrote:
> > On Fri, Sep 4, 2015 at 3:28 PM, Tycho Andersen
> > <tycho.ander...@canonical.com> wrote:
> > > On Fri, Sep 04, 2015 at 02:48:03PM -0700, Andy Lutomirski wrote:
> > >> On Fri, Sep 4, 2015 at 1:45 PM, Tycho Andersen
> > >> <tycho.ander...@canonical.com> wrote:
> > >> > On Fri, Sep 04, 2015 at 01:17:30PM -0700, Kees Cook wrote:
> > >> >> On Fri, Sep 4, 2015 at 9:04 AM, Tycho Andersen
> > >> >> <tycho.ander...@canonical.com> wrote:
> > >> >> > This commit adds a way to dump eBPF programs. The initial 
> > >> >> > implementation
> > >> >> > doesn't support maps, and therefore only allows dumping seccomp ebpf
> > >> >> > programs which themselves don't currently support maps.
> > >> >> >
> > >> >> > We export the GPL bit as well as a unique ID for the program so that
> > >> >>
> > >> >> This unique ID appears to be the heap address for the prog. That's a
> > >> >> huge leak, and should not be done. We don't want to introduce new
> > >> >> kernel address leaks while we're trying to fix the remaining ones.
> > >> >> Shouldn't the "unique ID" be the fd itself? I imagine KCMP_FILE
> > >> >> could be used, for example.
> > >> >
> > >> > No; we acquire the fd per process, so if a task installs a filter and
> > >> > then forks N times, we'll grab N (+1) copies of the filter from N (+1)
> > >> > different file descriptors. Ideally, we'd have some way to figure out
> > >> > that these were all the same. Some sort of prog_id is one way,
> > >> > although there may be others.
> > >>
> > >> I disagree a bit.  I think we want the actual hierarchy to be a
> > >> well-defined thing, because I have plans to make the hierarchy
> > >> actually do something.  That means that we'll need to have a more
> > >> exact way to dump the hierarchy than "these two filters are identical"
> > >> or "these two filters are not identical".
> > >
> > > Can you elaborate on what this would look like? I think with the
> > > "these two filters are the same" primitive (the same in the sense that
> > > they were inherited during a fork, not just that
> > > memcmp(filter1->insns, filter2->insns) == 0) you can infer the entire
> > > hierarchy, however clunky it may be to do so.
> > >
> > > Another issue is that KCMP_FILE won't work in this case, as it
> > > effectively compares the struct file *, which will be different since
> > > we need to call anon_inode_getfd() for each call of
> > > ptrace(PTRACE_SECCOMP_GET_FILTER_FD). We could add a KCMP_BPF (or just
> > > a KCMP_FILE_PRIVATE_DATA, since that's effectively what it would be).
> > > Does that make sense? [added Cyrill]
> > >
> > 
> > I don't really know what it would look like.  I think we want a way to
> > compare struct seccomp_filter pointers.

Here's a thought,

The set I'm currently proposing effectively separates the ref-counting
of the struct seccomp_filter from the struct bpf_prog (by necessity,
since we're referring to filters from fds). What if we went a little
futher, and made a copy of each seccomp_filter on fork(), keeping it
pointed at the same bpf_prog but adding some metadata about how it was
inherited (tsk->seccomp.filter->inheritence_count++ perhaps). This
would still require this change:

> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 9c6bea6..efc3f36 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -239,7 +239,7 @@ static int is_ancestor(struct seccomp_filter *parent,
>   if (parent == NULL)
>   return 1;
>   for (; child; child = child->prev)
> - if (child == parent)
> + if (child->prog == parent->prog)
>   return 1;
>   return 0;
>  }

to get the ancestry right on restore, but the change would make more
sense given that the seccomp_filter pointers were in fact unique at
this point.

To access it, we can change the current set to instead of iterating on
bpf_prog, to iterate on seccomp_filter, and add a few seccomp
commands, so the whole sequence looks like this:

seccomp_fd = ptrace(PTRACE_SECCOMP_GET_FILTER_FD, pid);
if (seccomp(SECCOMP_MODE_FILTER_EBPF, SECCOMP_EBPF_INHERITANCE, fd) > 0) {
/* mark

Re: [PATCH 3/6] ebpf: add a way to dump an eBPF program

2015-09-09 Thread Tycho Andersen

On Wed, Sep 09, 2015 at 04:44:24PM -0700, Andy Lutomirski wrote:
> On Wed, Sep 9, 2015 at 3:34 PM, Tycho Andersen
> <tycho.ander...@canonical.com> wrote:
> >
> > Here's a thought,
> >
> > The set I'm currently proposing effectively separates the ref-counting
> > of the struct seccomp_filter from the struct bpf_prog (by necessity,
> > since we're referring to filters from fds). What if we went a little
> > futher, and made a copy of each seccomp_filter on fork(), keeping it
> > pointed at the same bpf_prog but adding some metadata about how it was
> > inherited (tsk->seccomp.filter->inheritence_count++ perhaps). This
> > would still require this change:
> 
> Won't that break the tsync mechanism?

We'll need the change I posted (is_ancestor comparing the underlying
bpf_prog instead of the seccomp_filter), but then I think it'll work.
I guess we'll need to do some more bookkeeping when we install filters
via TSYNC since each thread would need its own seccomp_filter, and
we'd also have to decide whether a filter installed via TSYNC was
inherited or not.

Am I missing something?

Tycho
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/6] ebpf: add a way to dump an eBPF program

2015-09-09 Thread Tycho Andersen

On Wed, Sep 09, 2015 at 05:44:06PM -0700, Andy Lutomirski wrote:
> On Wed, Sep 9, 2015 at 5:13 PM, Tycho Andersen
> <tycho.ander...@canonical.com> wrote:
> > On Wed, Sep 09, 2015 at 04:44:24PM -0700, Andy Lutomirski wrote:
> >> On Wed, Sep 9, 2015 at 3:34 PM, Tycho Andersen
> >> <tycho.ander...@canonical.com> wrote:
> >> >
> >> > Here's a thought,
> >> >
> >> > The set I'm currently proposing effectively separates the ref-counting
> >> > of the struct seccomp_filter from the struct bpf_prog (by necessity,
> >> > since we're referring to filters from fds). What if we went a little
> >> > futher, and made a copy of each seccomp_filter on fork(), keeping it
> >> > pointed at the same bpf_prog but adding some metadata about how it was
> >> > inherited (tsk->seccomp.filter->inheritence_count++ perhaps). This
> >> > would still require this change:
> >>
> >> Won't that break the tsync mechanism?
> >
> > We'll need the change I posted (is_ancestor comparing the underlying
> > bpf_prog instead of the seccomp_filter), but then I think it'll work.
> > I guess we'll need to do some more bookkeeping when we install filters
> > via TSYNC since each thread would need its own seccomp_filter, and
> > we'd also have to decide whether a filter installed via TSYNC was
> > inherited or not.
> >
> > Am I missing something?
> 
> Yes.  I don't think that:
> 
> int fd = [create an ebpf fd];
> if (fork()) {
>   /* Process A */
>   seccomp(attach fd);
>   ...
> } else {
>   /* Process B */
>   seccomp(attach fd);
>   ...
> }
> 
> should result in processes A and B being considered to have the same
> seccomp_filter state.  In particular, I eventually want to make the
> seccomp_filter state be considerably more interesting than just the
> bpf program.
> 
> There's another severe problem, I think.  Suppose that ebpf1 and ebpf2
> are ebpf fds.  If processes C and D start out with no filters at all,
> C attaches ebpf1 and ebpf2, and D attaches just ebpf2, then C and D
> are definitely *not* in the same state, and neither is an ancestor of
> the other.

Ah, yes.

> IOW I really do think that seccomp_filter should have identity.

What if we kept a pointer to the seccomp_filter that was inherited on
fork()? Everything "below" that in the tree is not inherited, and
everything above is. Unfortunately, it's not obvious how to restore
this state.

Tycho
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/6] ebpf: add a seccomp program type

2015-09-09 Thread Tycho Andersen

On Wed, Sep 09, 2015 at 10:27:08AM -0700, Kees Cook wrote:
> On Wed, Sep 9, 2015 at 9:52 AM, Alexei Starovoitov
> <alexei.starovoi...@gmail.com> wrote:
> > On Wed, Sep 09, 2015 at 09:37:51AM -0700, Kees Cook wrote:
> >> On Wed, Sep 9, 2015 at 9:09 AM, Daniel Borkmann <dan...@iogearbox.net> 
> >> wrote:
> >> > On 09/09/2015 06:07 PM, Alexei Starovoitov wrote:
> >> >>
> >> >> On Wed, Sep 09, 2015 at 09:50:35AM -0600, Tycho Andersen wrote:
> >> >
> >> > [...]
> >> >>>
> >> >>> Thoughts?
> >> >>
> >> >>
> >> >> Please do not add any per-instruction hacks. None of them are
> >> >> necessary. Classic had to do extra ugly checks in seccomp only
> >> >> because verifier wasn't flexible enough.
> >> >> If you don't want to see any BPF_CALL in seccomp, just have
> >> >> empty get_func_proto() callback for BPF_PROG_TYPE_SECCOMP
> >> >> and verifier will reject all calls.
> >> >> Currently we have only two non-generic instrucitons
> >> >> LD_ABS and LD_IND that are avaialable for sockets/TC only,
> >> >> because these are legacy instructions and we had to make
> >> >> exceptions for them.
> >> >
> >> > Yep, +1.
> >>
> >> Hrmpf. This adds to the cognitive load for accepting this patch
> >> series. :P Now I have to convince myself that there is no additional
> >> exposure to seccomp by using the entire set of eBPF instructions.
> >> While I'm pretty sure it'll be fine, I really don't want to risk being
> >> wrong and opening a hole here. I will spend some time looking at the
> >> new eBPF instructions...
> >
> > note, as was discussed many times before, there is no pointer leak
> > prevention pass yet, so eBPF is root only.
> > Once the pass is complete it will prevent passing addresses to
> > functions, storing them in maps and returning from the program.
> 
> Tycho, are you building new eBPF filters as the root user and then
> attaching them later?

Yep, exactly.

> I was imagining you were going to need this entirely as non-root.

We can't exactly because of the real root restriction on bpf(). So we
just open the bpf fds when we're real root and keep them until the
very end of the restore when we finally install the seccomp filters.
Not ideal, of course, but I assume we can change how this works once
bpf() deemed safe for non-root.

Tycho
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 5/6] seccomp: add a way to attach a filter via eBPF fd

2015-09-08 Thread Tycho Andersen

On Sat, Sep 05, 2015 at 09:13:02AM +0200, Michael Kerrisk (man-pages) wrote:
> On 09/04/2015 10:41 PM, Kees Cook wrote:
> > On Fri, Sep 4, 2015 at 9:04 AM, Tycho Andersen
> > <tycho.ander...@canonical.com> wrote:
> >> This is the final bit needed to support seccomp filters created via the bpf
> >> syscall.
> 
> Hmm. Thanks Kees, for CCinf linux-api@. That really should have been done at
> the outset.

Apologies, I'll cc the list on future versions.

> Tycho, where's the man-pages patch describing this new kernel-userspace
> API feature? :-)

Once we get the API finalized I'm happy to write it.

> >> One concern with this patch is exactly what the interface should look like
> >> for users, since seccomp()'s second argument is a pointer, we could ask
> >> people to pass a pointer to the fd, but implies we might write to it which
> >> seems impolite. Right now we cast the pointer (and force the user to cast
> >> it), which generates ugly warnings. I'm not sure what the right answer is
> >> here.
> >>
> >> Signed-off-by: Tycho Andersen <tycho.ander...@canonical.com>
> >> CC: Kees Cook <keesc...@chromium.org>
> >> CC: Will Drewry <w...@chromium.org>
> >> CC: Oleg Nesterov <o...@redhat.com>
> >> CC: Andy Lutomirski <l...@amacapital.net>
> >> CC: Pavel Emelyanov <xe...@parallels.com>
> >> CC: Serge E. Hallyn <serge.hal...@ubuntu.com>
> >> CC: Alexei Starovoitov <a...@kernel.org>
> >> CC: Daniel Borkmann <dan...@iogearbox.net>
> >> ---
> >>  include/linux/seccomp.h  |  3 +-
> >>  include/uapi/linux/seccomp.h |  1 +
> >>  kernel/seccomp.c | 70 
> >> 
> >>  3 files changed, 61 insertions(+), 13 deletions(-)
> >>
> >> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> >> index d1a86ed..a725dd5 100644
> >> --- a/include/linux/seccomp.h
> >> +++ b/include/linux/seccomp.h
> >> @@ -3,7 +3,8 @@
> >>
> >>  #include 
> >>
> >> -#define SECCOMP_FILTER_FLAG_MASK   (SECCOMP_FILTER_FLAG_TSYNC)
> >> +#define SECCOMP_FILTER_FLAG_MASK   (\
> >> +   SECCOMP_FILTER_FLAG_TSYNC | SECCOMP_FILTER_FLAG_EBPF)
> >>
> >>  #ifdef CONFIG_SECCOMP
> >>
> >> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> >> index 0f238a4..c29a423 100644
> >> --- a/include/uapi/linux/seccomp.h
> >> +++ b/include/uapi/linux/seccomp.h
> >> @@ -16,6 +16,7 @@
> >>
> >>  /* Valid flags for SECCOMP_SET_MODE_FILTER */
> >>  #define SECCOMP_FILTER_FLAG_TSYNC  1
> >> +#define SECCOMP_FILTER_FLAG_EBPF   (1 << 1)
> >>
> >>  /*
> >>   * All BPF programs must return a 32-bit value.
> >> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> >> index a2c5b32..9c6bea6 100644
> >> --- a/kernel/seccomp.c
> >> +++ b/kernel/seccomp.c
> >> @@ -355,17 +355,6 @@ static struct seccomp_filter 
> >> *seccomp_prepare_filter(struct sock_fprog *fprog)
> >>
> >> BUG_ON(INT_MAX / fprog->len < sizeof(struct sock_filter));
> >>
> >> -   /*
> >> -* Installing a seccomp filter requires that the task has
> >> -* CAP_SYS_ADMIN in its namespace or be running with no_new_privs.
> >> -* This avoids scenarios where unprivileged tasks can affect the
> >> -* behavior of privileged children.
> >> -*/
> >> -   if (!task_no_new_privs(current) &&
> >> -   security_capable_noaudit(current_cred(), current_user_ns(),
> >> -CAP_SYS_ADMIN) != 0)
> >> -   return ERR_PTR(-EACCES);
> >> -
> >> /* Allocate a new seccomp_filter */
> >> sfilter = kzalloc(sizeof(*sfilter), GFP_KERNEL | __GFP_NOWARN);
> >> if (!sfilter)
> >> @@ -509,6 +498,48 @@ static void seccomp_send_sigsys(int syscall, int 
> >> reason)
> >> info.si_syscall = syscall;
> >> force_sig_info(SIGSYS, , current);
> >>  }
> >> +
> >> +#ifdef CONFIG_BPF_SYSCALL
> >> +static struct seccomp_filter *seccomp_prepare_ebpf(const char __user 
> >> *filter)
> >> +{
> >> +   /* XXX: this cast generates a warning. should we make people pass 
> >> in
> >> +* , or is there some nicer way of doing this?
&g

Re: eBPF / seccomp globals?

2015-09-04 Thread Tycho Andersen

Hi all,

On Thu, Sep 03, 2015 at 08:17:05PM -0700, Alexei Starovoitov wrote:
> On Fri, Sep 04, 2015 at 01:01:20AM +, Michael Tirado wrote:
> > Hiyall,
> > 
> > I have created a seccomp white list filter for a program that launches
> > other less trustworthy programs.  It's working great so far, but I
> > have run into a little roadblock.  the launcher program needs to call
> > execve as it's final step, but that may not be present in the white
> > list.  I am wondering if there is any way to use some sort of global
> > variable that will be preserved between syscall filter calls so that I
> > can allow only one execve, if not present in white list by
> > incrementing a counter variable.
> > 
> > I see that in Documentation/networking/filter.txt one of the registers
> > is documented as being a pointer to struct sk_buff, in the seccomp
> > context this is a pointer to struct seccomp_data  instead, right?  and
> > the line about callee saved registers R6-R9  probably refers to them
> > being saved across calls within that filter, and not calls between
> > filters?
> 
> R6-R9 are the registered preserved across calls to helper functions
> within single program. They are not preserved across invocations
> of the same program. At the start of the program only R1 (pointer
> to context) is valid.
> The eBPF programs used for kprobes, sockets and TC can simulate
> global state via maps. Like a map of one element can have some
> 'struct globals { ... }' as a value in such map. Then programs
> can keep global state in there. If a key into such map is cpu_id,
> then such state becomes per-cpu global. Other tricks possible too.
> Unfortunately seccomp doesn't have access to eBPF yet
> (only classic BPF is supported), but, I believe, Tycho is
> working on adding eBPF to seccomp and criu of eBPF programs...

Indeed I am, however my patches don't have support for seccomp
programs using eBPF maps. I'm intending to post them later today, so
stay tuned.

Tycho
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/6] seccomp: make underlying bpf ref counted as well

2015-09-04 Thread Tycho Andersen

In the next patch, we're going to add a way to access the underlying
filters via bpf fds. This means that we need to ref-count both the
struct seccomp_filter objects and the struct bpf_prog objects separately,
in case a process dies but a filter is still referred to by another
process.

Additionally, we mark classic converted seccomp filters as seccomp eBPF
programs, since they are a subset of what is supported in seccomp eBPF.

Signed-off-by: Tycho Andersen <tycho.ander...@canonical.com>
CC: Kees Cook <keesc...@chromium.org>
CC: Will Drewry <w...@chromium.org>
CC: Oleg Nesterov <o...@redhat.com>
CC: Andy Lutomirski <l...@amacapital.net>
CC: Pavel Emelyanov <xe...@parallels.com>
CC: Serge E. Hallyn <serge.hal...@ubuntu.com>
CC: Alexei Starovoitov <a...@kernel.org>
CC: Daniel Borkmann <dan...@iogearbox.net>
---
 kernel/seccomp.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 5bd4779..acfe1fb 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -377,6 +377,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct 
sock_fprog *fprog)
}
 
atomic_set(>usage, 1);
+   atomic_set(>prog->aux->refcnt, 1);
+   sfilter->prog->type = BPF_PROG_TYPE_SECCOMP;
 
return sfilter;
 }
@@ -469,7 +471,7 @@ void get_seccomp_filter(struct task_struct *tsk)
 static inline void seccomp_filter_free(struct seccomp_filter *filter)
 {
if (filter) {
-   bpf_prog_free(filter->prog);
+   bpf_prog_put(filter->prog);
kfree(filter);
}
 }
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 3/6] ebpf: add a way to dump an eBPF program

2015-09-04 Thread Tycho Andersen

This commit adds a way to dump eBPF programs. The initial implementation
doesn't support maps, and therefore only allows dumping seccomp ebpf
programs which themselves don't currently support maps.

We export the GPL bit as well as a unique ID for the program so that
userspace can detect when two seccomp filters were inherited from each
other and clone the filter tree accordingly.

Signed-off-by: Tycho Andersen <tycho.ander...@canonical.com>
CC: Kees Cook <keesc...@chromium.org>
CC: Will Drewry <w...@chromium.org>
CC: Oleg Nesterov <o...@redhat.com>
CC: Andy Lutomirski <l...@amacapital.net>
CC: Pavel Emelyanov <xe...@parallels.com>
CC: Serge E. Hallyn <serge.hal...@ubuntu.com>
CC: Alexei Starovoitov <a...@kernel.org>
CC: Daniel Borkmann <dan...@iogearbox.net>
---
 include/uapi/linux/bpf.h | 15 +++
 kernel/bpf/syscall.c | 44 
 2 files changed, 59 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 79b825a..c5d8dc2 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -107,6 +107,13 @@ enum bpf_cmd {
 * returns fd or negative error
 */
BPF_PROG_LOAD,
+
+   /* dump an existing bpf
+* err = bpf(BPF_PROG_DUMP, union bpf_attr *attr, u32 size)
+* Using attr->prog_fd, attr->dump_insn_cnt, attr->dump_insns
+* returns zero or negative error
+*/
+   BPF_PROG_DUMP,
 };
 
 enum bpf_map_type {
@@ -160,6 +167,14 @@ union bpf_attr {
__aligned_u64   log_buf;/* user supplied buffer */
__u32   kern_version;   /* checked when 
prog_type=kprobe */
};
+
+   struct { /* anonymous struct used by BPF_PROG_DUMP command */
+   __u32   prog_fd;
+   __u32   dump_insn_cnt;
+   __aligned_u64   dump_insns; /* user supplied buffer */
+   __u8gpl_compatible;
+   __u64   prog_id;/* unique id for this prog */
+   };
 } __attribute__((aligned(8)));
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index a1b14d1..ee580d0 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -586,6 +586,47 @@ free_prog:
return err;
 }
 
+static int bpf_prog_dump(union bpf_attr *attr, union __user bpf_attr *uattr)
+{
+   int ufd = attr->prog_fd;
+   struct fd f = fdget(ufd);
+   struct bpf_prog *prog;
+   int ret = -EINVAL;
+
+   prog = get_prog(f);
+   if (IS_ERR(prog))
+   return PTR_ERR(prog);
+
+   /* For now, let's refuse to dump anything that isn't a seccomp program.
+* Other program types have support for maps, which our current dump
+* code doesn't support.
+*/
+   if (prog->type != BPF_PROG_TYPE_SECCOMP)
+   goto out;
+
+   ret = -EFAULT;
+   if (put_user(prog->len, >dump_insn_cnt))
+   goto out;
+
+   if (put_user((u8) prog->gpl_compatible, >gpl_compatible))
+   goto out;
+
+   if (put_user((u64) prog, >prog_id))
+   goto out;
+
+   if (attr->dump_insns) {
+   u32 len = prog->len * sizeof(struct bpf_insn);
+
+   if (copy_to_user(u64_to_ptr(attr->dump_insns),
+prog->insns, len) != 0)
+   goto out;
+   }
+
+   ret = 0;
+out:
+   return ret;
+}
+
 SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, 
size)
 {
union bpf_attr attr = {};
@@ -650,6 +691,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, 
uattr, unsigned int, siz
case BPF_PROG_LOAD:
err = bpf_prog_load();
break;
+   case BPF_PROG_DUMP:
+   err = bpf_prog_dump(, uattr);
+   break;
default:
err = -EINVAL;
break;
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 4/6] seccomp: add a way to access filters via bpf fds

2015-09-04 Thread Tycho Andersen

This patch adds a way for a process that is "real root" to access the
seccomp filters of another process. The process first does a
PTRACE_SECCOMP_GET_FILTER_FD to get an fd with that process' seccomp filter
attached, and then iterates on this with PTRACE_SECCOMP_NEXT_FILTER using
bpf(BPF_PROG_DUMP) to dump the actual program at each step.

Signed-off-by: Tycho Andersen <tycho.ander...@canonical.com>
CC: Kees Cook <keesc...@chromium.org>
CC: Will Drewry <w...@chromium.org>
CC: Oleg Nesterov <o...@redhat.com>
CC: Andy Lutomirski <l...@amacapital.net>
CC: Pavel Emelyanov <xe...@parallels.com>
CC: Serge E. Hallyn <serge.hal...@ubuntu.com>
CC: Alexei Starovoitov <a...@kernel.org>
CC: Daniel Borkmann <dan...@iogearbox.net>
---
 include/linux/bpf.h | 12 ++
 include/linux/seccomp.h | 14 +++
 include/uapi/linux/ptrace.h |  3 +++
 kernel/bpf/syscall.c| 26 -
 kernel/ptrace.c |  7 ++
 kernel/seccomp.c| 57 +
 6 files changed, 118 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 4383476..30682dc 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -157,6 +157,8 @@ void bpf_register_prog_type(struct bpf_prog_type_list *tl);
 void bpf_register_map_type(struct bpf_map_type_list *tl);
 
 struct bpf_prog *bpf_prog_get(u32 ufd);
+int bpf_prog_set(u32 ufd, struct bpf_prog *new);
+int bpf_new_fd(struct bpf_prog *prog, int flags);
 void bpf_prog_put(struct bpf_prog *prog);
 void bpf_prog_put_rcu(struct bpf_prog *prog);
 
@@ -175,6 +177,16 @@ static inline struct bpf_prog *bpf_prog_get(u32 ufd)
return ERR_PTR(-EOPNOTSUPP);
 }
 
+static inline int bpf_prog_set(u32 ufd, struct bpf_prog *new)
+{
+   return -EINVAL;
+}
+
+static inline int bpf_new_fd(struct bpf_prog *prog, int flags)
+{
+   return -EINVAL;
+}
+
 static inline void bpf_prog_put(struct bpf_prog *prog)
 {
 }
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index f426503..d1a86ed 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -95,4 +95,18 @@ static inline void get_seccomp_filter(struct task_struct 
*tsk)
return;
 }
 #endif /* CONFIG_SECCOMP_FILTER */
+
+#if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_CHECKPOINT_RESTORE)
+extern long seccomp_get_filter_fd(struct task_struct *child);
+extern long seccomp_next_filter(struct task_struct *child, u32 fd);
+#else
+static inline long seccomp_get_filter_fd(struct task_struct *child)
+{
+   return -EINVAL;
+}
+static inline long seccomp_next_filter(struct task_struct *child, u32 fd)
+{
+   return -EINVAL;
+}
+#endif /* CONFIG_SECCOMP_FILTER && CONFIG_CHECKPOINT_RESTORE */
 #endif /* _LINUX_SECCOMP_H */
diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
index a7a6979..dfd7d2e 100644
--- a/include/uapi/linux/ptrace.h
+++ b/include/uapi/linux/ptrace.h
@@ -23,6 +23,9 @@
 
 #define PTRACE_SYSCALL   24
 
+#define PTRACE_SECCOMP_GET_FILTER_FD   40
+#define PTRACE_SECCOMP_NEXT_FILTER 41
+
 /* 0x4200-0x4300 are reserved for architecture-independent additions.  */
 #define PTRACE_SETOPTIONS  0x4200
 #define PTRACE_GETEVENTMSG 0x4201
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index ee580d0..58e7421 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -506,6 +506,30 @@ struct bpf_prog *bpf_prog_get(u32 ufd)
 }
 EXPORT_SYMBOL_GPL(bpf_prog_get);
 
+int bpf_prog_set(u32 ufd, struct bpf_prog *new)
+{
+   struct fd f;
+   struct bpf_prog *prog;
+
+   f = fdget(ufd);
+
+   prog = get_prog(f);
+   if (!IS_ERR(prog) && prog)
+   bpf_prog_put(prog);
+
+   atomic_inc(>aux->refcnt);
+   f.file->private_data = new;
+   fdput(f);
+   return 0;
+}
+EXPORT_SYMBOL_GPL(bpf_prog_set);
+
+int bpf_new_fd(struct bpf_prog *prog, int flags)
+{
+   return anon_inode_getfd("bpf-prog", _prog_fops, prog, flags);
+}
+EXPORT_SYMBOL_GPL(bpf_new_fd);
+
 /* last field in 'union bpf_attr' used by this command */
 #defineBPF_PROG_LOAD_LAST_FIELD kern_version
 
@@ -572,7 +596,7 @@ static int bpf_prog_load(union bpf_attr *attr)
if (err < 0)
goto free_used_maps;
 
-   err = anon_inode_getfd("bpf-prog", _prog_fops, prog, O_RDWR | 
O_CLOEXEC);
+   err = bpf_new_fd(prog, O_RDWR | O_CLOEXEC);
if (err < 0)
/* failed to allocate fd */
goto free_used_maps;
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 787320d..4e4b534 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -1016,6 +1016,13 @@ int ptrace_request(struct task_struct *child, long 
request,
break;
}
 #endif
+
+   case PTRACE_SECCOMP_GET_FILTER_FD:
+   return seccomp_get_filter_fd(child);
+
+   case PT

[PATCH 5/6] seccomp: add a way to attach a filter via eBPF fd

2015-09-04 Thread Tycho Andersen

This is the final bit needed to support seccomp filters created via the bpf
syscall.

One concern with this patch is exactly what the interface should look like
for users, since seccomp()'s second argument is a pointer, we could ask
people to pass a pointer to the fd, but implies we might write to it which
seems impolite. Right now we cast the pointer (and force the user to cast
it), which generates ugly warnings. I'm not sure what the right answer is
here.

Signed-off-by: Tycho Andersen <tycho.ander...@canonical.com>
CC: Kees Cook <keesc...@chromium.org>
CC: Will Drewry <w...@chromium.org>
CC: Oleg Nesterov <o...@redhat.com>
CC: Andy Lutomirski <l...@amacapital.net>
CC: Pavel Emelyanov <xe...@parallels.com>
CC: Serge E. Hallyn <serge.hal...@ubuntu.com>
CC: Alexei Starovoitov <a...@kernel.org>
CC: Daniel Borkmann <dan...@iogearbox.net>
---
 include/linux/seccomp.h  |  3 +-
 include/uapi/linux/seccomp.h |  1 +
 kernel/seccomp.c | 70 
 3 files changed, 61 insertions(+), 13 deletions(-)

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index d1a86ed..a725dd5 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -3,7 +3,8 @@
 
 #include 
 
-#define SECCOMP_FILTER_FLAG_MASK   (SECCOMP_FILTER_FLAG_TSYNC)
+#define SECCOMP_FILTER_FLAG_MASK   (\
+   SECCOMP_FILTER_FLAG_TSYNC | SECCOMP_FILTER_FLAG_EBPF)
 
 #ifdef CONFIG_SECCOMP
 
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 0f238a4..c29a423 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -16,6 +16,7 @@
 
 /* Valid flags for SECCOMP_SET_MODE_FILTER */
 #define SECCOMP_FILTER_FLAG_TSYNC  1
+#define SECCOMP_FILTER_FLAG_EBPF   (1 << 1)
 
 /*
  * All BPF programs must return a 32-bit value.
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index a2c5b32..9c6bea6 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -355,17 +355,6 @@ static struct seccomp_filter 
*seccomp_prepare_filter(struct sock_fprog *fprog)
 
BUG_ON(INT_MAX / fprog->len < sizeof(struct sock_filter));
 
-   /*
-* Installing a seccomp filter requires that the task has
-* CAP_SYS_ADMIN in its namespace or be running with no_new_privs.
-* This avoids scenarios where unprivileged tasks can affect the
-* behavior of privileged children.
-*/
-   if (!task_no_new_privs(current) &&
-   security_capable_noaudit(current_cred(), current_user_ns(),
-CAP_SYS_ADMIN) != 0)
-   return ERR_PTR(-EACCES);
-
/* Allocate a new seccomp_filter */
sfilter = kzalloc(sizeof(*sfilter), GFP_KERNEL | __GFP_NOWARN);
if (!sfilter)
@@ -509,6 +498,48 @@ static void seccomp_send_sigsys(int syscall, int reason)
info.si_syscall = syscall;
force_sig_info(SIGSYS, , current);
 }
+
+#ifdef CONFIG_BPF_SYSCALL
+static struct seccomp_filter *seccomp_prepare_ebpf(const char __user *filter)
+{
+   /* XXX: this cast generates a warning. should we make people pass in
+* , or is there some nicer way of doing this?
+*/
+   u32 fd = (u32) filter;
+   struct seccomp_filter *ret;
+   struct bpf_prog *prog;
+
+   prog = bpf_prog_get(fd);
+   if (IS_ERR(prog))
+   return (struct seccomp_filter *) prog;
+
+   if (prog->type != BPF_PROG_TYPE_SECCOMP) {
+   bpf_prog_put(prog);
+   return ERR_PTR(-EINVAL);
+   }
+
+   ret = kzalloc(sizeof(*ret), GFP_KERNEL | __GFP_NOWARN);
+   if (!ret) {
+   bpf_prog_put(prog);
+   return ERR_PTR(-ENOMEM);
+   }
+
+   ret->prog = prog;
+   atomic_set(>usage, 1);
+
+   /* Intentionally don't bpf_prog_put() here, because the underlying prog
+* is refcounted too and we're holding a reference from the struct
+* seccomp_filter object.
+*/
+
+   return ret;
+}
+#else
+static struct seccomp_filter *seccomp_prepare_ebpf(const char __user *filter)
+{
+   return ERR_PTR(-EINVAL);
+}
+#endif
 #endif /* CONFIG_SECCOMP_FILTER */
 
 /*
@@ -775,8 +806,23 @@ static long seccomp_set_mode_filter(unsigned int flags,
if (flags & ~SECCOMP_FILTER_FLAG_MASK)
return -EINVAL;
 
+   /*
+* Installing a seccomp filter requires that the task has
+* CAP_SYS_ADMIN in its namespace or be running with no_new_privs.
+* This avoids scenarios where unprivileged tasks can affect the
+* behavior of privileged children.
+*/
+   if (!task_no_new_privs(current) &&
+   security_capable_noaudit(current_cred(), current_user_ns(),
+CAP_SYS_ADMIN) != 0)
+   return -EACCES;
+
/* Prepare the new filter before holding any locks. */
-

[PATCH 6/6] ebpf: allow BPF_REG_X in src_reg conditional jumps

2015-09-04 Thread Tycho Andersen

The classic converter generates conditional jumps with:

if (BPF_SRC(fp->code) == BPF_K && (int) fp->k < 0) {
...
} else {
insn->dst_reg = BPF_REG_A;
insn->src_reg = BPF_REG_X;
insn->imm = fp->k;
bpf_src = BPF_SRC(fp->code);
}

but here, we enforce that the src_reg == BPF_REG_0. We should also allow
BPF_REG_X since that's what the converter generates; this enables us to
load eBPF programs that were generated by the converter.

Signed-off-by: Tycho Andersen <tycho.ander...@canonical.com>
CC: Kees Cook <keesc...@chromium.org>
CC: Will Drewry <w...@chromium.org>
CC: Oleg Nesterov <o...@redhat.com>
CC: Andy Lutomirski <l...@amacapital.net>
CC: Pavel Emelyanov <xe...@parallels.com>
CC: Serge E. Hallyn <serge.hal...@ubuntu.com>
CC: Alexei Starovoitov <a...@kernel.org>
CC: Daniel Borkmann <dan...@iogearbox.net>
---
 kernel/bpf/verifier.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 039d866..2fff8cd 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1080,7 +1080,12 @@ static int check_cond_jmp_op(struct verifier_env *env,
if (err)
return err;
} else {
-   if (insn->src_reg != BPF_REG_0) {
+   switch (insn->src_reg) {
+   case BPF_REG_0:
+   /* the classic converter generates BPF_JMP with src_reg of X */
+   case BPF_REG_X:
+   break;
+   default:
verbose("BPF_JMP uses reserved fields\n");
return -EINVAL;
}
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/6] ebpf: add a seccomp program type

2015-09-04 Thread Tycho Andersen

seccomp uses eBPF as its underlying storage and execution format, and eBPF
has features that seccomp would like to make use of in the future. This
patch adds a formal seccomp type to the eBPF verifier.

The current implementation of the seccomp eBPF type is very limited, and
doesn't support some interesting features (notably, maps) of eBPF. However,
the primary motivation for this patchset is to enable checkpoint/restore
for seccomp filters later in the series, to this limited feature set is ok
for now.

Signed-off-by: Tycho Andersen <tycho.ander...@canonical.com>
CC: Kees Cook <keesc...@chromium.org>
CC: Will Drewry <w...@chromium.org>
CC: Oleg Nesterov <o...@redhat.com>
CC: Andy Lutomirski <l...@amacapital.net>
CC: Pavel Emelyanov <xe...@parallels.com>
CC: Serge E. Hallyn <serge.hal...@ubuntu.com>
CC: Alexei Starovoitov <a...@kernel.org>
CC: Daniel Borkmann <dan...@iogearbox.net>
---
 include/uapi/linux/bpf.h |  1 +
 net/core/filter.c| 95 
 2 files changed, 96 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 29ef6f9..79b825a 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -122,6 +122,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_KPROBE,
BPF_PROG_TYPE_SCHED_CLS,
BPF_PROG_TYPE_SCHED_ACT,
+   BPF_PROG_TYPE_SECCOMP,
 };
 
 #define BPF_PSEUDO_MAP_FD  1
diff --git a/net/core/filter.c b/net/core/filter.c
index be3098f..ed339fa 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1466,6 +1466,39 @@ tc_cls_act_func_proto(enum bpf_func_id func_id)
}
 }
 
+static const struct bpf_func_proto *
+seccomp_func_proto(enum bpf_func_id func_id)
+{
+   /* Right now seccomp eBPF loading doesn't support maps; seccomp filters
+* are considered to be read-only after they're installed, so map fds
+* probably need to be invalidated when a seccomp filter with maps is
+* installed.
+*
+* The rest of these might be reasonable to call from seccomp, so we
+* export them.
+*/
+   switch (func_id) {
+   case BPF_FUNC_ktime_get_ns:
+   return _ktime_get_ns_proto;
+   case BPF_FUNC_trace_printk:
+   return bpf_get_trace_printk_proto();
+   case BPF_FUNC_get_prandom_u32:
+   return _get_prandom_u32_proto;
+   case BPF_FUNC_get_smp_processor_id:
+   return _get_smp_processor_id_proto;
+   case BPF_FUNC_tail_call:
+   return _tail_call_proto;
+   case BPF_FUNC_get_current_pid_tgid:
+   return _get_current_pid_tgid_proto;
+   case BPF_FUNC_get_current_uid_gid:
+   return _get_current_uid_gid_proto;
+   case BPF_FUNC_get_current_comm:
+   return _get_current_comm_proto;
+   default:
+   return NULL;
+   }
+}
+
 static bool __is_valid_access(int off, int size, enum bpf_access_type type)
 {
/* check bounds */
@@ -1516,6 +1549,17 @@ static bool tc_cls_act_is_valid_access(int off, int size,
return __is_valid_access(off, size, type);
 }
 
+static bool seccomp_is_valid_access(int off, int size,
+   enum bpf_access_type type)
+{
+   if (type == BPF_WRITE)
+   return false;
+
+   if (off < 0 || off >= sizeof(struct seccomp_data) || off & 3)
+   return false;
+
+   return true;
+}
 static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
  int src_reg, int ctx_off,
  struct bpf_insn *insn_buf)
@@ -1630,6 +1674,45 @@ static u32 bpf_net_convert_ctx_access(enum 
bpf_access_type type, int dst_reg,
return insn - insn_buf;
 }
 
+static u32 seccomp_convert_ctx_access(enum bpf_access_type type, int dst_reg,
+ int src_reg, int ctx_off,
+ struct bpf_insn *insn_buf)
+{
+   struct bpf_insn *insn = insn_buf;
+
+   switch (ctx_off) {
+   case offsetof(struct seccomp_data, nr):
+   BUILD_BUG_ON(FIELD_SIZEOF(struct seccomp_data, nr) != 4);
+
+   *insn++ = BPF_LDX_MEM(BPF_W, dst_reg, src_reg, ctx_off);
+   break;
+
+   case offsetof(struct seccomp_data, arch):
+   BUILD_BUG_ON(FIELD_SIZEOF(struct seccomp_data, arch) != 4);
+
+   *insn++ = BPF_LDX_MEM(BPF_W, dst_reg, src_reg, ctx_off);
+   break;
+
+   case offsetof(struct seccomp_data, instruction_pointer):
+   BUILD_BUG_ON(FIELD_SIZEOF(struct seccomp_data,
+ instruction_pointer) != 8);
+
+   *insn++ = BPF_LDX_MEM(BPF_DW, dst_reg, src_reg, ctx_off);
+   break;
+
+   default:
+   if (ctx_off & 7 ||
+   ctx_off < offs

c/r of seccomp filters via underlying eBPF

2015-09-04 Thread Tycho Andersen

Hi all,

Here is a set that enables checkpoint restore of the underlying eBPF programs
that power seccomp filters via the API we discussed several months ago. A few
notes:

* We expose prog_id in the ebpf dump as the pointer to the ebpf program in
  kernel memory, since this is unique. I'm not sure if this is safe. The goal
  of exposing prog_id is to enable userspace to figure out how the seccomp
  filters are inherited (i.e. is the filter a result of a fork() or an
  identical filter installed). This is relevant because
  SECCOMP_FILTER_FLAG_TSYNC depends on this ancestry.

* I'm not sure what the fd argument to seccomp() should look like in the
  SECCOMP_FILTER_FLAG_EBPF case: it seems ugly either as  or fd; any
  thoughts on this are welcome.

* Please double check the first patch. I'm not too framiliar with eBPF, so this
  is mostly monkey see monkey do.

* The last patch is something I discovered while testing this set (I couldn't
  re-import some filters). I'm not sure if this is indicitave of a wider bug or
  if the fix is even correct but it Works For Me.

Thoughts welcome,

Tycho

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/6] seccomp: add a way to access filters via bpf fds

2015-09-04 Thread Tycho Andersen

On Fri, Sep 04, 2015 at 01:29:49PM -0700, Alexei Starovoitov wrote:
> On Fri, Sep 04, 2015 at 01:26:42PM -0700, Kees Cook wrote:
> > On Fri, Sep 4, 2015 at 9:04 AM, Tycho Andersen
> > <tycho.ander...@canonical.com> wrote:
> > > This patch adds a way for a process that is "real root" to access the
> > > seccomp filters of another process. The process first does a
> > > PTRACE_SECCOMP_GET_FILTER_FD to get an fd with that process' seccomp 
> > > filter
> > > attached, and then iterates on this with PTRACE_SECCOMP_NEXT_FILTER using
> > > bpf(BPF_PROG_DUMP) to dump the actual program at each step.
> > 
> > Why is this a new ptrace interface instead of a new seccomp interface?
> > I would expect this to only be valid for "current", otherwise we could
> > run into races as the ptracee adds filters.

The task has to be in the stopped state in order for the filters to be
inspected, so I think this isn't a problem.

> > i.e. it is not safe to
> > examine seccomp filters from tasks other than current.
> 
> same question.
> I thought we discussed to add a command to seccomp() syscall for that?

We did initially, although at the end Andy posted about not allowing
tasks to see some filters that had been installed:

On Mon, 10 Aug 2015 14:36:09 -0700 Andy Lutomirski
<l...@amacapital.net> wrote:
> On Mon, Aug 10, 2015 at 2:31 PM, Tycho Andersen
> <tycho.ander...@canonical.com> wrote:
> > On Mon, Aug 10, 2015 at 02:18:29PM -0700, Kees Cook wrote:
> >> On Sun, Aug 9, 2015 at 7:20 PM, Andy Lutomirski
> >> <l...@amacapital.net> wrote:
> >> > On Fri, Aug 7, 2015 at 10:59 AM, Tycho Andersen
> >> > <tycho.ander...@canonical.com> wrote:
> >> >> On Fri, Aug 07, 2015 at 10:36:02AM -0700, Andy Lutomirski wrote:
> >> >>> On Fri, Aug 7, 2015 at 8:45 AM, Tycho Andersen
> >> >>> <tycho.ander...@canonical.com> wrote:
> >> >>> >
> >> >>> > fd = seccomp(SECCOMP_GET_FILTER_FD);
> >> >>>
> >> >>>
> >> >
> >> > Let me propose something crazy.  First, some scenarios:
> >> >
> >> > Scenario A: A program runs in a sandbox.  It tries to interrogate
> >> > its
> >> > seccomp filter.  I think it would be fine, and maybe even
> >> > beneficial,
> >> > if this didn't work (or pretended there was no filter).
> >> >
> >> > Scenario B: A shell runs in a sandbox.  The user fires up some
> >> > program
> >> > in that shell and then, *still in that shell*, checkpoints it
> >> > with
> >> > CRIU.  I think CRIU shouldn't see the seccomp filter.
> >> >
> >> > Scenario C: A program runs in a sandbox.  CRIU, *in a different
> >> > sandbox*, tries to checkpoint that program.  IMO this is doomed
> >> > to
> >> > eventual failure no matter what the kernel does: restoring the
> >> > program
> >> > isn't going to work as expected.
> >> >
> >> > Scenario D: A shell runs in a sandbox.  Someone fires up a
> >> > seccomp-using program (with in inner filter) under that shell and
> >> > then
> >> > separately checkpoints it from inside the shell.  I think CRIU
> >> > should
> >> > see the inner filter but not the outer filter.
> >> >
> >> > So here's my crazy proposal: If process A tries to read process
> >> > B's
> >> > seccomp state, the kernel could check whether B's seccomp state
> >> > is a
> >> > descendent of A's.  If so, then the attempt to read B's seccomp
> >> > state
> >> > should see only the part that stricly descends from A's state.
> >> > (If
> >> > B's and A's states are the same under a referential equality
> >> > test,
> >> > then this means A should see no seccomp state at all).  If, on
> >> > the
> >> > other hand, B's state is not a descendent of A's state, then the
> >> > read
> >> > call should fail.
> >> >
> >> > Thoughts?
> >> >
> >> > At the very least, this means that no one ever needs to worry
> >> > about
> >> > preventing seccomp state reads in their seccomp filter program,
> >> > since
> >> > the filter is invisible from inside the sandbox using that
> >> > filter.
> >>
> >> I don't oppose this idea, but I think our first pass at CRIU can
> >>

Re: [PATCH 1/6] ebpf: add a seccomp program type

2015-09-04 Thread Tycho Andersen

On Fri, Sep 04, 2015 at 01:34:12PM -0700, Kees Cook wrote:
> On Fri, Sep 4, 2015 at 9:04 AM, Tycho Andersen
> <tycho.ander...@canonical.com> wrote:
> > +static const struct bpf_func_proto *
> > +seccomp_func_proto(enum bpf_func_id func_id)
> > +{
> > +   /* Right now seccomp eBPF loading doesn't support maps; seccomp 
> > filters
> > +* are considered to be read-only after they're installed, so map 
> > fds
> > +* probably need to be invalidated when a seccomp filter with maps 
> > is
> > +* installed.
> > +*
> > +* The rest of these might be reasonable to call from seccomp, so we
> > +* export them.
> > +*/
> > +   switch (func_id) {
> > +   case BPF_FUNC_ktime_get_ns:
> > +   return _ktime_get_ns_proto;
> > +   case BPF_FUNC_trace_printk:
> > +   return bpf_get_trace_printk_proto();
> > +   case BPF_FUNC_get_prandom_u32:
> > +   return _get_prandom_u32_proto;
> > +   case BPF_FUNC_get_smp_processor_id:
> > +   return _get_smp_processor_id_proto;
> > +   case BPF_FUNC_tail_call:
> > +   return _tail_call_proto;
> > +   case BPF_FUNC_get_current_pid_tgid:
> > +   return _get_current_pid_tgid_proto;
> > +   case BPF_FUNC_get_current_uid_gid:
> > +   return _get_current_uid_gid_proto;
> > +   case BPF_FUNC_get_current_comm:
> > +   return _get_current_comm_proto;
> > +   default:
> > +   return NULL;
> > +   }
> > +}
> 
> While this list is probably fine, I don't want to mix the addition of
> eBPF functions to the seccomp ABI with the CRIU changes. No function
> calls are currently possible and it should stay that way.

Ok, I can remove them.

> I was expecting to see a validator, similar to the existing BPF
> validator that is called when creating seccomp filters currently. Can
> we add a similar validator for new BPF_PROG_TYPE_SECCOMP?

That's effectively what this patch does; when the eBPF is loaded via
bpf(), you tell bpf() you want a BPF_PROG_TYPE_SECCOMP, and it invokes
this validation/translation code, i.e. it uses
seccomp_is_valid_access() to check and make sure access are aligned
and inside struct seccomp_data.

Tycho
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/6] ebpf: add a way to dump an eBPF program

2015-09-04 Thread Tycho Andersen

On Fri, Sep 04, 2015 at 01:17:30PM -0700, Kees Cook wrote:
> On Fri, Sep 4, 2015 at 9:04 AM, Tycho Andersen
> <tycho.ander...@canonical.com> wrote:
> > This commit adds a way to dump eBPF programs. The initial implementation
> > doesn't support maps, and therefore only allows dumping seccomp ebpf
> > programs which themselves don't currently support maps.
> >
> > We export the GPL bit as well as a unique ID for the program so that
> 
> This unique ID appears to be the heap address for the prog. That's a
> huge leak, and should not be done. We don't want to introduce new
> kernel address leaks while we're trying to fix the remaining ones.
> Shouldn't the "unique ID" be the fd itself? I imagine KCMP_FILE
> could be used, for example.

No; we acquire the fd per process, so if a task installs a filter and
then forks N times, we'll grab N (+1) copies of the filter from N (+1)
different file descriptors. Ideally, we'd have some way to figure out
that these were all the same. Some sort of prog_id is one way,
although there may be others.

Tycho
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/6] ebpf: add a seccomp program type

2015-09-04 Thread Tycho Andersen

On Fri, Sep 04, 2015 at 01:17:47PM -0700, Alexei Starovoitov wrote:
> On Fri, Sep 04, 2015 at 10:04:19AM -0600, Tycho Andersen wrote:
> > seccomp uses eBPF as its underlying storage and execution format, and eBPF
> > has features that seccomp would like to make use of in the future. This
> > patch adds a formal seccomp type to the eBPF verifier.
> > 
> > The current implementation of the seccomp eBPF type is very limited, and
> > doesn't support some interesting features (notably, maps) of eBPF. However,
> > the primary motivation for this patchset is to enable checkpoint/restore
> > for seccomp filters later in the series, to this limited feature set is ok
> > for now.
> 
> yes. good compromise to start.
> 
> > +static const struct bpf_func_proto *
> > +seccomp_func_proto(enum bpf_func_id func_id)
> > +{
> > +   /* Right now seccomp eBPF loading doesn't support maps; seccomp filters
> > +* are considered to be read-only after they're installed, so map fds
> > +* probably need to be invalidated when a seccomp filter with maps is
> > +* installed.
> 
> Just disabling bpf_map_lookup/update() helpers (the way you did here)
> is enough. The prorgram can still have references to maps, but since they
> won't be accessed it's safe.
> 
> > +*
> > +* The rest of these might be reasonable to call from seccomp, so we
> > +* export them.
> > +*/
> > +   switch (func_id) {
> > +   case BPF_FUNC_ktime_get_ns:
> > +   return _ktime_get_ns_proto;
> > +   case BPF_FUNC_trace_printk:
> > +   return bpf_get_trace_printk_proto();
> > +   case BPF_FUNC_get_prandom_u32:
> > +   return _get_prandom_u32_proto;
> > +   case BPF_FUNC_get_smp_processor_id:
> > +   return _get_smp_processor_id_proto;
> > +   case BPF_FUNC_tail_call:
> > +   return _tail_call_proto;
> > +   case BPF_FUNC_get_current_pid_tgid:
> > +   return _get_current_pid_tgid_proto;
> > +   case BPF_FUNC_get_current_uid_gid:
> > +   return _get_current_uid_gid_proto;
> > +   case BPF_FUNC_get_current_comm:
> > +   return _get_current_comm_proto;
> 
> the list looks good to start with.
> 
> >  
> > +static u32 seccomp_convert_ctx_access(enum bpf_access_type type, int 
> > dst_reg,
> > + int src_reg, int ctx_off,
> > + struct bpf_insn *insn_buf)
> > +{
> > +   struct bpf_insn *insn = insn_buf;
> > +
> > +   switch (ctx_off) {
> > +   case offsetof(struct seccomp_data, nr):
> 
> the conversion of seccomp_data fields is unnecessary.
> We're doing conversion for sk_buff, because sk_buff and __sk_buff aree two
> different structures. __sk_buff is user ABI with its own fields that losely
> correspond to in-kernel struct sk_buff.
> seccomp_data is already part of user ABI, so it's ok to access as-is.

Ok, I noticed this but somehow didn't put it all together. I'll axe
this for the next version, thanks.

Tycho
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/6] ebpf: add a way to dump an eBPF program

2015-09-04 Thread Tycho Andersen

Hi Alexei,

On Fri, Sep 04, 2015 at 01:27:05PM -0700, Alexei Starovoitov wrote:
> On Fri, Sep 04, 2015 at 10:04:21AM -0600, Tycho Andersen wrote:
> > This commit adds a way to dump eBPF programs. The initial implementation
> > doesn't support maps, and therefore only allows dumping seccomp ebpf
> > programs which themselves don't currently support maps.
> > 
> > 
> > Signed-off-by: Tycho Andersen <tycho.ander...@canonical.com>
> > CC: Kees Cook <keesc...@chromium.org>
> > CC: Will Drewry <w...@chromium.org>
> > CC: Oleg Nesterov <o...@redhat.com>
> > CC: Andy Lutomirski <l...@amacapital.net>
> > CC: Pavel Emelyanov <xe...@parallels.com>
> > CC: Serge E. Hallyn <serge.hal...@ubuntu.com>
> > CC: Alexei Starovoitov <a...@kernel.org>
> > CC: Daniel Borkmann <dan...@iogearbox.net>
> > ---
> >  include/uapi/linux/bpf.h | 15 +++
> >  kernel/bpf/syscall.c | 44 
> >  2 files changed, 59 insertions(+)
> > 
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 79b825a..c5d8dc2 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -107,6 +107,13 @@ enum bpf_cmd {
> >  * returns fd or negative error
> >  */
> > BPF_PROG_LOAD,
> > +
> > +   /* dump an existing bpf
> > +* err = bpf(BPF_PROG_DUMP, union bpf_attr *attr, u32 size)
> > +* Using attr->prog_fd, attr->dump_insn_cnt, attr->dump_insns
> > +* returns zero or negative error
> > +*/
> > +   BPF_PROG_DUMP,
> >  };
> >  
> >  enum bpf_map_type {
> > @@ -160,6 +167,14 @@ union bpf_attr {
> > __aligned_u64   log_buf;/* user supplied buffer */
> > __u32   kern_version;   /* checked when 
> > prog_type=kprobe */
> > };
> > +
> > +   struct { /* anonymous struct used by BPF_PROG_DUMP command */
> > +   __u32   prog_fd;
> > +   __u32   dump_insn_cnt;
> > +   __aligned_u64   dump_insns; /* user supplied buffer */
> > +   __u8gpl_compatible;
> > +   __u64   prog_id;/* unique id for this prog */
> > +   };
> 
> my first reaction was to may be reuse existing struct used to load,
> but I guess it's actually cleaner to have a new one like you did.
> though prog_fd looks redundant and prog_id is ...

prog_fd is input here, the rest are outputs.

> > +   if (put_user((u64) prog, >prog_id))
> > +   goto out;
> 
> .. is definitely not secure.
> 
> > We export the GPL bit as well as a unique ID for the program so that
> > userspace can detect when two seccomp filters were inherited from each
> > other and clone the filter tree accordingly.
> 
> you mean that in-kernel prog pointer is the same?
> I think user space can memcmp insns of programs instead?
> Are you trying to solve the case when parent has an FD for bpf program
> and child has another FD that points to the same program, and both
> doing dump and need to coordinate?

Yes, exactly. If we just do a memcmp(), two users can install the same
filter and have a different inheritance model on checkpoint vs
restore. This means that a checkpoint/restore'd process may see
different behavior when using SECCOMP_FILTER_FLAG_TSYNC in the future.

I'm not entirely clear on how much of a problem this actually is, and
perhaps it is too small to be worth worry about, but if there was
another way to export some unique id, that would be dandy.

Tycho
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/6] ebpf: add a way to dump an eBPF program

2015-09-04 Thread Tycho Andersen

On Fri, Sep 04, 2015 at 01:58:25PM -0700, Alexei Starovoitov wrote:
> On Fri, Sep 04, 2015 at 01:50:55PM -0700, Kees Cook wrote:
> > On Fri, Sep 4, 2015 at 1:45 PM, Tycho Andersen
> > <tycho.ander...@canonical.com> wrote:
> > > On Fri, Sep 04, 2015 at 01:17:30PM -0700, Kees Cook wrote:
> > >> On Fri, Sep 4, 2015 at 9:04 AM, Tycho Andersen
> > >> <tycho.ander...@canonical.com> wrote:
> > >> > This commit adds a way to dump eBPF programs. The initial 
> > >> > implementation
> > >> > doesn't support maps, and therefore only allows dumping seccomp ebpf
> > >> > programs which themselves don't currently support maps.
> > >> >
> > >> > We export the GPL bit as well as a unique ID for the program so that
> > >>
> > >> This unique ID appears to be the heap address for the prog. That's a
> > >> huge leak, and should not be done. We don't want to introduce new
> > >> kernel address leaks while we're trying to fix the remaining ones.
> > >> Shouldn't the "unique ID" be the fd itself? I imagine KCMP_FILE
> > >> could be used, for example.
> > >
> > > No; we acquire the fd per process, so if a task installs a filter and
> > > then forks N times, we'll grab N (+1) copies of the filter from N (+1)
> > > different file descriptors. Ideally, we'd have some way to figure out
> > > that these were all the same. Some sort of prog_id is one way,
> > > although there may be others.
> > 
> > If KCMP_FILE or a new KCMP_BPF isn't possible, then we'll probably
> > have to add a unique id (counter) to all bpf programs as they're
> > created.
> 
> I think tweaking KCMP_FILE for anon_inodes should do the trick
> and should work at the end (if it's not working already).

Sounds good. I'll look into that for the next version, thanks.

Tycho
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 6/6] ebpf: allow BPF_REG_X in src_reg conditional jumps

2015-09-04 Thread Tycho Andersen

Hi Alexei,

On Fri, Sep 04, 2015 at 02:06:19PM -0700, Alexei Starovoitov wrote:
> On Fri, Sep 04, 2015 at 10:04:24AM -0600, Tycho Andersen wrote:
> > The classic converter generates conditional jumps with:
> > 
> > if (BPF_SRC(fp->code) == BPF_K && (int) fp->k < 0) {
> > ...
> > } else {
> > insn->dst_reg = BPF_REG_A;
> > insn->src_reg = BPF_REG_X;
> > insn->imm = fp->k;
> > bpf_src = BPF_SRC(fp->code);
> > }
> > 
> > but here, we enforce that the src_reg == BPF_REG_0. We should also allow
> > BPF_REG_X since that's what the converter generates; this enables us to
> > load eBPF programs that were generated by the converter.
> 
> good catch. classic->extended converter is just being untidy.
> It shouldn't be populating unused 'src_reg' field when BPF_SRC == BPF_K
> verifier is doing the right thing. It's rejecting instructions that
> have junk in unused fields to make sure that someday we can extend it
> with something useful.
> The fix should be something like this:
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 13079f03902e..05a04ea87172 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -478,9 +478,9 @@ do_pass:
> bpf_src = BPF_X;
> } else {
> insn->dst_reg = BPF_REG_A;
> -   insn->src_reg = BPF_REG_X;
> insn->imm = fp->k;
> bpf_src = BPF_SRC(fp->code);
> +   insn->src_reg = bpf_src == BPF_X ? BPF_REG_X 
> : 0;
> }

Yep, I just tested this and it works for me. Do you want to manage it
or should I carry it as part of this set?

Tycho
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/6] ebpf: add a way to dump an eBPF program

2015-09-04 Thread Tycho Andersen

On Fri, Sep 04, 2015 at 02:48:03PM -0700, Andy Lutomirski wrote:
> On Fri, Sep 4, 2015 at 1:45 PM, Tycho Andersen
> <tycho.ander...@canonical.com> wrote:
> > On Fri, Sep 04, 2015 at 01:17:30PM -0700, Kees Cook wrote:
> >> On Fri, Sep 4, 2015 at 9:04 AM, Tycho Andersen
> >> <tycho.ander...@canonical.com> wrote:
> >> > This commit adds a way to dump eBPF programs. The initial implementation
> >> > doesn't support maps, and therefore only allows dumping seccomp ebpf
> >> > programs which themselves don't currently support maps.
> >> >
> >> > We export the GPL bit as well as a unique ID for the program so that
> >>
> >> This unique ID appears to be the heap address for the prog. That's a
> >> huge leak, and should not be done. We don't want to introduce new
> >> kernel address leaks while we're trying to fix the remaining ones.
> >> Shouldn't the "unique ID" be the fd itself? I imagine KCMP_FILE
> >> could be used, for example.
> >
> > No; we acquire the fd per process, so if a task installs a filter and
> > then forks N times, we'll grab N (+1) copies of the filter from N (+1)
> > different file descriptors. Ideally, we'd have some way to figure out
> > that these were all the same. Some sort of prog_id is one way,
> > although there may be others.
> 
> I disagree a bit.  I think we want the actual hierarchy to be a
> well-defined thing, because I have plans to make the hierarchy
> actually do something.  That means that we'll need to have a more
> exact way to dump the hierarchy than "these two filters are identical"
> or "these two filters are not identical".

Can you elaborate on what this would look like? I think with the
"these two filters are the same" primitive (the same in the sense that
they were inherited during a fork, not just that
memcmp(filter1->insns, filter2->insns) == 0) you can infer the entire
hierarchy, however clunky it may be to do so.

Another issue is that KCMP_FILE won't work in this case, as it
effectively compares the struct file *, which will be different since
we need to call anon_inode_getfd() for each call of
ptrace(PTRACE_SECCOMP_GET_FILTER_FD). We could add a KCMP_BPF (or just
a KCMP_FILE_PRIVATE_DATA, since that's effectively what it would be).
Does that make sense? [added Cyrill]

Tycho
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/6] ebpf: add a way to dump an eBPF program

2015-09-04 Thread Tycho Andersen

On Fri, Sep 04, 2015 at 04:08:53PM -0700, Andy Lutomirski wrote:
> On Fri, Sep 4, 2015 at 3:28 PM, Tycho Andersen
> <tycho.ander...@canonical.com> wrote:
> > On Fri, Sep 04, 2015 at 02:48:03PM -0700, Andy Lutomirski wrote:
> >> On Fri, Sep 4, 2015 at 1:45 PM, Tycho Andersen
> >> <tycho.ander...@canonical.com> wrote:
> >> > On Fri, Sep 04, 2015 at 01:17:30PM -0700, Kees Cook wrote:
> >> >> On Fri, Sep 4, 2015 at 9:04 AM, Tycho Andersen
> >> >> <tycho.ander...@canonical.com> wrote:
> >> >> > This commit adds a way to dump eBPF programs. The initial 
> >> >> > implementation
> >> >> > doesn't support maps, and therefore only allows dumping seccomp ebpf
> >> >> > programs which themselves don't currently support maps.
> >> >> >
> >> >> > We export the GPL bit as well as a unique ID for the program so that
> >> >>
> >> >> This unique ID appears to be the heap address for the prog. That's a
> >> >> huge leak, and should not be done. We don't want to introduce new
> >> >> kernel address leaks while we're trying to fix the remaining ones.
> >> >> Shouldn't the "unique ID" be the fd itself? I imagine KCMP_FILE
> >> >> could be used, for example.
> >> >
> >> > No; we acquire the fd per process, so if a task installs a filter and
> >> > then forks N times, we'll grab N (+1) copies of the filter from N (+1)
> >> > different file descriptors. Ideally, we'd have some way to figure out
> >> > that these were all the same. Some sort of prog_id is one way,
> >> > although there may be others.
> >>
> >> I disagree a bit.  I think we want the actual hierarchy to be a
> >> well-defined thing, because I have plans to make the hierarchy
> >> actually do something.  That means that we'll need to have a more
> >> exact way to dump the hierarchy than "these two filters are identical"
> >> or "these two filters are not identical".
> >
> > Can you elaborate on what this would look like? I think with the
> > "these two filters are the same" primitive (the same in the sense that
> > they were inherited during a fork, not just that
> > memcmp(filter1->insns, filter2->insns) == 0) you can infer the entire
> > hierarchy, however clunky it may be to do so.
> >
> > Another issue is that KCMP_FILE won't work in this case, as it
> > effectively compares the struct file *, which will be different since
> > we need to call anon_inode_getfd() for each call of
> > ptrace(PTRACE_SECCOMP_GET_FILTER_FD). We could add a KCMP_BPF (or just
> > a KCMP_FILE_PRIVATE_DATA, since that's effectively what it would be).
> > Does that make sense? [added Cyrill]
> >
> 
> I don't really know what it would look like.  I think we want a way to
> compare struct seccomp_filter pointers.

Not to complicate things further, but this brings up another
interesting issue. Right now, we require PT_SUSPEND_SECCOMP in order
to restore seccomp and do things afterwards (otherwise the seccomp
filters might kill whatever things the restore process is doing). If
we want the struct seccomp_filter pointers to be identical on restore,
that means we need to restore when we are real root, because bpf()
requires that we be real root. This means that we essentially need to
ptrace the entire restore, which we don't want to do.

In order to work around this, I was thinking we could change the
ancestry check slightly:

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 9c6bea6..efc3f36 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -239,7 +239,7 @@ static int is_ancestor(struct seccomp_filter *parent,
if (parent == NULL)
return 1;
for (; child; child = child->prev)
-   if (child == parent)
+   if (child->prog == parent->prog)
return 1;
return 0;
 }

so that we can do bpf() when we're real root, and just restore seccomp
at the very end. This would mean that the struct bpf_prog pointers are
shared, but the struct seccomp_filter pointers aren't.

Even assuming we have some sort of way to identify shared ancestry, we
still need something like the above in order to be able to restore it.
This sort of sucks; it would be ideal to to share struct
seccomp_filter *s too. We could do something like
seccomp(COPY_FROM_PARENT) or something, but given the struggles Kees
told me he had with getting SECCOMP_FILTER_FLAG_TSYNC right, I suspect
that won't fly.

> FWIW, I *hate* kcmp.  It might be worth trying to come up with a less
> awful way to do th

84 matches

Mail list logo