Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages

2018-12-04 Thread Andy Lutomirski
On Tue, Dec 4, 2018 at 12:02 PM Edgecombe, Rick P wrote:
>
> On Tue, 2018-12-04 at 16:03 +, Will Deacon wrote:
> > On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
> > > > On Nov 27, 2018, at 4:07 PM, Rick Edgecombe wrote:
> > > >
> > > > Since vfree will lazily flush the TLB, but not lazily free the
> > > > underlying pages, it often leaves stale TLB entries to freed pages
> > > > that could get re-used. This is undesirable for cases where the
> > > > memory being freed has special permissions such as executable.
> > >
> > > So I am trying to finish my patch-set for preventing transient W+X
> > > mappings from taking space, by handling kprobes & ftrace that I missed
> > > (thanks again for pointing it out).
> > >
> > > But all of a sudden, I don’t understand why we have the problem that
> > > this (your) patch-set deals with at all. We already change the mappings
> > > to make the memory writable before freeing the memory, so why can’t we
> > > make it non-executable at the same time? Actually, why do we make the
> > > module memory, including its data, executable before freeing it???
> >
> > Yeah, this is really confusing, but I have a suspicion it's a combination
> > of the various different configurations and hysterical raisins. We can't
> > rely on module_alloc() allocating from the vmalloc area (see nios2) nor
> > can we rely on disable_ro_nx() being available at build time.
> >
> > If we *could* rely on module allocations always using vmalloc(), then
> > we could pass in Rick's new flag and drop disable_ro_nx() altogether
> > afaict -- who cares about the memory attributes of a mapping that's about
> > to disappear anyway?
> >
> > Is it just nios2 that does something different?
> >
> > Will
>
> Yeah, it is really intertwined. I think for x86, set_memory_nx everywhere
> would solve it as well; in fact, that was what I first thought the solution
> should be until this was suggested. It's interesting that, in the other
> thread Masami Hiramatsu referenced, set_memory_nx was suggested last year
> and would have inadvertently blocked this on x86. But on the other
> architectures, I have since learned, it is a bit different.
>
> It looks like most archs don't actually re-define set_memory_*, and so all
> of the frob_* functions are just no-ops. In that case allocating RWX is
> needed to make it work at all, because that is what the allocation is going
> to stay at. So on these archs, set_memory_nx won't solve it, because it
> will do nothing.
>
> On x86 I think you cannot get rid of disable_ro_nx fully, because there is
> also the changing of the permissions on the directmap. You don't want some
> other caller getting a page that was left RO when freed and then trying to
> write to it, if I understand this correctly.
>

Exactly.

After slightly more thought, I suggest renaming VM_IMMEDIATE_UNMAP to
VM_MAY_ADJUST_PERMS or similar.  It would have the semantics you want,
but it would also call some arch hooks to put back the direct map
permissions before the flush.  Does that seem reasonable?  It would
need to be hooked up on the architectures that implement set_memory_ro(),
but that should be quite easy.  If nothing else, it could fall back to
set_memory_ro() in the absence of a better implementation.
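
To make this concrete, here is a rough sketch (mine, untested) of what the
vfree() path could do for such an allocation.  The hook name
set_direct_map_default_noflush() is a placeholder for whatever arch
interface ends up existing, and error handling is omitted:

static void vm_reset_perms_and_flush(struct vm_struct *area)
{
	unsigned long addr = (unsigned long)area->addr;
	int i;

	/* Make the vmalloc alias non-executable and writable again. */
	set_memory_nx(addr, area->nr_pages);
	set_memory_rw(addr, area->nr_pages);

	/* Reset the direct-map aliases to their default permissions, so
	 * whoever gets these pages next can safely write to them. */
	for (i = 0; i < area->nr_pages; i++)
		set_direct_map_default_noflush(area->pages[i]);

	/* Flush stale TLB entries before the pages are actually freed. */
	flush_tlb_kernel_range(addr, addr + get_vm_area_size(area));
}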


Re: [PATCH net-next v6 07/23] zinc: ChaCha20 ARM and ARM64 implementations

2018-09-27 Thread Andy Lutomirski



> On Sep 27, 2018, at 8:19 AM, Jason A. Donenfeld  wrote:
> 
> Hey again Thomas,
> 
>> On Thu, Sep 27, 2018 at 3:26 PM Jason A. Donenfeld  wrote:
>> 
>> Hi Thomas,
>> 
>> I'm trying to optimize this for crypto performance while still taking
>> into account preemption concerns. I'm having a bit of trouble figuring
>> out a way to determine numerically what the upper bounds for this
>> stuff look like. I'm sure I could pick a pretty sane number that's
>> arguably okay -- and way under the limit -- but I still am interested
>> in determining what that limit actually is. I was hoping there'd be a
>> debugging option called, "warn if preemption is disabled for too
>> long", or something, but I couldn't find anything like that. I'm also
>> not quite sure what the latency limits are, to just compute this with
>> a formula. Essentially what I'm trying to determine is:
>> 
>> preempt_disable();
>> asm volatile(".fill N, 1, 0x90;");
>> preempt_enable();
>> 
>> What is the maximum value of N for which the above is okay? What
>> technique would you generally use in measuring this?
>> 
>> Thanks,
>> Jason
> 
> From talking to Peter (now CC'd) on IRC, it sounds like what you're
> mostly interested in is clocktime latency on reasonable hardware, with
> a goal of around ~20µs as a maximum upper bound? I don't expect to get
> anywhere near this value at all, but if you can confirm that's a
> decent ballpark, it would make for some interesting calculations.
> 
> 

I would add another consideration: if you can get better latency with 
negligible overhead (0.1%? 0.05%), then that might make sense too. For example, 
it seems plausible that checking need_resched() every few blocks adds basically 
no overhead, and the SIMD helpers could do this themselves or perhaps only ever 
do a block at a time.

need_resched() costs a cacheline access, but it’s usually a hot cacheline, and 
the actual check is just whether a certain bit in memory is set.
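
To make the pattern concrete, a sketch of the chunked approach (the block
function and chunk size are made up for illustration; the real Zinc code
would use its own SIMD-context helpers rather than raw kernel_fpu_begin()):

static void chacha20_in_chunks(u32 *state, u8 *dst, const u8 *src, size_t len)
{
	kernel_fpu_begin();
	while (len) {
		/* A few blocks' worth of work per non-preemptible region. */
		size_t chunk = min_t(size_t, len, 4096);

		chacha20_simd_blocks(state, dst, src, chunk);	/* hypothetical */
		dst += chunk;
		src += chunk;
		len -= chunk;

		/* need_resched() is just a bit test on a usually-hot
		 * cacheline, so checking every few blocks is nearly free. */
		if (len && need_resched()) {
			kernel_fpu_end();	/* re-enable preemption... */
			cond_resched();		/* ...let the scheduler run... */
			kernel_fpu_begin();	/* ...and resume SIMD work */
		}
	}
	kernel_fpu_end();
}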

Re: [PATCH net-next v3 02/17] zinc: introduce minimal cryptography library

2018-09-16 Thread Andy Lutomirski
On Tue, Sep 11, 2018 at 4:57 PM David Miller  wrote:
>
> From: Andrew Lunn 
> Date: Wed, 12 Sep 2018 01:30:15 +0200
>
> > Just as an FYI:
> >
> > 1) I don't think anybody in netdev has taken a serious look at the
> > network code yet. There is little point until the controversial part
> > of the code, Zinc, has been sorted out.
> >
> > 2) I personally would be surprised if DaveM took this code without
> > having an Acked-by from the crypto subsystem people. In the same way,
> > i doubt the crypto people would take an Ethernet driver without having
> > DaveM's Acked-by.
>
> Both of Andrew's statements are completely true.
>
> I'm not looking at any of the networking bits until the crypto stuff
> is fully sorted and fully supported and Ack'd by crypto folks.

So, as a process question, whom exactly are we waiting for:

CRYPTO API
M:  Herbert Xu 
M:  "David S. Miller" 
L:  linux-cry...@vger.kernel.org

Herbert hasn't replied to any of these submissions.  You're the other
maintainer :)

To the extent that you (DaveM) want my ack, here's what I think of the
series so far:

The new APIs to the crypto primitives are good.  For code that wants
to do a specific known crypto operation, they are much, much more
pleasant to use than the existing crypto API.  The code cleanups in
the big keys patch speak for themselves IMO.  I have no problem with
the addition of a brand-new API to the kernel, especially when it's a
nice one like Zinc's, even if that API starts out with only a couple
of users.

Zinc's arrangement of arch code is just fine.  Sure, there are
arguments that putting arch-specific code in arch/ is better, but this
is mostly bikeshedding IMO.

There has been some discussion of the exact best way to handle
simd_relax() and some other minor nitpicks of API details.  This kind
of stuff doesn't need to block the series -- it can always be reworked
down the road if needed.

There are two things I don't like right now, though:

1. Zinc conflates the addition of a new API with the replacement of
some algorithm implementations.  This is problematic.  Look at the
recent benchmarks of ipsec before and after this series.  Apparently
big packets get faster and small packets get slower.  It would be
really nice to bisect the series to narrow down *where* the regression
came from, but, as currently structured, you can't.

The right way to do this is to rearrange the series.  First, the new
Zinc APIs should be added, and they should be backed with the
*existing* crypto code.  (If the code needs to be moved or copied to a
new location, so be it.  The patch will be messy because somehow the
Zinc API is going to have to dispatch to the arch-specific code, and
the way that the crypto API handles it is not exactly friendly to this
type of use.  So be it.)  Then another patch should switch the crypto
API to use the Zinc interface.  That patch, *by itself*, can be
benchmarked.  If it causes a regression for small ipsec packets, then
it can be tracked down relatively easily.  Once this is all done, the
actual crypto implementation can be changed, and that change can be
reviewed on its own merits.

As a simplistic example, I wrote this code a while back:

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=crypto/sha256_bpf&id=e9e12f056f2abed50a30b762db9185799f5864e6

and its two parents.  This series added a more Zinc-like API to
SHA256.  And it did it without replacing the SHA256 implementation.
Doing this for Zinc would be a bit more complicated, since the arch
code would need to be invoked too, but it should be doable.
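
To sketch what that first step could look like (function names are
illustrative, not the real Zinc or crypto API; the point is only that the
new entry point dispatches to the *existing* implementations):

void zinc_chacha20(u8 *dst, const u8 *src, size_t len,
		   const u8 key[32], const u8 nonce[12])
{
	/* New API, old code: the actual implementation swap comes later,
	 * as its own patch that can be benchmarked in isolation. */
	if (may_use_simd())
		existing_chacha20_arch(dst, src, len, key, nonce);
	else
		existing_chacha20_generic(dst, src, len, key, nonce);
}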

FWIW, Wireguard should not actually depend on the replacement of the
crypto implementation.

2. The new Zinc crypto implementations look like they're brand new.  I
realize that they have some history, some of them are derived from
OpenSSL, etc, but none of this is really apparent in the patches
themselves.  It would be great if the kernel could literally share the
same code as something like OpenSSL, since OpenSSL gets much more
attention than the kernel's crypto.  Failing that, it would be nice if
the patches made it more clear how the code differs from its origin.
At the very least, though, if the replacement of the crypto code were,
as above, a patch that just replaced the crypto code, it would be much
easier to review and benchmark intelligently.

--Andy


Re: [PATCH v2 net-next 0/2] tcp: mmap: rework zerocopy receive

2018-04-26 Thread Andy Lutomirski
At the risk of further muddying the waters, there's another minor tweak
that could improve performance on certain workloads.  Currently you mmap()
a range for a given socket and then getsockopt() to receive.  If you made
it so you could mmap() something once for any number of sockets (by
mmapping /dev/misc/tcp_zero_receive or whatever), then the performance of
the getsockopt() bit would be identical, but you could release the mapping
for many sockets at once with only a single flush.  For some use cases,
this could be a big win.

You could also add this later easily enough, too.
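
To illustrate, usage might look roughly like this (the device path is the
hypothetical one above, and the per-socket receive call is whatever form the
zerocopy receive operation ends up taking):

	/* One mapping, shared by any number of sockets. */
	int zc = open("/dev/misc/tcp_zero_receive", O_RDONLY);
	char *area = mmap(NULL, AREA_SIZE, PROT_READ, MAP_SHARED, zc, 0);

	/* Each socket's zerocopy receive targets a slice of "area",
	 * exactly as in the one-mapping-per-socket scheme. */

	/* ...and a single munmap()/TLB flush covers all of them. */
	munmap(area, AREA_SIZE);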


Re: [PATCH net-next 1/2] tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive

2018-04-25 Thread Andy Lutomirski
On Wed, Apr 25, 2018 at 9:04 AM, Matthew Wilcox <wi...@infradead.org> wrote:
> On Wed, Apr 25, 2018 at 06:01:02AM -0700, Eric Dumazet wrote:
>> On 04/24/2018 11:28 PM, Christoph Hellwig wrote:
>> > On Tue, Apr 24, 2018 at 10:27:21PM -0700, Eric Dumazet wrote:
>> >> When adding tcp mmap() implementation, I forgot that socket lock
>> >> had to be taken before current->mm->mmap_sem. syzbot eventually caught
>> >> the bug.
>> >>
>> >> Since we can not lock the socket in tcp mmap() handler we have to
>> >> split the operation in two phases.
>> >>
>> >> 1) mmap() on a tcp socket simply reserves VMA space, and nothing else.
>> >>   This operation does not involve any TCP locking.
>> >>
>> >> 2) setsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) implements
>> >>  the transfer of pages from skbs to one VMA.
>> >>   This operation only uses down_read(&current->mm->mmap_sem) after
>> >>   holding TCP lock, thus solving the lockdep issue.
>> >>
>> >> This new implementation was suggested by Andy Lutomirski with great 
>> >> details.
>> >
>> > Thanks, this looks much more sensible to me.
>> >
>>
>> Thanks Christoph
>>
>> Note the high cost of zap_page_range(), needed to avoid -EBUSY being
>> returned from vm_insert_page() the second time TCP_ZEROCOPY_RECEIVE is
>> used on one VMA.
>>
>> Ideally a vm_replace_page() would avoid this cost ?
>
> If you don't zap the page range, any of the CPUs in the system where
> any thread in this task have ever run may have a TLB entry pointing to
> this page ... if the page is being recycled into the page allocator,
> then that page might end up as a slab page or page table or page cache
> while the other CPU still have access to it.

Indeed.  This is one of the reasons that Linus has generally been
quite vocal that he doesn't like MMU-based zerocopy schemes.

>
> You could hang onto the page until you've built up a sufficiently large
> batch, then bulk-invalidate all of the TLB entries, but we start to get
> into weirdnesses on different CPU architectures.

The existing mmu_gather code should already handle this at least
moderately well.  If it's not, then it should be fixed.

On x86, there is no operation to flush a range of addresses.  You can
flush one address or you can flush all of them.  If you flush one page
at a time, then you might never recover the performance of a plain old
memcpy().  If you flush all of them, then you're hurting the
performance of everything else in the task.

In general, I suspect that the zerocopy receive mechanism will only
really be a win in single-threaded applications that consume large
amounts of receive bandwidth on a single TCP socket using lots of
memory and don't do all that much else.


Re: [PATCH net-next 0/4] mm,tcp: provide mmap_hook to solve lockdep issue

2018-04-23 Thread Andy Lutomirski
On Mon, Apr 23, 2018 at 2:38 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
> Hi Andy
>
> On 04/23/2018 02:14 PM, Andy Lutomirski wrote:

>> I would suggest that you rework the interface a bit.  First a user would 
>> call mmap() on a TCP socket, which would create an empty VMA.  (It would set 
>> vm_ops to point to tcp_vm_ops or similar so that the TCP code could 
>> recognize it, but it would have no effect whatsoever on the TCP state 
>> machine.  Reading the VMA would get SIGBUS.)  Then a user would call a new 
>> ioctl() or setsockopt() function and pass something like:
>
>
>>
>> struct tcp_zerocopy_receive {
>>   void *address;
>>   size_t length;
>> };
>>
>> The kernel would verify that [address, address+length) is entirely inside a 
>> single TCP VMA and then would do the vm_insert_range magic.
>
> I have no idea what is the proper API for that.
> Where the TCP VMA(s) would be stored ?
> In TCP socket, or MM layer ?

MM layer.  I haven't tested this at all, and the error handling is
totally wrong, but I think you'd do something like:

len = get_user(...);

down_read(&current->mm->mmap_sem);

vma = find_vma(mm, start);
if (!vma || vma->vm_start > start)
  return -EFAULT;

/* This is buggy.  You also need to check that the file is a socket.
This is probably trivial. */
if (vma->vm_file->private_data != sock)
  return -EINVAL;

if (len > vma->vm_end - start)
  return -EFAULT;  /* too big a request. */

and now you'd do the vm_insert_page() dance, except that you don't
have to abort the whole procedure if you discover that something isn't
aligned right.  Instead you'd just stop and tell the caller that you
didn't map the full requested size.  You might also need to add some
code to charge the caller for the pages that get pinned, but that's an
orthogonal issue.

You also need to provide some way for user programs to signal that
they're done with the page in question.  MADV_DONTNEED might be
sufficient.

In the mmap() helper, you might want to restrict the mapped size to
something reasonable.  And it might be nice to hook mremap() to
prevent user code from causing too much trouble.

With my x86-writer-of-TLB-code hat on, I expect the major performance
costs to be the generic costs of mmap() and munmap() (which only
happen once per socket instead of once per read if you like my idea),
the cost of a TLB miss when the data gets read (really not so bad on
modern hardware), and the cost of the TLB invalidation when user code
is done with the buffers.  The latter is awful, especially in
multithreaded programs.  In fact, it's so bad that it might be worth
mentioning in the documentation for this code that it just shouldn't
be used in multithreaded processes.  (Also, on non-PCID hardware,
there's an annoying situation in which a recently-migrated thread that
removes a mapping sends an IPI to the CPU that the thread used to be
on.  I thought I had a clever idea to get rid of that IPI once, but it
turned out to be wrong.)

Architectures like ARM that have superior TLB handling primitives will
not be hurt as badly if this is used by a multithreaded program.

>
>
> And I am not sure why the error handling would be better (point 4), unless we 
> can return smaller @length than requested maybe ?

Exactly.  If I request 10MB mapped and only the first 9MB are aligned
right, I still want the first 9 MB.

>
> Also how the VMA space would be accounted (point 3) when creating an empty 
> VMA (no pages in there yet)

There's nothing to account.  It's the same as mapping /dev/null or
similar -- the mm core should take care of it for you.


Re: [PATCH net-next 0/4] mm,tcp: provide mmap_hook to solve lockdep issue

2018-04-23 Thread Andy Lutomirski

On 04/20/2018 08:55 AM, Eric Dumazet wrote:

> This patch series provide a new mmap_hook to fs willing to grab
> a mutex before mm->mmap_sem is taken, to ensure lockdep sanity.
>
> This hook allows us to shorten tcp_mmap() execution time (while mmap_sem
> is held), and improve multi-threading scalability.



I think that the right solution is to rework mmap() on TCP sockets a 
bit.  The current approach in net-next is very strange for a few reasons:


1. It uses mmap() as an operation that has side effects besides just 
creating a mapping.  If nothing else, it's surprising, since mmap() 
doesn't usually do that.  But it's also causing problems like what 
you're seeing.


2. The performance is worse than it needs to be.  mmap() is slow, and I 
doubt you'll find many mm developers who consider this particular abuse 
of mmap() to be a valid thing to optimize for.


3. I'm not at all convinced the accounting is sane.  As far as I can 
tell, you're allowing unprivileged users to increment the count on 
network-owned pages, limited only by available virtual memory, without 
obviously charging it to the socket buffer limits.  It looks like a 
program that simply forgot to call munmap() would cause the system to 
run out of memory, and I see no reason to expect the OOM killer to have 
any real chance of killing the right task.


4. Error handling sucks.  If I try to mmap() a large range (which is the 
whole point -- using a small range will kill performance) and not quite 
all of it can be mapped, then I waste a bunch of time in the kernel and 
get *none* of the range mapped.


I would suggest that you rework the interface a bit.  First a user would 
call mmap() on a TCP socket, which would create an empty VMA.  (It would 
set vm_ops to point to tcp_vm_ops or similar so that the TCP code could 
recognize it, but it would have no effect whatsoever on the TCP state 
machine.  Reading the VMA would get SIGBUS.)  Then a user would call a 
new ioctl() or setsockopt() function and pass something like:


struct tcp_zerocopy_receive {
  void *address;
  size_t length;
};

The kernel would verify that [address, address+length) is entirely 
inside a single TCP VMA and then would do the vm_insert_range magic.  On 
success, length is changed to the length that actually got mapped.  The 
kernel could do this while holding mmap_sem for *read*, and it could get 
the lock ordering right.  If and when mm range locks ever get merged, it 
could switch to using a range lock.


Then you could use MADV_DONTNEED or another ioctl/setsockopt to zap the 
part of the mapping that you're done with.


Does this seem reasonable?  It should involve very little code change, 
it will run faster, it will scale better, and it is much less weird IMO.
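
For concreteness, the userspace side of this proposal might look like the
following (illustrative only; note that writing back the mapped length would
actually need a getsockopt() or ioctl() form, since setsockopt() has no way
to return data):

	struct tcp_zerocopy_receive zc = {
		.address = addr,	/* somewhere inside the mmap()ed VMA */
		.length  = want,
	};

	/* The kernel maps received pages and shrinks zc.length to what it
	 * actually managed to map. */
	tcp_zerocopy_receive(sock_fd, &zc);	/* the new ioctl/sockopt */

	/* ... consume zc.length bytes at zc.address ... */

	/* Tell the kernel we're done with this part of the mapping. */
	madvise(zc.address, zc.length, MADV_DONTNEED);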


Re: [PATCH bpf-next v8 05/11] seccomp,landlock: Enforce Landlock programs per process hierarchy

2018-04-08 Thread Andy Lutomirski
On Sun, Apr 8, 2018 at 6:13 AM, Mickaël Salaün <m...@digikod.net> wrote:
>
> On 02/27/2018 10:48 PM, Mickaël Salaün wrote:
>>
>> On 27/02/2018 17:39, Andy Lutomirski wrote:
>>> On Tue, Feb 27, 2018 at 5:32 AM, Alexei Starovoitov
>>> <alexei.starovoi...@gmail.com> wrote:
>>>> On Tue, Feb 27, 2018 at 05:20:55AM +, Andy Lutomirski wrote:
>>>>> On Tue, Feb 27, 2018 at 4:54 AM, Alexei Starovoitov
>>>>> <alexei.starovoi...@gmail.com> wrote:
>>>>>> On Tue, Feb 27, 2018 at 04:40:34AM +, Andy Lutomirski wrote:
>>>>>>> On Tue, Feb 27, 2018 at 2:08 AM, Alexei Starovoitov
>>>>>>> <alexei.starovoi...@gmail.com> wrote:
>>>>>>>> On Tue, Feb 27, 2018 at 01:41:15AM +0100, Mickaël Salaün wrote:
>>>>>>>>> The seccomp(2) syscall can be used by a task to apply a Landlock program
>>>>>>>>> to itself. As a seccomp filter, a Landlock program is enforced for the
>>>>>>>>> current task and all its future children. A program is immutable and a
>>>>>>>>> task can only add new restricting programs to itself, forming a list of
>>>>>>>>> programs.
>>>>>>>>>
>>>>>>>>> A Landlock program is tied to a Landlock hook. If the action on a kernel
>>>>>>>>> object is allowed by the other Linux security mechanisms (e.g. DAC,
>>>>>>>>> capabilities, other LSM), then a Landlock hook related to this kind of
>>>>>>>>> object is triggered. The list of programs for this hook is then
>>>>>>>>> evaluated. Each program returns a 32-bit value which can deny the action
>>>>>>>>> on a kernel object with a non-zero value. If every program of the list
>>>>>>>>> returns zero, then the action on the object is allowed.
>>>>>>>>>
>>>>>>>>> Multiple Landlock programs can be chained to share a 64-bit value for a
>>>>>>>>> call chain (e.g. evaluating multiple elements of a file path).  This
>>>>>>>>> chaining is restricted when a process constructs this chain by loading a
>>>>>>>>> program, but additional checks are performed when it requests to apply
>>>>>>>>> this chain of programs to itself.  The restrictions ensure that it is
>>>>>>>>> not possible to call multiple programs in a way that would imply
>>>>>>>>> handling multiple shared values (i.e. cookies) for one chain.  For now,
>>>>>>>>> only a fs_pick program can be chained to the same type of program,
>>>>>>>>> because it may make sense if they have different triggers (cf. next
>>>>>>>>> commits).  This restriction still allows Landlock programs to be reused
>>>>>>>>> in a safe way (e.g. use the same loaded fs_walk program with multiple
>>>>>>>>> chains of fs_pick programs).
>>>>>>>>>
>>>>>>>>> Signed-off-by: Mickaël Salaün <m...@digikod.net>
>>>>>>>>
>>>>>>>> ...
>>>>>>>>
>>>>>>>>> +struct landlock_prog_set *landlock_prepend_prog(
>>>>>>>>> + struct landlock_prog_set *current_prog_set,
>>>>>>>>> + struct bpf_prog *prog)
>>>>>>>>> +{
>>>>>>>>> + struct landlock_prog_set *new_prog_set = current_prog_set;
>>>>>>>>> + unsigned long pages;
>>>>>>>>> + int err;
>>>>>>>>> + size_t i;
>>>>>>>>> + struct landlock_prog_set tmp_prog_set = {};
>>>>>>>>> +
>>>>>>>>> + if (prog->type != BPF_PROG_TYPE_LANDLOCK_HOOK)
>>>>>>>>> + return ERR_PTR(-EINVAL);
>>>>>>>>> +
>>>>>>>>> + /* validate memory size allocation */
>>>>>>>>> + pages = prog->

Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-22 Thread Andy Lutomirski
On Thu, Mar 22, 2018 at 8:54 PM, Luis R. Rodriguez  wrote:
> If we can ensure that these usermode modules don't take *any time at all* on
> their init *from the start*, it would be wonderful and we'd end up avoiding
> some really odd corner case issues later.
>

I don't see why this issue needs to exist at all for the new stuff.
After all, the new things aren't usermode modules per se.  They're
regular kernel code (modular or otherwise) that loads a usermode
helper.  All we need to do is make sure that, if this is
distributed as a module, its init routine doesn't wait for a
long time, right?


Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access

2018-03-22 Thread Andy Lutomirski
On Thu, Mar 22, 2018 at 5:40 PM, Alexei Starovoitov wrote:
> On Thu, Mar 22, 2018 at 10:33:43AM +0100, Ingo Molnar wrote:
>>
>>  - I think the BPF JIT, whose byte code machine language is used by an
>>increasing number of kernel subsystems, could benefit from having vector ops.
>>It would possibly allow the handling of floating point types.
>
> this is on our todo list already.
> To process certain traffic inside BPF in XDP we'd like to have access to
> floating point. The current workaround is to precompute the math and do
> bpf map lookup instead.
> Since XDP processing of packets is batched (typically up to napi budget
> of 64 packets at a time), we can, in theory, wrap the loop with
> kernel_fpu_begin/end and it will be cleaner and faster,
> but the work hasn't started yet.
> The microbenchmark numbers you quoted for xsave/xrestore look promising,
> so we probably will focus on it soon.
>
> Another use case for vector insns is to accelerate fib/lpm lookups
> which is likely beneficial for kernel overall regardless of bpf usage.
>

This is a further argument for the deferred restore approach IMO.
With deferred restore, kernel_fpu_begin() + kernel_fpu_end() has very
low amortized cost as long as we do lots of work in the kernel before
re-entering userspace.  For XDP workloads, this could be a pretty big
win, I imagine.

Someone just needs to do the nasty x86 work.
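
A sketch of the batching Alexei describes, assuming BPF someday gains
floating point support (the loop shape is only illustrative of a driver's
napi poll; none of this exists today):

	/* One XSAVE/XRSTOR amortized over the whole batch, instead of a
	 * kernel_fpu_begin()/end() pair per packet. */
	kernel_fpu_begin();
	for (i = 0; i < budget; i++)
		act[i] = bpf_prog_run_xdp(prog, &xdp_buff[i]);	/* may use FP */
	kernel_fpu_end();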


Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access

2018-03-21 Thread Andy Lutomirski
On Wed, Mar 21, 2018 at 6:32 AM, Ingo Molnar  wrote:
>
> * Linus Torvalds  wrote:
>
>> And even if you ignore that "maintenance problems down the line" issue
>> ("we can fix them when they happen") I don't want to see games like
>> this, because I'm pretty sure it breaks the optimized xsave by tagging
>> the state as being dirty.
>
> That's true - and it would penalize the context switch cost of the affected 
> task
> for the rest of its lifetime, as I don't think there's much that clears XINUSE
> other than a FINIT, which is rarely done by user-space.
>
>> So no. Don't use vector stuff in the kernel. It's not worth the pain.
>
> I agree, but:
>
>> The *only* valid use is pretty much crypto, and even there it has had issues.
>> Benchmarks use big arrays and/or dense working sets etc to "prove" how good 
>> the
>> vector version is, and then you end up in situations where it's used once per
>> fairly small packet for an interrupt, and it's actually much worse than 
>> doing it
>> by hand.
>
> That's mainly because the XSAVE/XRESTOR done by kernel_fpu_begin()/end() is so
> expensive, so this argument is somewhat circular.

If we do the deferred restore, then the XSAVE/XRSTOR happens at most
once per kernel entry, which isn't so bad IMO.  Also, with PTI, kernel
entries are already so slow that this will be mostly in the noise :(

--Andy


Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access

2018-03-20 Thread Andy Lutomirski
On Tue, Mar 20, 2018 at 3:10 PM, David Laight <david.lai...@aculab.com> wrote:
> From: Andy Lutomirski
>> Sent: 20 March 2018 14:57
> ...
>> I'd rather see us finally finish the work that Rik started to rework
>> this differently.  I'd like kernel_fpu_begin() to look like:
>>
>> if (test_thread_flag(TIF_NEED_FPU_RESTORE)) {
>>   return; // we're already okay.  maybe we need to check
>> in_interrupt() or something, though?
>> } else {
>>   XSAVES/XSAVEOPT/XSAVE;
>>   set_thread_flag(TIF_NEED_FPU_RESTORE);
>> }
>>
>> and kernel_fpu_end() does nothing at all.
>
> I guess it might need to set (clear?) the CFLAGS bit for a process
> that isn't using the fpu at all - which seems a sensible feature.

What do you mean "CFLAGS"?

But we no longer have any concept of "isn't using the fpu at all" --
we got rid of that.

>
>> We take the full performance hit for a *single* kernel_fpu_begin() on
>> an otherwise short syscall or interrupt, but there's no additional
>> cost for more of them or for long-enough-running things that we
>> schedule in the middle.
>
> It might be worth adding a parameter to kernel_fpu_begin() to indicate
> which registers are needed, and a return value to say what has been
> granted.
> Then a driver could request AVX2 (for example) and use a fallback path
> if the register set isn't available (for any reason).
> A call from an ISR could always fail.

Last time I benchmarked it, XSAVEC on none of the state wasn't a whole
lot faster than XSAVEC for all of it.

>
>> As I remember, the main hangup was that this interacts a bit oddly
>> with PKRU, but that's manageable.
>
> WTF PKRU ??

PKRU is uniquely demented.  All the rest of the XSAVEC state only
affects code that explicitly references that state.  PKRU affects
every single access to user pages, so we need PKRU to match the
current task at all times in the kernel.  This means that, if we start
deferring XRSTORS until prepare_exit_to_usermode(), we need to start
restoring PKRU using WRPKRU in __switch_to().  Of course, *that*
interacts a bit oddly with XINUSE, but maybe we don't care.

Phooey on you, Intel, for putting PKRU into xstate and not giving a
fast instruction to control XINUSE.
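
To illustrate the PKRU caveat (a sketch only; the thread.pkru field and the
hook placement are assumptions about what a deferred-restore implementation
would need, not existing code):

/* With XRSTORS deferred to exit-to-usermode, PKRU can no longer ride
 * along in xstate: it affects every access to user pages, so the
 * incoming task's value has to be installed by hand at context switch. */
static inline void switch_pkru(struct task_struct *next)
{
	if (boot_cpu_has(X86_FEATURE_OSPKE))
		__write_pkru(next->thread.pkru);	/* WRPKRU */
}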


Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access

2018-03-20 Thread Andy Lutomirski
On Tue, Mar 20, 2018 at 8:26 AM, Ingo Molnar  wrote:
>
> * Thomas Gleixner  wrote:
>
>> > Useful also for code that needs AVX-like registers to do things like CRCs.
>>
>> x86/crypto/ has a lot of AVX optimized code.
>
> Yeah, that's true, but the crypto code is processing fundamentally bigger 
> blocks
> of data, which amortizes the cost of using kernel_fpu_begin()/_end().
>
> kernel_fpu_begin()/_end() is a pretty heavy operation because it does a full 
> FPU
> save/restore via the XSAVE[S] and XRSTOR[S] instructions, which can easily 
> copy a
> thousand bytes around! So kernel_fpu_begin()/_end() is probably a non-starter 
> for
> something small, like a single 256-bit or 512-bit word access.
>
> But there's actually a new thing in modern kernels: we got rid of (most of) 
> lazy
> save/restore FPU code, our new x86 FPU model is very "direct" with no FPU 
> faults
> taken normally.
>
> So assuming the target driver will only load on modern FPUs I *think* it 
> should
> actually be possible to do something like (pseudocode):
>
> vmovdqa %ymm0, 40(%rsp)
> vmovdqa %ymm1, 80(%rsp)
>
> ...
> # use ymm0 and ymm1
> ...
>
> vmovdqa 80(%rsp), %ymm1
> vmovdqa 40(%rsp), %ymm0
>

I think this kinda sorta works, but only kinda sorta:

 - I'm a bit worried about newer CPUs where, say, a 256-bit vector
operation will implicitly clear the high 768 bits of the regs.  (IIRC
that's how it works for the new VEX stuff.)

 - This code will cause XINUSE to be set, which is user-visible and
may have performance and correctness effects.  I think the kernel may
already have some bug where it occasionally sets XINUSE on its own,
and this caused problems for rr in the past.  This might not be a
total showstopper, but it's odd.

I'd rather see us finally finish the work that Rik started to rework
this differently.  I'd like kernel_fpu_begin() to look like:

if (test_thread_flag(TIF_NEED_FPU_RESTORE)) {
  return; // we're already okay.  maybe we need to check
in_interrupt() or something, though?
} else {
  XSAVES/XSAVEOPT/XSAVE;
  set_thread_flag(TIF_NEED_FPU_RESTORE);
}

and kernel_fpu_end() does nothing at all.

We take the full performance hit for a *single* kernel_fpu_begin() on
an otherwise short syscall or interrupt, but there's no additional
cost for more of them or for long-enough-running things that we
schedule in the middle.
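
Spelled out a little more (still only a sketch of the same idea; the
save/restore helpers stand in for whatever primitives the real patch would
use, and both the in_interrupt() question and the scheduling interactions
are glossed over):

void kernel_fpu_begin(void)
{
	/* A previous section already saved user state; the registers are
	 * ours until exit-to-usermode, so there's nothing to do. */
	if (test_thread_flag(TIF_NEED_FPU_RESTORE))
		return;

	copy_fpregs_to_fpstate(&current->thread.fpu);	/* XSAVES/XSAVEOPT/XSAVE */
	set_thread_flag(TIF_NEED_FPU_RESTORE);
}

void kernel_fpu_end(void)
{
	/* Nothing: the expensive XRSTOR is deferred. */
}

/* Then, on the way back to userspace: */
void restore_user_fpstate(void)	/* from prepare_exit_to_usermode() */
{
	if (test_and_clear_thread_flag(TIF_NEED_FPU_RESTORE))
		copy_kernel_to_fpregs(&current->thread.fpu.state);	/* XRSTOR */
}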

As I remember, the main hangup was that this interacts a bit oddly
with PKRU, but that's manageable.

--Andy


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-10 Thread Andy Lutomirski
On Sat, Mar 10, 2018 at 1:43 AM, Alexei Starovoitov <a...@fb.com> wrote:
> On 3/9/18 11:37 AM, Andy Lutomirski wrote:
>>
>> On Fri, Mar 9, 2018 at 6:55 PM, David Miller <da...@davemloft.net> wrote:
>>>
>>> From: Alexei Starovoitov <a...@fb.com>
>>> Date: Fri, 9 Mar 2018 10:50:49 -0800
>>>
>>>> On 3/9/18 10:23 AM, Andy Lutomirski wrote:
>>>>>
>>>>> It might not be totally crazy to back it by tmpfs.
>>>>
>>>>
>>>> interesting. how do you propose to do it?
>>>> Something like:
>>>> - create /umh_module_tempxxx dir
>>>> - mount tmpfs there
>>>> - copy elf into it and exec it?
>>>
>>>
>>> I think the idea is that it's an internal tmpfs mount that only
>>> the kernel has access too.
>>
>>
>> That's what I was imagining.  There's precedent.  For example, there's
>> a very short piece of code that does it in
>> drivers/gpu/drm/i915/i915_gemfs.c.
>
>
> I can do "monkey see monkey do" approach which will look like:
> type = get_fs_type("tmpfs");
> fs = kern_mount(type);
>
> /* for each request_umh("foo") */
> file = shmem_file_setup_with_mnt(fs, "umh_foo");
> do {
>   pagecache_write_begin(file,...);
>   memcpy()
>   pagecache_write_end();
> } while (umh_elf_size);
> do_execve_file(file);
> fput(file);
>
> while keeping fs mounted forever?
> is there better way?
>

Nice!  I'm definitely not a pagecache expert, but it looks generally
sane.  Once the thing is actually functional, we can bang on it, and
I'm sure that linux-mm will have some suggestions to tidy it up.

As for the actual lifetime of the filesystem, I think it should be
mounted once and never unmounted.  Whenever it gains a second user,
the whole thing can be moved to mm/ or lib/ and all the users can
share the same mount.

Minor caveat: I would arrange the code a bit differently, like this:

static (or extern) unsigned char __initdata the_blob[];
static struct file *umh_blob_file;

static int __init my_module_init_function(void)
{
 /* for each request_umh("foo") */
 umh_blob_file = shmem_file_setup_with_mnt(fs, "umh_foo");
 do {
   pagecache_write_begin(umh_blob_file,...);
   memcpy()
   pagecache_write_end();
 } while (umh_elf_size);

 /* the_blob is implicitly freed after this returns */
}

and then actually use the struct file later on.  If and when you're
sure you're not going to spawn another copy, you can fput() it.  This
way the memory becomes swappable immediately on load.

As for request_module() vs request_module_umh(), my advice would be to
write the code and then see what interface makes sense.  I wouldn't be
surprised if it ends up making more sense to keep all of this entirely
independent from the module system.

P.S. I suspect that, before this hits a release, someone's going to
have to fiddle with the LSM hooks in do_execve() a bit to make sure
that LSM unconditionally approves this type of umh program.  Otherwise
there might be pointless failures on some more locked down
configurations.  But that can wait until it's more final and the
security folks review the code.


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-09 Thread Andy Lutomirski
On Fri, Mar 9, 2018 at 7:38 PM, Linus Torvalds wrote:
> On Fri, Mar 9, 2018 at 11:12 AM, Linus Torvalds wrote:
>>
>> How are you going to handle five processes doing the same setup concurrently?
>
> Side note: it's not just serialization. It's also "is it actually up
> and running".
>

I think the right way to solve this would be to take a hint from
systemd's socket activation model.  The current patch has the module
load process kick off an ELF binary that goes and registers itself to
handle something.  We can turn that around.  Make the module init
function create the socket (or pipe or whatever) that receives
requests, and pass it to the user program as stdin.  Then the kernel
can start queueing requests into the socket immediately, and the user
program will get to them whenever it finishes initializing.  Or it can
write some message to the socket saying "hey, I'm ready".

This also completely avoids the issue where some clever user manually
loads the "module" with exec() ("hey, I'm so clever, I can just run
the damn thing instead of using init_module()!"), or writes an
out-of-tree program that uses whatever supposedly secret API the
in-kernel binary is supposed to use to register itself (and I know
people who would do exactly that!), and the kernel does
request_module() at roughly the same time.
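
A sketch of that inversion (sock_create_kern() is real, but
kernel_socketpair() and umh_spawn_with_stdin() are made-up names for the two
pieces this proposal would actually need):

static struct socket *req_sock;	/* kernel end; requests queue here */

static int __init my_umh_driver_init(void)
{
	struct socket *child_end;
	int err;

	/* Hypothetical in-kernel socketpair(2) equivalent. */
	err = kernel_socketpair(AF_UNIX, SOCK_SEQPACKET, &req_sock, &child_end);
	if (err)
		return err;

	/* The kernel can queue requests on req_sock immediately; the
	 * helper drains them whenever it finishes initializing. */

	/* Hypothetical: spawn the blob with child_end as its stdin. */
	return umh_spawn_with_stdin(the_blob, blob_len, child_end);
}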


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-09 Thread Andy Lutomirski
On Fri, Mar 9, 2018 at 6:55 PM, David Miller <da...@davemloft.net> wrote:
> From: Alexei Starovoitov <a...@fb.com>
> Date: Fri, 9 Mar 2018 10:50:49 -0800
>
>> On 3/9/18 10:23 AM, Andy Lutomirski wrote:
>>> It might not be totally crazy to back it by tmpfs.
>>
>> interesting. how do you propose to do it?
>> Something like:
>> - create /umh_module_tempxxx dir
>> - mount tmpfs there
>> - copy elf into it and exec it?
>
> I think the idea is that it's an internal tmpfs mount that only
> the kernel has access too.

That's what I was imagining.  There's precedent.  For example, there's
a very short piece of code that does it in
drivers/gpu/drm/i915/i915_gemfs.c.


>
> And I don't think that even hurts your debuggability concerns.  The
> user can just attach using the foo.ko file in the actual filesystem.
>

Not if the .ko is actually a shim that just contains a blob
and a few lines of code to kick off the umh.  But one could still
debug it using kernel debug symbols (like vDSO debugging works right
now, at least if your distro is in a good mood) or by reading the
contents from /proc/PID/exe.


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-09 Thread Andy Lutomirski


> On Mar 9, 2018, at 10:17 AM, Linus Torvalds wrote:
> 

> 
> Hmm. I wish we had an "execute blob" model, but we really don't, and
> it would be hard/impossible to do without pinning the pages in memory.
> 

Why so hard?  We can already execute a struct file for execveat, and Alexei 
already has this working for umh. Surely we can make an immutable (as in even 
root can’t write it) kernel-internal tmpfs file, execveat it, then unlink it. 
And /proc/PID/exe should be openable and readable.  The blob itself would be 
__initdata so it gets discarded after it lands in tmpfs. 

Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-09 Thread Andy Lutomirski


> On Mar 9, 2018, at 10:15 AM, Greg KH  wrote:
> 

> 
> Oh, and for the record, I like Andy's proposal as well as dumping this
> into a kernel module "blob" with the exception that this now would take
> up unswapable memory, which isn't the nicest and is one big reason we
> removed the in-kernel-memory firmware blobs many years ago.
> 

It might not be totally crazy to back it by tmpfs. 


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-09 Thread Andy Lutomirski
On Fri, Mar 9, 2018 at 3:39 PM, Alexei Starovoitov <a...@fb.com> wrote:
> On 3/9/18 7:16 AM, Andy Lutomirski wrote:
>>>>
>>>> On Mar 8, 2018, at 9:08 PM, Alexei Starovoitov <a...@fb.com> wrote:
>>>>
>>>> On 3/8/18 7:54 PM, Andy Lutomirski wrote:
>>>>
>>>>
>>>>
>>>>> On Mar 8, 2018, at 7:06 PM, Linus Torvalds <torva...@linux-foundation.org> wrote:
>>>>>
>>>>>
>>>>> Honestly, that "read twice" thing may be what scuttles this.
>>>>> Initially, I thought it was a non-issue, because anybody who controls
>>>>> the module subdirectory enough to rewrite files would be in a position
>>>>> to just execute the file itself directly instead.
>>>>
>>>>
>>>> On further consideration, I think there’s another showstopper. This
>>>> patch is a potentially severe ABI break. Right now, loading a module
>>>> *copies* it into memory and does not hold a reference to the underlying fs.
>>>> With the patch applied, all kinds of use cases can break in gnarly ways.
>>>> Initramfs is maybe okay, but initrd may be screwed. If you load an ET_EXEC
>>>> module from initrd, then umount it, then clear the ramdisk, something will
>>>> go horribly wrong. Exactly what goes wrong depends on whether userspace
>>>> notices that umount() failed. Similarly, if you load one of these modules
>>>> over a network and then lose your connection, you have a problem.
>>>
>>>
>>> there is not abi breakage and file cannot disappear from running task.
>>> One cannot umount fs while file is still being used.
>>
>>
>> Sure it is. Without your patch, init_module doesn’t keep using the
>> file, so it’s common practice to load a module and then delete or
>> unmount it. With your patch, the unmount case breaks. This is likely
>> to break existing userspace, so, in Linux speak it’s an ABI break.
>
>
> please read the patch again.
> file is only used in case of umh modules.
> There is zero difference in default case.

Say I'm running some distro or other working Linux setup.  I upgrade
my kernel to a kernel that uses umh modules.  The user tooling
generates some kind of boot entry that references the new kernel
image, and it also generates a list of modules to be loaded at various
times in the boot process.  This list might, and probably should,
include one or more umh modules.  (You are being very careful to make
sure that depmod keeps working, so umh modules are clearly intended to
work with existing tooling.)  So now I have a kernel image and some
modules to be loaded from various places.  And I have an init script
(initramfs's '/init' or similar) that will call init_module() on that
.ko file.  That script was certainly written under the assumption
that, once init_module() returns, the kernel is done with the .ko
file.  With your patch applied, that assumption is no longer true.

RHEL 5 uses initrd and is still supported.  For all I know, some
embedded setups put their initial /lib on some block device that
literally disappears after they finish booting.  There are livecds
that let you boot in a mode that lets you remove the CD after you
finish booting.  Heck, someone must have built something that calls
init_module() on a FUSE filesystem.

Heck, on my laptop, all my .ko files are labeled
system_u:object_r:modules_object_t:s0.  I wonder how many SELinux
setups (and AppArmor, etc) will actually disallow execve() on modules?

>
>>>
>>>>
>>>> The “read twice” thing is also bad for another reason: containers.
>>>> Suppose I have a setup where a container can load a signed module blob. 
>>>> With
>>>> the read twice code, the container can race and run an entirely different
>>>> blob outside the container.
>>>
>>>
>>> Not only "read twice", but "read many".
>>> If .text sections of elf that are not yet in memory can be modified
>>> by malicious user, later they will be brought in with different code.
>>> I think the easiest fix to tighten this "umh modules" to CAP_SYS_ADMIN.
>>
>>
>> Given this issue, I think the patch would need Kees’s explicit ack. I
>> had initially thought your patch had minimal security impact, but I
>> was wrong. Module security is very complicated and needs to satisfy a
>> bunch of requirements. There is a lot of code in the kernel that
>> assumes that it’s sufficient to verify a module once at load time,
>> your patch changes that, and this has all kinds of nasty interactions
>> with autoloading.
>
>
> not true. you misread the patch and making incorrect conclusions.

So please tell me exactly how I misread the patch and why Linus'
conclusion (which is what I'm echoing) is wrong.

>
> I think you need to stop overreacting on non-issue.
>

Can you please try to have a constructive discussion here?  It's not
so enjoyable to review patches when author declares review comments to
be non-issues without actually explaining *why* they're non-issues.

Alexei, I'm willing to be shown that I'm wrong, but you have to show
me how I'm wrong rather than just telling me over and over that I'm
wrong.


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-09 Thread Andy Lutomirski
>> On Mar 8, 2018, at 9:08 PM, Alexei Starovoitov <a...@fb.com> wrote:
>>
>> On 3/8/18 7:54 PM, Andy Lutomirski wrote:
>>
>>
>>
>>> On Mar 8, 2018, at 7:06 PM, Linus Torvalds <torva...@linux-foundation.org> wrote:
>>>
>>>
>>> Honestly, that "read twice" thing may be what scuttles this.
>>> Initially, I thought it was a non-issue, because anybody who controls
>>> the module subdirectory enough to rewrite files would be in a position
>>> to just execute the file itself directly instead.
>>
>> On further consideration, I think there’s another showstopper. This patch is 
>> a potentially severe ABI break. Right now, loading a module *copies* it into 
>> memory and does not hold a reference to the underlying fs. With the patch 
>> applied, all kinds of use cases can break in gnarly ways. Initramfs is maybe 
>> okay, but initrd may be screwed. If you load an ET_EXEC module from initrd, 
>> then umount it, then clear the ramdisk, something will go horribly wrong. 
>> Exactly what goes wrong depends on whether userspace notices that umount() 
>> failed. Similarly, if you load one of these modules over a network and then 
>> lose your connection, you have a problem.
>
> there is not abi breakage and file cannot disappear from running task.
> One cannot umount fs while file is still being used.

Sure it is. Without your patch, init_module doesn’t keep using the
file, so it’s common practice to load a module and then delete or
unmount it. With your patch, the unmount case breaks. This is likely
to break existing userspace, so, in Linux speak it’s an ABI break.

>
>>
>> The “read twice” thing is also bad for another reason: containers. Suppose I 
>> have a setup where a container can load a signed module blob. With the read 
>> twice code, the container can race and run an entirely different blob 
>> outside the container.
>
> Not only "read twice", but "read many".
> If .text sections of elf that are not yet in memory can be modified
> by malicious user, later they will be brought in with different code.
> I think the easiest fix to tighten this "umh modules" to CAP_SYS_ADMIN.

Given this issue, I think the patch would need Kees’s explicit ack. I
had initially thought your patch had minimal security impact, but I
was wrong. Module security is very complicated and needs to satisfy a
bunch of requirements. There is a lot of code in the kernel that
assumes that it’s sufficient to verify a module once at load time,
your patch changes that, and this has all kinds of nasty interactions
with autoloading.  Kees is very reasonable, and he’ll change his mind
and ack a patch that he’s nacked when presented with a valid technical
argument.

But I think my ABI break observation is also a major problem, and
Linus is going to be pissed if this thing lands in his tree and breaks
systems due to an issue that was raised during review. So I think you
need to either rework the patch or do a serious survey of how all the
distros deal with modules (dracut, initramfs-tools, all the older
stuff, and probably more) and make sure they can all handle your
patch.  I'd also be concerned about anyone who puts /lib/modules on
less reliable storage than they use for the rest of their system.  (I
know it's quite common to have /boot be the only non-RAID partition on
a system, but modules don't generally live in /boot.)

Also, I think you really ought to explain how your approach will work
with MODULES=n or convince Linus that it’s okay to start adding core
networking features that don’t work with MODULES=n and can't be built
into the main kernel image.


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Andy Lutomirski



> On Mar 8, 2018, at 7:06 PM, Linus Torvalds  
> wrote:
> 
> 
> Honestly, that "read twice" thing may be what scuttles this.
> Initially, I thought it was a non-issue, because anybody who controls
> the module subdirectory enough to rewrite files would be in a position
> to just execute the file itself directly instead.
> 

On further consideration, I think there’s another showstopper. This patch is a 
potentially severe ABI break. Right now, loading a module *copies* it into 
memory and does not hold a reference to the underlying fs. With the patch 
applied, all kinds of use cases can break in gnarly ways. Initramfs is maybe 
okay, but initrd may be screwed. If you load an ET_EXEC module from initrd, 
then umount it, then clear the ramdisk, something will go horribly wrong. 
Exactly what goes wrong depends on whether userspace notices that umount() 
failed. Similarly, if you load one of these modules over a network and then 
lose your connection, you have a problem. 


The “read twice” thing is also bad for another reason: containers. Suppose I 
have a setup where a container can load a signed module blob. With the read 
twice code, the container can race and run an entirely different blob outside 
the container. 

Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Andy Lutomirski



> On Mar 8, 2018, at 6:31 PM, David Miller <da...@davemloft.net> wrote:
> 
> From: Andy Lutomirski <l...@kernel.org>
> Date: Fri, 9 Mar 2018 02:12:24 +
> 
>> First, compile your user code and emit a static binary.  Use objdump
>> fiddling or a trivial .S file to make that static binary into a
>> variable.  Then write a tiny shim module like this:
>> 
>> extern unsigned char __begin_user_code[], __end_user_code[];
>> 
>> int __init init_shim_module(void)
>> {
>>  return call_umh_blob(__begin_user_code, __end_user_code - __begin_user_code);
>> }
>> 
>> By itself, this is clearly a worse solution than yours, but it has three
>> benefits, one small and two big.  The small benefit is that it is
>> completely invisible to userspace: the .ko file is a bona fide module.
> 
> Anything you try to do which makes these binaries "special" is a huge
> negative.

I don’t know what you mean.  Alexei’s approach introduces a whole new kind of 
special module.  Mine doesn’t. 

> 
>> The big benefits are:
> 
> I don't see those things as benefits at all, and Alexei's scheme can
> easily be made to work in your benefit #1 case too.
> 

How?  I think you’ll find that a non-modular implementation of a bundled ELF 
binary looks a *lot* like my call_umh_blob().

> It's a user binary.  It's shipped with the kernel and it's signed.
> 
> If we can't trust that, we can't trust much else.

I’m not making any arguments about security at all. I’m talking about 
functionality. 

If we apply Alexei’s patch as is, then I think we’ll have a situation where 
ET_EXEC modules are only useful if they can do their jobs without any 
filesystem access at all.  This is fine for networking, where netlink sockets 
are used, but I think it’s not so great for other use cases. If we ever try to 
stick a usb driver into userspace, we’re going to want to instantiate the user 
task once per device, passed as stdin or similar, and Alexei’s code will make 
that very awkward.



Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Andy Lutomirski
On Fri, Mar 9, 2018 at 1:20 AM, Alexei Starovoitov <alexei.starovoi...@gmail.com> wrote:
> On Fri, Mar 09, 2018 at 12:59:36AM +0000, Andy Lutomirski wrote:
>>
>> Alexei, can you give an example use case?  I'm sure it's upthread
>> somewhere, but I'm having trouble finding it.
>
> at the time of iptable's setsockopt() the kernel will do
> err = request_module("bpfilter");
> once.
> The rough POC code:
> https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/tree/net/ipv4/bpfilter/sockopt.c?h=ipt_bpf#n25

Here's what I gather from reading that code: you have a new kernel
feature (consisting of actual kernel code) that wants to defer some of
its implementation to user mode.  I like this idea a lot.  But I have
a suggestion for a slightly different way of accomplishing the same
thing.  Rather than extending init_module() to accept ELF input,
extend the call_umh code to be able to call blobs.  You'd use it
very roughly like this:

First, compile your user code and emit a static binary.  Use objdump
fiddling or a trivial .S file to make that static binary into a
variable.  Then write a tiny shim module like this:

extern unsigned char __begin_user_code[], __end_user_code[];

int __init init_shim_module(void)
{
  return call_umh_blob(__begin_user_code, __end_user_code - __begin_user_code);
}

By itself, this is clearly a worse solution than yours, but it has three
benefits, one small and two big.  The small benefit is that it is
completely invisible to userspace: the .ko file is a bona fide module.
The big benefits are:

1. It works even in a non-modular kernel!  (Okay, it probably only
works if you can arrange for the built-in module to be initialized
late enough, but that's straightforward.)

2. It allows future extensions to change the way the glue works.  For
example, maybe you want the module to integrate properly with lsmod,
etc.  Rather than adding a mechanism for general privileged programs
to register themselves with lsmod (ick!), you could do it entirely in
the kernel where lsmod would know that a particular umh task is
special.  More usefully, you could extend call_umh_blob() to pass in
some pre-initialized struct files, which would give a clean way to
*synchronously* create a communication channel to user code for
whatever service the user code provides.  And it would be more
straightforward to make the umh blob do what it needs to do without
relying on any particular filesystems being mounted.

I think we don't want to end up in a situation where we ship a program
with a .ko extension that opens something in /dev, for example.

call_umh_blob() would create an anon_inode or similar object backed by
the blob and exec it.
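
For illustration, call_umh_blob() itself might look something like this
sketch (shmem_kernel_file_setup() and kernel_write() are real, but
umh_exec_file() is a placeholder for the missing exec-a-struct-file umh
plumbing):

int call_umh_blob(const unsigned char *blob, size_t len)
{
	struct file *file;
	loff_t pos = 0;
	ssize_t n;

	/* An unlinked, kernel-private, in-memory file backing the blob. */
	file = shmem_kernel_file_setup("umh_blob", len, 0);
	if (IS_ERR(file))
		return PTR_ERR(file);

	n = kernel_write(file, blob, len, &pos);
	if (n != len) {
		fput(file);
		return n < 0 ? n : -EIO;
	}

	/* Exec it synchronously via the umh machinery (placeholder). */
	return umh_exec_file(file);
}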


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Andy Lutomirski
On Fri, Mar 9, 2018 at 12:57 AM, Alexei Starovoitov  wrote:
> On 3/8/18 4:24 PM, Kees Cook wrote:
>>
>> As Andy asked earlier, why not DYN too to catch PIE executables? Seems
>> like forcing the userspace helper to be non-PIE would defeat some of
>> the userspace defenses in use in most distros.
>
>
> because we don't add features without concrete users.

I disagree here.  If you're going to add a magic trick that triggers
an execve(), then I think you should either support *both* standard,
widely-used types of ELF programs or you should give a compelling use
case that works for ET_EXEC but not for ET_DYN.  Keep in mind that
many distros have a very strong preference for ET_DYN.

Or you could argue that ET_DYN requires tooling changes, but I think
it's awkward to ask the tooling to change in advance of the kernel
being willing to actually invoke the thing.  I'm not actually
convinced that any tooling changes would be needed, though.


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Andy Lutomirski
On Fri, Mar 9, 2018 at 12:24 AM, Kees Cook  wrote:
> How is this not marked [RFC]? :)
>
> On Mon, Mar 5, 2018 at 5:34 PM, Alexei Starovoitov  wrote:
>> As the first step in development of bpfilter project [1] the request_module()
>> code is extended to allow user mode helpers to be invoked. Idea is that
>> user mode helpers are built as part of the kernel build and installed as
>> traditional kernel modules with .ko file extension into distro specified
>> location, such that from a distribution point of view, they are no different
>> than regular kernel modules. Thus, allow request_module() logic to load such
>> user mode helper (umh) modules via:
>>
>>   request_module("foo") ->
>> call_umh("modprobe foo") ->
>>   sys_finit_module(FD of /lib/modules/.../foo.ko) ->
>> call_umh(struct file)
>
> Yikes. This means autoloaded arbitrary exec() now, instead of
> autoloading just loading arbitrary modules? That seems extremely bad.
> This just extends all the problems we've had with defining security
> boundaries with modules out to umh too. I would need some major
> convincing that this can be made safe.
>
> It's easy for container to trigger arbitrary module loading, so if it
> can find/use a similar one of the bugs we've seen in the past to
> redirect the module loading we don't just get new module-created
> attack surface added to the kernel, but we again get arbitrary ELF
> running in the root ns (not in the container). And in the insane case
> of a container with CAP_SYS_MODULE, this is a trivial escape without
> even needing to build a special kernel module: the root ns, running
> with all privileges runs an arbitrary ELF.
>
> As it stands, I have to NAK this. At the very least, you need to solve
> the execution environment problems here: the ELF should run with no
> greater privileges than what loaded the module, and very importantly,
> must not be allowed to bypass these checks through autoloading. What
> _triggered_ the autoload must be the environment, not the "modprobe",
> since that's running with full privileges. Basically, we need to fix
> module autoloading first, or at least a significant subset of the
> problems: https://lkml.org/lkml/2017/11/27/754

I disagree.  If we're going to somehow have ELF binaries that pretend
to be modules, then they should run with high privilege and, more
importantly, should run in a context independent of whoever triggered
autoloading.

Also, I don't see how this is any more exploitable than any other
init_module().  If someone has CAP_SYS_MODULE or autoload privileges
and they can trigger init_module() on a file, then they're granting
maximum privilege to the contents of that file.  So who cares if that
file is a kernel module or an ELF binary?

Alexei, can you give an example use case?  I'm sure it's upthread
somewhere, but I'm having trouble finding it.

Also, I just tested this concept a bit.  Depmod invoked explicitly on
an ET_EXEC with a .ko extension gets mad, but depmod -a on a kernel
that has a "module" like that seems to work fine.  Go figure.


Re: [PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing

2018-03-08 Thread Andy Lutomirski
On Thu, Mar 8, 2018 at 11:51 PM, Mickaël Salaün <m...@digikod.net> wrote:
>
> On 07/03/2018 02:21, Andy Lutomirski wrote:
>> On Tue, Mar 6, 2018 at 11:06 PM, Mickaël Salaün <m...@digikod.net> wrote:
>>>
>>> On 06/03/2018 23:46, Tycho Andersen wrote:
>>>> On Tue, Mar 06, 2018 at 10:33:17PM +, Andy Lutomirski wrote:
>>>>>>> Suppose I'm writing a container manager.  I want to run "mount" in the
>>>>>>> container, but I don't want to allow mount() in general and I want to
>>>>>>> emulate certain mount() actions.  I can write a filter that catches
>>>>>>> mount using seccomp and calls out to the container manager for help.
>>>>>>> This isn't theoretical -- Tycho wants *exactly* this use case to be
>>>>>>> supported.
>>>>>>
>>>>>> Well, I think this use case should be handled with something like
>>>>>> LD_PRELOAD and a helper library. FYI, I did something like this:
>>>>>> https://github.com/stemjail/stemshim
>>>>>
>>>>> I doubt that will work for containers.  Containers that use user
>>>>> namespaces and, for example, setuid programs aren't going to honor
>>>>> LD_PRELOAD.
>>>>
>>>> Or anything that calls syscalls directly, like go programs.
>>>
>>> That's why the vDSO-like approach. Enforcing an access control is not
>>> the issue here; patching a buggy userland (without patching its code) is
>>> the issue, isn't it?
>>>
>>> As far as I remember, the main problem is to handle file descriptors
>>> while "emulating" the kernel behavior. This can be done with a "shim"
>>> code mapped into every process. Chrome used something like this (in a
>>> previous sandbox mechanism) as a kind of emulation (with the current
>>> seccomp-bpf). I think it should be doable to replace the (userland)
>>> emulation code with an IPC wrapper receiving file descriptors through
>>> a UNIX socket.
>>>
>>
>> Can you explain exactly what you mean by "vDSO-like"?
>>
>> When a 64-bit program does a syscall, it just executes the SYSCALL
>> instruction.  The vDSO isn't involved at all.  32-bit programs usually
>> go through the vDSO, but not always.
>>
>> It could be possible to force-load a DSO into an entire container and
>> rig up seccomp to intercept all SYSCALLs not originating from the DSO
>> such that they merely redirect control to the DSO, but that seems
>> quite messy.
>
> The vDSO is code mapped into all processes. As you said, these processes
> may use it or not. What I was thinking about is to use the same concept,
> i.e. map "shim" code into each process pertaining to a particular
> hierarchy (the same way seccomp filters are inherited across processes).
> With a seccomp filter matching some syscall (e.g. mount, open), it is
> possible to jump back to the shim code thanks to SECCOMP_RET_TRAP. This
> shim code should then be able to emulate/patch what is needed, even
> faking a file opening by receiving a file descriptor through a UNIX
> socket. As the Chrome sandbox did, the seccomp filter may look at the
> calling address to allow the shim code to call syscalls without being
> caught, if needed. However, relying on SIGSYS may not fit with
> arbitrary code. A new SECCOMP_RET_EMULATE (?) could be used to jump
> to a specific process address, to emulate the syscall in an easier way
> than only relying on a {c,e}BPF program.
>

This could indeed be done, but I think that Tycho's approach is much
cleaner and probably faster.
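
For concreteness, the SECCOMP_RET_TRAP mechanics described above work
roughly like this on the userspace side: the trapped syscall is
delivered as SIGSYS, and a handler living in the shim can emulate it
and patch the return-value register. emulate_syscall() is assumed, and
the register names are x86-64/glibc specific:

#define _GNU_SOURCE
#include <signal.h>
#include <ucontext.h>

/* assumed: the shim's emulation routine, e.g. one that receives a
 * file descriptor over a UNIX socket and returns it as the result */
extern long emulate_syscall(long nr, ucontext_t *ctx);

static void sigsys_handler(int sig, siginfo_t *info, void *uctx)
{
	ucontext_t *ctx = uctx;

	/* info->si_syscall holds the trapped syscall number */
	ctx->uc_mcontext.gregs[REG_RAX] =
		emulate_syscall(info->si_syscall, ctx);
}

/* installed with sigaction(SIGSYS, ...) using SA_SIGINFO, paired
 * with a filter that returns SECCOMP_RET_TRAP for syscalls to
 * emulate */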


Re: [PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing

2018-03-06 Thread Andy Lutomirski
On Tue, Mar 6, 2018 at 11:06 PM, Mickaël Salaün <m...@digikod.net> wrote:
>
> On 06/03/2018 23:46, Tycho Andersen wrote:
>> On Tue, Mar 06, 2018 at 10:33:17PM +0000, Andy Lutomirski wrote:
>>>>> Suppose I'm writing a container manager.  I want to run "mount" in the
>>>>> container, but I don't want to allow mount() in general and I want to
>>>>> emulate certain mount() actions.  I can write a filter that catches
>>>>> mount using seccomp and calls out to the container manager for help.
>>>>> This isn't theoretical -- Tycho wants *exactly* this use case to be
>>>>> supported.
>>>>
>>>> Well, I think this use case should be handled with something like
>>>> LD_PRELOAD and a helper library. FYI, I did something like this:
>>>> https://github.com/stemjail/stemshim
>>>
>>> I doubt that will work for containers.  Containers that use user
>>> namespaces and, for example, setuid programs aren't going to honor
>>> LD_PRELOAD.
>>
>> Or anything that calls syscalls directly, like go programs.
>
> That's why the vDSO-like approach. Enforcing an access control is not
> the issue here; patching a buggy userland (without patching its code) is
> the issue, isn't it?
>
> As far as I remember, the main problem is to handle file descriptors
> while "emulating" the kernel behavior. This can be done with a "shim"
>> code mapped into every process. Chrome used something like this (in a
>> previous sandbox mechanism) as a kind of emulation (with the current
>> seccomp-bpf). I think it should be doable to replace the (userland)
>> emulation code with an IPC wrapper receiving file descriptors through
>> a UNIX socket.
>

Can you explain exactly what you mean by "vDSO-like"?

When a 64-bit program does a syscall, it just executes the SYSCALL
instruction.  The vDSO isn't involved at all.  32-bit programs usually
go through the vDSO, but not always.

It could be possible to force-load a DSO into an entire container and
rig up seccomp to intercept all SYSCALLs not originating from the DSO
such that they merely redirect control to the DSO, but that seems
quite messy.
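
A sketch of that rig, to show why it is messy: a cBPF filter keyed on
seccomp_data.instruction_pointer that traps anything not issued from
the force-loaded DSO. Only the low 32 bits of the IP are compared here
for brevity (a real filter must check both halves), the mapping range
is assumed to be known at filter-build time, and little-endian layout
is assumed:

#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>

/* assumed: where the injected DSO ended up (low 32 bits) */
#define SHIM_LO 0x10000000u
#define SHIM_HI 0x10010000u

struct sock_filter ip_filter[] = {
	/* A = low 32 bits of the instruction pointer */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
		 offsetof(struct seccomp_data, instruction_pointer)),
	/* inside [SHIM_LO, SHIM_HI) ? allow : trap to the DSO */
	BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, SHIM_LO, 0, 2),
	BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, SHIM_HI, 1, 0),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP),
};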


Re: [PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing

2018-03-06 Thread Andy Lutomirski
On Tue, Mar 6, 2018 at 10:25 PM, Mickaël Salaün <m...@digikod.net> wrote:
>
>
> On 28/02/2018 00:09, Andy Lutomirski wrote:
>> On Tue, Feb 27, 2018 at 10:03 PM, Mickaël Salaün <m...@digikod.net> wrote:
>>>
>>> On 27/02/2018 05:36, Andy Lutomirski wrote:
>>>> On Tue, Feb 27, 2018 at 12:41 AM, Mickaël Salaün <m...@digikod.net> wrote:
>>>>> Hi,
>>>>>
>>
>>>>>
>>>>> ## Why use the seccomp(2) syscall?
>>>>>
>>>>> Landlock uses the same semantics as seccomp to apply access rule
>>>>> restrictions. It adds a new layer of security for the current process
>>>>> which is inherited by its children. It makes sense to use a unique
>>>>> access-restricting syscall (that should be allowed by seccomp filters)
>>>>> which can only drop privileges. Moreover, a Landlock rule could come
>>>>> from outside a process (e.g.  passed through a UNIX socket). It is then
>>>>> useful to differentiate the creation/load of Landlock eBPF programs via
>>>>> bpf(2), from rule enforcement via seccomp(2).
>>>>
>>>> This seems like a weak argument to me.  Sure, this is a bit different
>>>> from seccomp(), and maybe shoving it into the seccomp() multiplexer is
>>>> awkward, but surely the bpf() multiplexer is even less applicable.
>>>
>>> I think using the seccomp syscall is fine, and everyone agreed on it.
>>>
>>
>> Ah, sorry, I completely misread what you wrote.  My apologies.  You
>> can disregard most of my email.
>>
>>>
>>>>
>>>> Also, looking forward, I think you're going to want a bunch of the
>>>> stuff that's under consideration as new seccomp features.  Tycho is
>>>> working on a "user notifier" feature for seccomp where, in addition to
>>>> accepting, rejecting, or kicking to ptrace, you can send a message to
>>>> the creator of the filter and wait for a reply.  I think that Landlock
>>>> will want exactly the same feature.
>>>
>>> I don't see why this would be useful at all here. Landlock does not
>>> filter at the syscall level but handles kernel objects and actions as
>>> an LSM does. That is the whole purpose of Landlock.
>>
>> Suppose I'm writing a container manager.  I want to run "mount" in the
>> container, but I don't want to allow mount() in general and I want to
>> emulate certain mount() actions.  I can write a filter that catches
>> mount using seccomp and calls out to the container manager for help.
>> This isn't theoretical -- Tycho wants *exactly* this use case to be
>> supported.
>
> Well, I think this use case should be handled with something like
> LD_PRELOAD and a helper library. FYI, I did something like this:
> https://github.com/stemjail/stemshim

I doubt that will work for containers.  Containers that use user
namespaces and, for example, setuid programs aren't going to honor
LD_PRELOAD.

>
> Otherwise, we should think about enabling a process to (dynamically)
> extend/patch the vDSO (similar to LD_PRELOAD but at the syscall level
> and working with static binaries) for a subset of processes (the same way
> seccomp filters are inherited). It may be more powerful and flexible
> than extending the kernel/seccomp to patch (buggy?) userland.

Egads!


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-06 Thread Andy Lutomirski
On Tue, Mar 6, 2018 at 1:34 AM, Alexei Starovoitov  wrote:
> As the first step in development of bpfilter project [1] the request_module()
> code is extended to allow user mode helpers to be invoked. The idea is that
> user mode helpers are built as part of the kernel build and installed as
> traditional kernel modules with .ko file extension into distro specified
> location, such that from a distribution point of view, they are no different
> than regular kernel modules. Thus, allow request_module() logic to load such
> user mode helper (umh) modules via:
>
>   request_module("foo") ->
> call_umh("modprobe foo") ->
>   sys_finit_module(FD of /lib/modules/.../foo.ko) ->
> call_umh(struct file)
>

I assume I'm missing some context here, but why does this need to be
handled by the kernel rather than, say, a change to how modprobe
works?  I imagine that usermode tooling needs to change regardless
because the existing tools may get rather confused if a .ko "module"
is really a dynamically linked program.  I notice that you're using
ET_EXEC in your example, and that will probably avoid problems, but I
imagine that some distros would much rather use ET_DYN.
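
The modprobe-side alternative hinted at above would only need to peek
at the ELF type before deciding what to do with the file; kernel
modules are ET_REL objects, so anything ET_EXEC/ET_DYN could simply be
exec'ed. A rough sketch (heavily simplified; real modprobe logic is
far more involved):

#include <elf.h>
#include <unistd.h>
#include <sys/syscall.h>

static int load_module_or_helper(const char *path, int fd)
{
	Elf64_Ehdr eh;

	if (pread(fd, &eh, sizeof(eh), 0) != (ssize_t)sizeof(eh))
		return -1;

	if (eh.e_type == ET_EXEC || eh.e_type == ET_DYN) {
		/* a umh "module": just run it */
		char *const argv[] = { (char *)path, NULL };
		return execv(path, argv);
	}
	/* ET_REL: a regular kernel module */
	return syscall(SYS_finit_module, fd, "", 0);
}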


Re: [net-next v3 1/2] bpf, seccomp: Add eBPF filter capabilities

2018-03-01 Thread Andy Lutomirski
On Mon, Feb 26, 2018 at 7:27 AM, Sargun Dhillon  wrote:
> This introduces the BPF_PROG_TYPE_SECCOMP bpf program type. It is meant
> to be used for seccomp filters as an alternative to cBPF filters. The
> program type has relatively limited capabilities in terms of helpers,
> but that can be extended later on.
>
> The eBPF code loading is separated from attachment of the filter, so
> a privileged user can load the filter, and pass it back to an
> unprivileged user who can attach it and use it at a later time.
>
> In order to attach the filter itself, you need to supply a flag to the
> seccomp syscall indicating that an eBPF filter is being attached, as
> opposed to a cBPF one. Verification occurs at program load time,
> so the user should only receive errors related to attachment.
> +static const struct bpf_func_proto *
> +seccomp_func_proto(enum bpf_func_id func_id)
> +{

return NULL unconditionally.  If you want different behavior, make it
a separate patch to be reviewed separately.
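
Concretely, the "start with nothing" version asked for here is just:

/* no helpers at all to begin with; each addition gets its own patch
 * and its own review */
static const struct bpf_func_proto *
seccomp_func_proto(enum bpf_func_id func_id)
{
	return NULL;
}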

> +   switch (func_id) {
> +   case BPF_FUNC_get_current_uid_gid:
> +   return &bpf_get_current_uid_gid_proto;

NAK.  Doesn't work right in containers.  *Maybe* it would be okay to
have a helper that returns the uid/gid in the namespace of the filter
creator, but that's not what this does.

> +   case BPF_FUNC_ktime_get_ns:
> +   return &bpf_ktime_get_ns_proto;

NAK.  This would need a very clear use case that makes it worthwhile.
As it stands, it interferes with CRIU for no obvious good reason.

> +   case BPF_FUNC_get_prandom_u32:
> +   return &bpf_get_prandom_u32_proto;

NAK.  This helper is cryptographically insecure in that it surely
allows one user to figure out the entire random number stream of
another and, worse, it could be abused to allow one user to *control*
the stream of another user.  Read the code in bpf_user_rnd_u32.

If there is a legitimate reason to allow random number generation in
seccomp eBPF programs, it can be its own patch, and it needs, at the
very least, the properties that a filter's stream of random bytes is
unpredictable and uncontrollable by anyone else.  Realistically, this
means you need the arch_random stuff or the actual get_random_bytes()
interface.

> +   case BPF_FUNC_get_current_pid_tgid:
> +   return &bpf_get_current_pid_tgid_proto;

NAK for the same reason as get_current_uid_gid.

But Tycho: would hooking user notifiers in right here work for you?
As I see it, this would be the best justification for seccomp eBPF.

--Andy


Re: [net-next v3 0/2] eBPF seccomp filters

2018-03-01 Thread Andy Lutomirski
On Thu, Mar 1, 2018 at 9:51 PM, Sargun Dhillon <sar...@sargun.me> wrote:
> On Thu, Mar 1, 2018 at 9:44 AM, Andy Lutomirski <l...@amacapital.net> wrote:
>> On Wed, Feb 28, 2018 at 7:56 PM, Daniel Borkmann <dan...@iogearbox.net> 
>> wrote:
>>> On 02/28/2018 12:55 AM, chris hyser wrote:
>>>>> On 02/27/2018 04:58 PM, Daniel Borkmann wrote: >> On 02/27/2018 05:59 PM, 
>>>>> chris hyser wrote:
>>>>>>> On 02/27/2018 11:00 AM, Kees Cook wrote:
>>>>>>>> On Tue, Feb 27, 2018 at 6:53 AM, chris hyser <chris.hy...@oracle.com> 
>>>>>>>> wrote:
>>>>>>>>> On 02/26/2018 11:38 PM, Kees Cook wrote:
>>>>>>>>>> On Mon, Feb 26, 2018 at 8:19 PM, Andy Lutomirski 
>>>>>>>>>> <l...@amacapital.net>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> 3. Straight-up bugs.  Those are exactly as problematic as verifier
>>>>>>>>>>> bugs in any other unprivileged eBPF program type, right?  I don't 
>>>>>>>>>>> see
>>>>>>>>>>> why seccomp is special here.
>>>>>>>>>>
>>>>>>>>>> My concern is more about unintended design mistakes or other feature
>>>>>>>>>> creep with side-effects, especially when it comes to privileges and
>>>>>>>>>> synchronization. Getting no-new-privs done correctly, for example,
>>>>>>>>>> took some careful thought and discussion, and I'm shy from how 
>>>>>>>>>> painful
>>>>>>>>>> TSYNC was on the process locking side, and eBPF has had some rather
>>>>>>>>>> ugly flaws in the past (and recently: it was nice to be able to say
>>>>>>>>>> for Spectre that seccomp filters couldn't be constructed to make
>>>>>>>>>> attacks but eBPF could). Adding the complexity needs to be worth the
>>>>>>
>>>>>> Well, not really. One part of all the Spectre mitigations that went 
>>>>>> upstream
>>>>>> from BPF side was to have an option to remove interpreter entirely and 
>>>>>> that
>>>>>> also relates to seccomp eventually. But other than that an attacker might
>>>>>> potentially find as well useful gadgets inside seccomp or any other code
>>>>>> that is inside the kernel, so it's not a strict necessity either.
>>>>>>
>>>>>>>>>> gain. I'm on board for doing it, I just want to be careful. :)
>>>>>>>>>
>>>>>>>>> Another option might be to remove c/eBPF from the equation all 
>>>>>>>>> together.
>>>>>>>>> c/eBPF allows flexibility and that almost always comes at the cost of
>>>>>>>>> additional security risk. Seccomp is for enhanced security yes? How 
>>>>>>>>> about a
>>>>>>>>> new seccomp mode that passes in something like a bit vector or 
>>>>>>>>> hashmap for
>>>>>>>>> "simple" white/black list checks validated by kernel code, versus user
>>>>>>>>> provided interpreted code? Of course this removes a fair number of 
>>>>>>>>> things
>>>>>>>>> you can currently do or would be able to do with eBPF. Of course, 
>>>>>>>>> restated
>>>>>>>>> from a security point of view, this removes a fair number of things an
>>>>>>>>> _attacker_ can do. Presumably the performance improvement would also 
>>>>>>>>> be
>>>>>>>>> significant.
>>>>>>
>>>>>> Good luck with not breaking existing applications relying on seccomp out
>>>>>> there.
>>>>>
>>>>> This wasn't in the context of an implementation proposal, but the 
>>>>> assumption would be to add this in addition to the old way. Now, does 
>>>>> that make sense to do? That is the discussion.
>>>
>>> I see; didn't read that out from the above when you also mentioned removing
>>> cBPF, but fair enough.
>>>
>>>>>>>>> Is this an idea worth prototyping?
>>>>

Re: [net-next v3 0/2] eBPF seccomp filters

2018-03-01 Thread Andy Lutomirski
On Wed, Feb 28, 2018 at 7:56 PM, Daniel Borkmann <dan...@iogearbox.net> wrote:
> On 02/28/2018 12:55 AM, chris hyser wrote:
>>> On 02/27/2018 04:58 PM, Daniel Borkmann wrote: >> On 02/27/2018 05:59 PM, 
>>> chris hyser wrote:
>>>>> On 02/27/2018 11:00 AM, Kees Cook wrote:
>>>>>> On Tue, Feb 27, 2018 at 6:53 AM, chris hyser <chris.hy...@oracle.com> 
>>>>>> wrote:
>>>>>>> On 02/26/2018 11:38 PM, Kees Cook wrote:
>>>>>>>> On Mon, Feb 26, 2018 at 8:19 PM, Andy Lutomirski <l...@amacapital.net>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> 3. Straight-up bugs.  Those are exactly as problematic as verifier
>>>>>>>>> bugs in any other unprivileged eBPF program type, right?  I don't see
>>>>>>>>> why seccomp is special here.
>>>>>>>>
>>>>>>>> My concern is more about unintended design mistakes or other feature
>>>>>>>> creep with side-effects, especially when it comes to privileges and
>>>>>>>> synchronization. Getting no-new-privs done correctly, for example,
>>>>>>>> took some careful thought and discussion, and I'm shy from how painful
>>>>>>>> TSYNC was on the process locking side, and eBPF has had some rather
>>>>>>>> ugly flaws in the past (and recently: it was nice to be able to say
>>>>>>>> for Spectre that seccomp filters couldn't be constructed to make
>>>>>>>> attacks but eBPF could). Adding the complexity needs to be worth the
>>>>
>>>> Well, not really. One part of all the Spectre mitigations that went 
>>>> upstream
>>>> from BPF side was to have an option to remove interpreter entirely and that
>>>> also relates to seccomp eventually. But other than that an attacker might
>>>> potentially find as well useful gadgets inside seccomp or any other code
>>>> that is inside the kernel, so it's not a strict necessity either.
>>>>
>>>>>>>> gain. I'm on board for doing it, I just want to be careful. :)
>>>>>>>
>>>>>>> Another option might be to remove c/eBPF from the equation all together.
>>>>>>> c/eBPF allows flexibility and that almost always comes at the cost of
>>>>>>> additional security risk. Seccomp is for enhanced security yes? How 
>>>>>>> about a
>>>>>>> new seccomp mode that passes in something like a bit vector or hashmap 
>>>>>>> for
>>>>>>> "simple" white/black list checks validated by kernel code, versus user
>>>>>>> provided interpreted code? Of course this removes a fair number of 
>>>>>>> things
>>>>>>> you can currently do or would be able to do with eBPF. Of course, 
>>>>>>> restated
>>>>>>> from a security point of view, this removes a fair number of things an
>>>>>>> _attacker_ can do. Presumably the performance improvement would also be
>>>>>>> significant.
>>>>
>>>> Good luck with not breaking existing applications relying on seccomp out
>>>> there.
>>>
>>> This wasn't in the context of an implementation proposal, but the 
>>> assumption would be to add this in addition to the old way. Now, does that 
>>> make sense to do? That is the discussion.
>
> I see; didn't read that out from the above when you also mentioned removing
> cBPF, but fair enough.
>
>>>>>>> Is this an idea worth prototyping?
>>>>>>
>>>>>> That was the original prototype for seccomp-filter. :) The discussion
>>>>>> around that from years ago basically boiled down to it being
>>>>>> inflexible. Given all the things people want to do at syscall time,
>>>>>> that continues to be true. So true, in fact, that here we are now,
>>>>>> trying to move to eBPF from cBPF. ;)
>>>>
>>>> Right, agree. cBPF is also pretty much frozen these days and aside from
>>>> that, seccomp/BPF also just uses a proper subset of it. I wouldn't mind
>>>> doing something similar for eBPF side as long as this is reasonably
>>>> maintainable and not making BPF core more complex, but most of it can
>>>> already be set in the verifier anyway based on prog type

Re: [PATCH bpf-next v8 08/11] landlock: Add ptrace restrictions

2018-02-27 Thread Andy Lutomirski
On Wed, Feb 28, 2018 at 12:00 AM, Mickaël Salaün <m...@digikod.net> wrote:
>
> On 28/02/2018 00:23, Andy Lutomirski wrote:
>> On Tue, Feb 27, 2018 at 11:02 PM, Andy Lutomirski <l...@kernel.org> wrote:
>>> On Tue, Feb 27, 2018 at 10:14 PM, Mickaël Salaün <m...@digikod.net> wrote:
>>>>
>>>
>>> I think you're wrong here.  Any sane container trying to use Landlock
>>> like this would also create a PID namespace.  Problem solved.  I still
>>> think you should drop this patch.
>
> Containers are one use case; another is built-in sandboxing (e.g. for a web
> browser…), and another is sandbox managers (e.g. Firejail,
> Bubblewrap, Flatpak…). In some of these use cases, especially from a
> developer point of view, you may want/need to debug your applications
> (without requiring root). For nested Landlock access controls
> (e.g. container + user session + web browser), creating a PID namespace
> may not be allowed, but you still want to have a meaningful
> access control.
>

The consideration should be exactly the same as for normal seccomp.
If I'm in a container (using PID namespaces + seccomp) and run a web
browser, I can debug the browser.

If there's a real use case for adding this type of automatic ptrace
protection, then by all means, let's add it as a general seccomp
feature.
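
For comparison, this kind of ptrace protection is already expressible
with plain seccomp today; a minimal sketch of a filter that denies
ptrace(2) for a process and its children (the arch check is omitted
for brevity; real filters must validate seccomp_data.arch first):

#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>
#include <stddef.h>
#include <errno.h>

struct sock_filter deny_ptrace[] = {
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
		 offsetof(struct seccomp_data, nr)),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_ptrace, 0, 1),
	BPF_STMT(BPF_RET | BPF_K,
		 SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};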


Re: [PATCH bpf-next v8 08/11] landlock: Add ptrace restrictions

2018-02-27 Thread Andy Lutomirski
On Tue, Feb 27, 2018 at 11:02 PM, Andy Lutomirski <l...@kernel.org> wrote:
> On Tue, Feb 27, 2018 at 10:14 PM, Mickaël Salaün <m...@digikod.net> wrote:
>>
>> On 27/02/2018 06:01, Andy Lutomirski wrote:
>>>
>>>
>>>> On Feb 26, 2018, at 8:17 PM, Andy Lutomirski <l...@amacapital.net> wrote:
>>>>
>>>>> On Tue, Feb 27, 2018 at 12:41 AM, Mickaël Salaün <m...@digikod.net> wrote:
>>>>> A landlocked process has fewer privileges than a non-landlocked process
>>>>> and must then be subject to additional restrictions when manipulating
>>>>> processes. To be allowed to use ptrace(2) and related syscalls on a
>>>>> target process, a landlocked process must have a subset of the target
>>>>> process' rules.
>>>>>
>>>>> Signed-off-by: Mickaël Salaün <m...@digikod.net>
>>>>> Cc: Alexei Starovoitov <a...@kernel.org>
>>>>> Cc: Andy Lutomirski <l...@amacapital.net>
>>>>> Cc: Daniel Borkmann <dan...@iogearbox.net>
>>>>> Cc: David S. Miller <da...@davemloft.net>
>>>>> Cc: James Morris <james.l.mor...@oracle.com>
>>>>> Cc: Kees Cook <keesc...@chromium.org>
>>>>> Cc: Serge E. Hallyn <se...@hallyn.com>
>>>>> ---
>>>>>
>>>>> Changes since v6:
>>>>> * factor out ptrace check
>>>>> * constify pointers
>>>>> * cleanup headers
>>>>> * use the new security_add_hooks()
>>>>> ---
>>>>> security/landlock/Makefile   |   2 +-
>>>>> security/landlock/hooks_ptrace.c | 124 
>>>>> +++
>>>>> security/landlock/hooks_ptrace.h |  11 
>>>>> security/landlock/init.c |   2 +
>>>>> 4 files changed, 138 insertions(+), 1 deletion(-)
>>>>> create mode 100644 security/landlock/hooks_ptrace.c
>>>>> create mode 100644 security/landlock/hooks_ptrace.h
>>>>>
>>>>> diff --git a/security/landlock/Makefile b/security/landlock/Makefile
>>>>> index d0f532a93b4e..605504d852d3 100644
>>>>> --- a/security/landlock/Makefile
>>>>> +++ b/security/landlock/Makefile
>>>>> @@ -3,4 +3,4 @@ obj-$(CONFIG_SECURITY_LANDLOCK) := landlock.o
>>>>> landlock-y := init.o chain.o task.o \
>>>>>tag.o tag_fs.o \
>>>>>enforce.o enforce_seccomp.o \
>>>>> -   hooks.o hooks_cred.o hooks_fs.o
>>>>> +   hooks.o hooks_cred.o hooks_fs.o hooks_ptrace.o
>>>>> diff --git a/security/landlock/hooks_ptrace.c 
>>>>> b/security/landlock/hooks_ptrace.c
>>>>> new file mode 100644
>>>>> index ..f1b977b9c808
>>>>> --- /dev/null
>>>>> +++ b/security/landlock/hooks_ptrace.c
>>>>> @@ -0,0 +1,124 @@
>>>>> +/*
>>>>> + * Landlock LSM - ptrace hooks
>>>>> + *
>>>>> + * Copyright © 2017 Mickaël Salaün <m...@digikod.net>
>>>>> + *
>>>>> + * This program is free software; you can redistribute it and/or modify
>>>>> + * it under the terms of the GNU General Public License version 2, as
>>>>> + * published by the Free Software Foundation.
>>>>> + */
>>>>> +
>>>>> +#include 
>>>>> +#include 
>>>>> +#include <linux/kernel.h> /* ARRAY_SIZE */
>>>>> +#include 
>>>>> +#include <linux/sched.h> /* struct task_struct */
>>>>> +#include 
>>>>> +
>>>>> +#include "common.h" /* struct landlock_prog_set */
>>>>> +#include "hooks.h" /* landlocked() */
>>>>> +#include "hooks_ptrace.h"
>>>>> +
>>>>> +static bool progs_are_subset(const struct landlock_prog_set *parent,
>>>>> +   const struct landlock_prog_set *child)
>>>>> +{
>>>>> +   size_t i;
>>>>> +
>>>>> +   if (!parent || !child)
>>>>> +   return false;
>>>>> +   if (parent == child)
>>>>> +   return true;
>>>>> +
>>>>> +   for (i = 0; i < ARRAY_SIZE(child->programs); i++) {
>>>>
>>>> ARRAY_SIZE(child->programs) seems misleading.  Is there no define
>>>> NUM_LANDLOCK_PRO

Re: [net-next v3 0/2] eBPF seccomp filters

2018-02-27 Thread Andy Lutomirski
On Tue, Feb 27, 2018 at 11:10 PM, Mickaël Salaün <m...@digikod.net> wrote:
>
> On 27/02/2018 05:54, Andy Lutomirski wrote:
>>
>>
>>> On Feb 26, 2018, at 8:38 PM, Kees Cook <keesc...@chromium.org> wrote:
>>>
>>> On Mon, Feb 26, 2018 at 8:19 PM, Andy Lutomirski <l...@amacapital.net> 
>>> wrote:
>>>>> On Feb 26, 2018, at 3:20 PM, Kees Cook <keesc...@chromium.org> wrote:
>>>>>
>>>>> On Mon, Feb 26, 2018 at 3:04 PM, Alexei Starovoitov
>>>>> <alexei.starovoi...@gmail.com> wrote:
>>>>>>> On Mon, Feb 26, 2018 at 07:26:54AM +, Sargun Dhillon wrote:
>>>>>>> This patchset enables seccomp filters to be written in eBPF. Although, 
>>>>>>> this
>>>>>>> [...]
>>>>>> The main statement I want to hear from seccomp maintainers before
>>>>>> proceeding any further on this that enabling eBPF in seccomp won't lead
>>>>>> to seccomp folks arguing against changes in bpf core (like verifier)
>>>>>> just because it's used by seccomp.
>>>>>> It must be spelled out in the commit log with explicit Ack.
>>>>>
>>>>> The primary thing I'm concerned about with eBPF and seccomp is
>>>>> side-effects from eBPF programs running at syscall time. This is an
>>>>> extremely sensitive area, and I want to be sure there won't be
>>>>> feature-creep here that leads to seccomp getting into a bad state.
>>>>>
>>>>> As long as seccomp can continue have its own verifier, I *think* this
>>>>> will be fine, though, again I remain concerned about maps, etc. I'm
>>>>> still reviewing these patches and how they might provide overlap with
>>>>> Tycho's needs too, etc.
>>>>
>>>> I'm not sure I see this as a huge problem.  As far as I can see, there
>>>> are three ways that a verifier change could be problematic:
>>>>
>>>> 1. Addition of a new type of map.  But seccomp would just not allow
>>>> new map types by default, right?
>>>>
>>>> 2. Addition of a new BPF_CALLable helper.  Seccomp wants a way to
>>>> whitelist BPF_CALL targets.  That should be straightforward.
>>>
>>> Yup, agreed on 1 and 2.
>>>
>>>> 3. Straight-up bugs.  Those are exactly as problematic as verifier
>>>> bugs in any other unprivileged eBPF program type, right?  I don't see
>>>> why seccomp is special here.
>>>
>>> My concern is more about unintended design mistakes or other feature
>>> creep with side-effects, especially when it comes to privileges and
>>> synchronization. Getting no-new-privs done correctly, for example,
>>> took some careful thought and discussion, and I'm shy from how painful
>>> TSYNC was on the process locking side, and eBPF has had some rather
>>> ugly flaws in the past (and recently: it was nice to be able to say
>>> for Spectre that seccomp filters couldn't be constructed to make
>>> attacks but eBPF could). Adding the complexity needs to be worth the
>>> gain. I'm on board for doing it, I just want to be careful. :)
>>>
>>
>> I agree.  I think that, if we do this right, we get a clean version of 
>> Tycho's notifiers.  We can also very easily build on that to send a 
>> non-blocking message to the notifier fd, which gets us a version of seccomp 
>> logging that works for things like Chromium and even strace.  I think this 
>> is worth it.
>>
>> I also think this sort of argument is why Mickaël's privileged-first 
>> Landlock is the wrong approach.  By getting the unprivileged parts right 
>> from day one, we can carefully extend the mechanism and keep it usable by 
>> unprivileged apps.  But, if we'd started as root-only, fixing up everything 
>> needed to make it safe for unprivileged users after the fact would have been 
>> quite messy.
>
> We agreed (including Kees and you, at the Santa Fe LPC) to limit the use
> of Landlock to CAP_SYS_ADMIN at first. It is an artificial limitation
> that can be lifted by removing three explicit checks/lines. Landlock
> was designed for unprivileged use from day one and it is still the goal.

Indeed.  I was obviously too tired to read your email intelligently
last night.  Sorry.


Re: [PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing

2018-02-27 Thread Andy Lutomirski
On Tue, Feb 27, 2018 at 10:03 PM, Mickaël Salaün <m...@digikod.net> wrote:
>
> On 27/02/2018 05:36, Andy Lutomirski wrote:
>> On Tue, Feb 27, 2018 at 12:41 AM, Mickaël Salaün <m...@digikod.net> wrote:
>>> Hi,
>>>

>>>
>>> ## Why use the seccomp(2) syscall?
>>>
>>> Landlock uses the same semantics as seccomp to apply access rule
>>> restrictions. It adds a new layer of security for the current process
>>> which is inherited by its children. It makes sense to use a unique
>>> access-restricting syscall (that should be allowed by seccomp filters)
>>> which can only drop privileges. Moreover, a Landlock rule could come
>>> from outside a process (e.g.  passed through a UNIX socket). It is then
>>> useful to differentiate the creation/load of Landlock eBPF programs via
>>> bpf(2), from rule enforcement via seccomp(2).
>>
>> This seems like a weak argument to me.  Sure, this is a bit different
>> from seccomp(), and maybe shoving it into the seccomp() multiplexer is
>> awkward, but surely the bpf() multiplexer is even less applicable.
>
> I think using the seccomp syscall is fine, and everyone agreed on it.
>

Ah, sorry, I completely misread what you wrote.  My apologies.  You
can disregard most of my email.

>
>>
>> Also, looking forward, I think you're going to want a bunch of the
>> stuff that's under consideration as new seccomp features.  Tycho is
>> working on a "user notifier" feature for seccomp where, in addition to
>> accepting, rejecting, or kicking to ptrace, you can send a message to
>> the creator of the filter and wait for a reply.  I think that Landlock
>> will want exactly the same feature.
>
> I don't see why this would be useful at all here. Landlock does not
> filter at the syscall level but handles kernel objects and actions as
> an LSM does. That is the whole purpose of Landlock.

Suppose I'm writing a container manager.  I want to run "mount" in the
container, but I don't want to allow mount() in general and I want to
emulate certain mount() actions.  I can write a filter that catches
mount using seccomp and calls out to the container manager for help.
This isn't theoretical -- Tycho wants *exactly* this use case to be
supported.

But using seccomp for this is indeed annoying.  It would be nice to
use Landlock's ability to filter based on the filesystem type, for
example.  So Tycho could write a Landlock rule like:

bool filter_mount(...)
{
  if (path needs emulation)
    call_user_notifier();
}

And it should work.

This means that, if both seccomp user notifiers and Landlock make it
upstream, then there should probably be a way to have a user notifier
bound to a seccomp filter and a set of landlock filters.
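
A slightly more concrete (but entirely hypothetical) rendering of that
rule, using the v8 convention that returning zero allows the action;
the section name, the context type, needs_emulation() and
bpf_landlock_notify() are all stand-ins for whatever a combined
mechanism would actually provide:

#define SEC(name) __attribute__((section(name), used))

SEC("landlock/fs_pick")
int filter_mount(struct landlock_ctx_fs_pick *ctx)
{
	if (needs_emulation(ctx))
		/* block until the container manager answers */
		return bpf_landlock_notify(ctx);
	return 0;	/* zero == allowed */
}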


Re: [PATCH bpf-next v8 08/11] landlock: Add ptrace restrictions

2018-02-27 Thread Andy Lutomirski
On Tue, Feb 27, 2018 at 10:14 PM, Mickaël Salaün <m...@digikod.net> wrote:
>
> On 27/02/2018 06:01, Andy Lutomirski wrote:
>>
>>
>>> On Feb 26, 2018, at 8:17 PM, Andy Lutomirski <l...@amacapital.net> wrote:
>>>
>>>> On Tue, Feb 27, 2018 at 12:41 AM, Mickaël Salaün <m...@digikod.net> wrote:
>>>> A landlocked process has fewer privileges than a non-landlocked process
>>>> and must then be subject to additional restrictions when manipulating
>>>> processes. To be allowed to use ptrace(2) and related syscalls on a
>>>> target process, a landlocked process must have a subset of the target
>>>> process' rules.
>>>>
>>>> Signed-off-by: Mickaël Salaün <m...@digikod.net>
>>>> Cc: Alexei Starovoitov <a...@kernel.org>
>>>> Cc: Andy Lutomirski <l...@amacapital.net>
>>>> Cc: Daniel Borkmann <dan...@iogearbox.net>
>>>> Cc: David S. Miller <da...@davemloft.net>
>>>> Cc: James Morris <james.l.mor...@oracle.com>
>>>> Cc: Kees Cook <keesc...@chromium.org>
>>>> Cc: Serge E. Hallyn <se...@hallyn.com>
>>>> ---
>>>>
>>>> Changes since v6:
>>>> * factor out ptrace check
>>>> * constify pointers
>>>> * cleanup headers
>>>> * use the new security_add_hooks()
>>>> ---
>>>> security/landlock/Makefile   |   2 +-
>>>> security/landlock/hooks_ptrace.c | 124 
>>>> +++
>>>> security/landlock/hooks_ptrace.h |  11 
>>>> security/landlock/init.c |   2 +
>>>> 4 files changed, 138 insertions(+), 1 deletion(-)
>>>> create mode 100644 security/landlock/hooks_ptrace.c
>>>> create mode 100644 security/landlock/hooks_ptrace.h
>>>>
>>>> diff --git a/security/landlock/Makefile b/security/landlock/Makefile
>>>> index d0f532a93b4e..605504d852d3 100644
>>>> --- a/security/landlock/Makefile
>>>> +++ b/security/landlock/Makefile
>>>> @@ -3,4 +3,4 @@ obj-$(CONFIG_SECURITY_LANDLOCK) := landlock.o
>>>> landlock-y := init.o chain.o task.o \
>>>>tag.o tag_fs.o \
>>>>enforce.o enforce_seccomp.o \
>>>> -   hooks.o hooks_cred.o hooks_fs.o
>>>> +   hooks.o hooks_cred.o hooks_fs.o hooks_ptrace.o
>>>> diff --git a/security/landlock/hooks_ptrace.c 
>>>> b/security/landlock/hooks_ptrace.c
>>>> new file mode 100644
>>>> index ..f1b977b9c808
>>>> --- /dev/null
>>>> +++ b/security/landlock/hooks_ptrace.c
>>>> @@ -0,0 +1,124 @@
>>>> +/*
>>>> + * Landlock LSM - ptrace hooks
>>>> + *
>>>> + * Copyright © 2017 Mickaël Salaün <m...@digikod.net>
>>>> + *
>>>> + * This program is free software; you can redistribute it and/or modify
>>>> + * it under the terms of the GNU General Public License version 2, as
>>>> + * published by the Free Software Foundation.
>>>> + */
>>>> +
>>>> +#include 
>>>> +#include 
>>>> +#include <linux/kernel.h> /* ARRAY_SIZE */
>>>> +#include 
>>>> +#include <linux/sched.h> /* struct task_struct */
>>>> +#include 
>>>> +
>>>> +#include "common.h" /* struct landlock_prog_set */
>>>> +#include "hooks.h" /* landlocked() */
>>>> +#include "hooks_ptrace.h"
>>>> +
>>>> +static bool progs_are_subset(const struct landlock_prog_set *parent,
>>>> +   const struct landlock_prog_set *child)
>>>> +{
>>>> +   size_t i;
>>>> +
>>>> +   if (!parent || !child)
>>>> +   return false;
>>>> +   if (parent == child)
>>>> +   return true;
>>>> +
>>>> +   for (i = 0; i < ARRAY_SIZE(child->programs); i++) {
>>>
>>> ARRAY_SIZE(child->programs) seems misleading.  Is there no define
>>> NUM_LANDLOCK_PROG_TYPES or similar?
>>>
>>>> +   struct landlock_prog_list *walker;
>>>> +   bool found_parent = false;
>>>> +
>>>> +   if (!parent->programs[i])
>>>> +   continue;
>>>> +   for (walker = child->programs[i]; walker;
>>>> +   walker = walker->prev) {
&

Re: [PATCH bpf-next v8 05/11] seccomp,landlock: Enforce Landlock programs per process hierarchy

2018-02-27 Thread Andy Lutomirski
On Tue, Feb 27, 2018 at 5:30 PM, Casey Schaufler <ca...@schaufler-ca.com> wrote:
> On 2/27/2018 8:39 AM, Andy Lutomirski wrote:
>> On Tue, Feb 27, 2018 at 5:32 AM, Alexei Starovoitov
>> <alexei.starovoi...@gmail.com> wrote:
>>> [ Snip ]
>> An earlier version of the patch set used the seccomp filter chain.
>> Mickaël, what exactly was wrong with that approach other than that the
>> seccomp() syscall was awkward for you to use?  You could add a
>> seccomp_add_landlock_rule() syscall if you needed to.
>>
>> As a side comment, why is this an LSM at all, let alone a non-stacking
>> LSM?  It would make a lot more sense to me to make Landlock depend on
>> having LSMs configured in but to call the landlock hooks directly from
>> the security_xyz() hooks.
>
> Please, no. It is my serious intention to have at least the
> infrastructure blob management in place within a release or two, and
> I think that's all Landlock needs. The security_xyz() hooks are
> sufficiently hackish as it is without unnecessarily adding more
> special cases.
>
>

What do you mean by "infrastructure blob management"?


Re: [PATCH bpf-next v8 05/11] seccomp,landlock: Enforce Landlock programs per process hierarchy

2018-02-27 Thread Andy Lutomirski
On Tue, Feb 27, 2018 at 5:32 AM, Alexei Starovoitov
<alexei.starovoi...@gmail.com> wrote:
> On Tue, Feb 27, 2018 at 05:20:55AM +0000, Andy Lutomirski wrote:
>> On Tue, Feb 27, 2018 at 4:54 AM, Alexei Starovoitov
>> <alexei.starovoi...@gmail.com> wrote:
>> > On Tue, Feb 27, 2018 at 04:40:34AM +, Andy Lutomirski wrote:
>> >> On Tue, Feb 27, 2018 at 2:08 AM, Alexei Starovoitov
>> >> <alexei.starovoi...@gmail.com> wrote:
>> >> > On Tue, Feb 27, 2018 at 01:41:15AM +0100, Mickaël Salaün wrote:
>> >> >> The seccomp(2) syscall can be used by a task to apply a Landlock 
>> >> >> program
>> >> >> to itself. As a seccomp filter, a Landlock program is enforced for the
>> >> >> current task and all its future children. A program is immutable and a
>> >> >> task can only add new restricting programs to itself, forming a list of
>> >> >> programs.
>> >> >>
>> >> >> A Landlock program is tied to a Landlock hook. If the action on a 
>> >> >> kernel
>> >> >> object is allowed by the other Linux security mechanisms (e.g. DAC,
>> >> >> capabilities, other LSM), then a Landlock hook related to this kind of
>> >> >> object is triggered. The list of programs for this hook is then
>> >> >> evaluated. Each program returns a 32-bit value which can deny the action
>> >> >> on a kernel object with a non-zero value. If every program of the list
>> >> >> returns zero, then the action on the object is allowed.
>> >> >>
>> >> >> Multiple Landlock programs can be chained to share a 64-bit value for a
>> >> >> call chain (e.g. evaluating multiple elements of a file path).  This
>> >> >> chaining is restricted when a process constructs this chain by loading a
>> >> >> program, but additional checks are performed when it requests to apply
>> >> >> this chain of programs to itself.  The restrictions ensure that it is
>> >> >> not possible to call multiple programs in a way that would imply
>> >> >> handling multiple shared values (i.e. cookies) for one chain.  For now,
>> >> >> only a fs_pick program can be chained to the same type of program,
>> >> >> because it may make sense if they have different triggers (cf. next
>> >> >> commits).  These restrictions still allow reusing Landlock programs in
>> >> >> a safe way (e.g. use the same loaded fs_walk program with multiple
>> >> >> chains of fs_pick programs).
>> >> >>
>> >> >> Signed-off-by: Mickaël Salaün <m...@digikod.net>
>> >> >
>> >> > ...
>> >> >
>> >> >> +struct landlock_prog_set *landlock_prepend_prog(
>> >> >> + struct landlock_prog_set *current_prog_set,
>> >> >> + struct bpf_prog *prog)
>> >> >> +{
>> >> >> + struct landlock_prog_set *new_prog_set = current_prog_set;
>> >> >> + unsigned long pages;
>> >> >> + int err;
>> >> >> + size_t i;
>> >> >> + struct landlock_prog_set tmp_prog_set = {};
>> >> >> +
>> >> >> + if (prog->type != BPF_PROG_TYPE_LANDLOCK_HOOK)
>> >> >> + return ERR_PTR(-EINVAL);
>> >> >> +
>> >> >> + /* validate memory size allocation */
>> >> >> + pages = prog->pages;
>> >> >> + if (current_prog_set) {
>> >> >> + size_t i;
>> >> >> +
>> >> >> + for (i = 0; i < ARRAY_SIZE(current_prog_set->programs); 
>> >> >> i++) {
>> >> >> + struct landlock_prog_list *walker_p;
>> >> >> +
>> >> >> + for (walker_p = current_prog_set->programs[i];
>> >> >> + walker_p; walker_p = 
>> >> >> walker_p->prev)
>> >> >> + pages += walker_p->prog->pages;
>> >> >> + }
>> >> >> + /* count a struct landlock_prog_set if we need to 
>> >> >> allocate one */
>> >> >> + if (refcount_read(&current_prog_set->usage) !

Re: [PATCH bpf-next v8 05/11] seccomp,landlock: Enforce Landlock programs per process hierarchy

2018-02-26 Thread Andy Lutomirski
On Tue, Feb 27, 2018 at 4:54 AM, Alexei Starovoitov
<alexei.starovoi...@gmail.com> wrote:
> On Tue, Feb 27, 2018 at 04:40:34AM +0000, Andy Lutomirski wrote:
>> On Tue, Feb 27, 2018 at 2:08 AM, Alexei Starovoitov
>> <alexei.starovoi...@gmail.com> wrote:
>> > On Tue, Feb 27, 2018 at 01:41:15AM +0100, Mickaël Salaün wrote:
>> >> The seccomp(2) syscall can be used by a task to apply a Landlock program
>> >> to itself. As a seccomp filter, a Landlock program is enforced for the
>> >> current task and all its future children. A program is immutable and a
>> >> task can only add new restricting programs to itself, forming a list of
>> >> programs.
>> >>
>> >> A Landlock program is tied to a Landlock hook. If the action on a kernel
>> >> object is allowed by the other Linux security mechanisms (e.g. DAC,
>> >> capabilities, other LSM), then a Landlock hook related to this kind of
>> >> object is triggered. The list of programs for this hook is then
>> >> evaluated. Each program returns a 32-bit value which can deny the action
>> >> on a kernel object with a non-zero value. If every program of the list
>> >> returns zero, then the action on the object is allowed.
>> >>
>> >> Multiple Landlock programs can be chained to share a 64-bit value for a
>> >> call chain (e.g. evaluating multiple elements of a file path).  This
>> >> chaining is restricted when a process constructs this chain by loading a
>> >> program, but additional checks are performed when it requests to apply
>> >> this chain of programs to itself.  The restrictions ensure that it is
>> >> not possible to call multiple programs in a way that would imply
>> >> handling multiple shared values (i.e. cookies) for one chain.  For now,
>> >> only a fs_pick program can be chained to the same type of program,
>> >> because it may make sense if they have different triggers (cf. next
>> >> commits).  These restrictions still allow reusing Landlock programs in
>> >> a safe way (e.g. use the same loaded fs_walk program with multiple
>> >> chains of fs_pick programs).
>> >>
>> >> Signed-off-by: Mickaël Salaün <m...@digikod.net>
>> >
>> > ...
>> >
>> >> +struct landlock_prog_set *landlock_prepend_prog(
>> >> + struct landlock_prog_set *current_prog_set,
>> >> + struct bpf_prog *prog)
>> >> +{
>> >> + struct landlock_prog_set *new_prog_set = current_prog_set;
>> >> + unsigned long pages;
>> >> + int err;
>> >> + size_t i;
>> >> + struct landlock_prog_set tmp_prog_set = {};
>> >> +
>> >> + if (prog->type != BPF_PROG_TYPE_LANDLOCK_HOOK)
>> >> + return ERR_PTR(-EINVAL);
>> >> +
>> >> + /* validate memory size allocation */
>> >> + pages = prog->pages;
>> >> + if (current_prog_set) {
>> >> + size_t i;
>> >> +
>> >> + for (i = 0; i < ARRAY_SIZE(current_prog_set->programs); 
>> >> i++) {
>> >> + struct landlock_prog_list *walker_p;
>> >> +
>> >> + for (walker_p = current_prog_set->programs[i];
>> >> + walker_p; walker_p = walker_p->prev)
>> >> + pages += walker_p->prog->pages;
>> >> + }
>> >> + /* count a struct landlock_prog_set if we need to allocate 
>> >> one */
>> >> + if (refcount_read(&current_prog_set->usage) != 1)
>> >> + pages += round_up(sizeof(*current_prog_set), 
>> >> PAGE_SIZE)
>> >> + / PAGE_SIZE;
>> >> + }
>> >> + if (pages > LANDLOCK_PROGRAMS_MAX_PAGES)
>> >> + return ERR_PTR(-E2BIG);
>> >> +
>> >> + /* ensure early that we can allocate enough memory for the new
>> >> +  * prog_lists */
>> >> + err = store_landlock_prog(&tmp_prog_set, current_prog_set, prog);
>> >> + if (err)
>> >> + return ERR_PTR(err);
>> >> +
>> >> + /*
>> >> +  * Each task_struct points to an array of prog list pointers.  These
>> >> +  * tables are duplicated when additions are made (which me

Re: [PATCH bpf-next v8 08/11] landlock: Add ptrace restrictions

2018-02-26 Thread Andy Lutomirski


> On Feb 26, 2018, at 8:17 PM, Andy Lutomirski <l...@amacapital.net> wrote:
> 
>> On Tue, Feb 27, 2018 at 12:41 AM, Mickaël Salaün <m...@digikod.net> wrote:
>> A landlocked process has fewer privileges than a non-landlocked process
>> and must then be subject to additional restrictions when manipulating
>> processes. To be allowed to use ptrace(2) and related syscalls on a
>> target process, a landlocked process must have a subset of the target
>> process' rules.
>> 
>> Signed-off-by: Mickaël Salaün <m...@digikod.net>
>> Cc: Alexei Starovoitov <a...@kernel.org>
>> Cc: Andy Lutomirski <l...@amacapital.net>
>> Cc: Daniel Borkmann <dan...@iogearbox.net>
>> Cc: David S. Miller <da...@davemloft.net>
>> Cc: James Morris <james.l.mor...@oracle.com>
>> Cc: Kees Cook <keesc...@chromium.org>
>> Cc: Serge E. Hallyn <se...@hallyn.com>
>> ---
>> 
>> Changes since v6:
>> * factor out ptrace check
>> * constify pointers
>> * cleanup headers
>> * use the new security_add_hooks()
>> ---
>> security/landlock/Makefile   |   2 +-
>> security/landlock/hooks_ptrace.c | 124 
>> +++
>> security/landlock/hooks_ptrace.h |  11 
>> security/landlock/init.c |   2 +
>> 4 files changed, 138 insertions(+), 1 deletion(-)
>> create mode 100644 security/landlock/hooks_ptrace.c
>> create mode 100644 security/landlock/hooks_ptrace.h
>> 
>> diff --git a/security/landlock/Makefile b/security/landlock/Makefile
>> index d0f532a93b4e..605504d852d3 100644
>> --- a/security/landlock/Makefile
>> +++ b/security/landlock/Makefile
>> @@ -3,4 +3,4 @@ obj-$(CONFIG_SECURITY_LANDLOCK) := landlock.o
>> landlock-y := init.o chain.o task.o \
>>tag.o tag_fs.o \
>>enforce.o enforce_seccomp.o \
>> -   hooks.o hooks_cred.o hooks_fs.o
>> +   hooks.o hooks_cred.o hooks_fs.o hooks_ptrace.o
>> diff --git a/security/landlock/hooks_ptrace.c 
>> b/security/landlock/hooks_ptrace.c
>> new file mode 100644
>> index ..f1b977b9c808
>> --- /dev/null
>> +++ b/security/landlock/hooks_ptrace.c
>> @@ -0,0 +1,124 @@
>> +/*
>> + * Landlock LSM - ptrace hooks
>> + *
>> + * Copyright © 2017 Mickaël Salaün <m...@digikod.net>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License version 2, as
>> + * published by the Free Software Foundation.
>> + */
>> +
>> +#include 
>> +#include 
>> +#include <linux/kernel.h> /* ARRAY_SIZE */
>> +#include 
>> +#include <linux/sched.h> /* struct task_struct */
>> +#include 
>> +
>> +#include "common.h" /* struct landlock_prog_set */
>> +#include "hooks.h" /* landlocked() */
>> +#include "hooks_ptrace.h"
>> +
>> +static bool progs_are_subset(const struct landlock_prog_set *parent,
>> +   const struct landlock_prog_set *child)
>> +{
>> +   size_t i;
>> +
>> +   if (!parent || !child)
>> +   return false;
>> +   if (parent == child)
>> +   return true;
>> +
>> +   for (i = 0; i < ARRAY_SIZE(child->programs); i++) {
> 
> ARRAY_SIZE(child->programs) seems misleading.  Is there no define
> NUM_LANDLOCK_PROG_TYPES or similar?
> 
>> +   struct landlock_prog_list *walker;
>> +   bool found_parent = false;
>> +
>> +   if (!parent->programs[i])
>> +   continue;
>> +   for (walker = child->programs[i]; walker;
>> +   walker = walker->prev) {
>> +   if (walker == parent->programs[i]) {
>> +   found_parent = true;
>> +   break;
>> +   }
>> +   }
>> +   if (!found_parent)
>> +   return false;
>> +   }
>> +   return true;
>> +}
> 
> If you used seccomp, you'd get this type of check for free, and it
> would be a lot easier to comprehend.  AFAICT the only extra leniency
> you're granting is that you're agnostic to the order in which the
> rules associated with different program types were applied, which
> could easily be added to seccomp.

On second thought, this is all way too complicated.  I think the correct logic 
is either "if you are filtered by Landlock, you cannot ptrace anything" or to 
delete this patch entirely. If something like Tycho's notifiers goes in, then 
it's not obvious that, just because you have the same set of filters, you have 
the same privilege.  Similarly, if a feature that lets a filter query its 
cgroup goes in (and you proposed this once!) then the logic you implemented 
here is wrong.

Or you could just say that it's the responsibility of a Landlock user to 
properly filter ptrace() just like it's the responsibility of seccomp users to 
filter ptrace if needed.

I take this as further evidence that Landlock makes much more sense as part of 
seccomp than as a totally separate thing.  We've very carefully reviewed these 
things for seccomp.  Please don't make us do it again from scratch.

Re: [net-next v3 0/2] eBPF seccomp filters

2018-02-26 Thread Andy Lutomirski


> On Feb 26, 2018, at 8:38 PM, Kees Cook <keesc...@chromium.org> wrote:
> 
> On Mon, Feb 26, 2018 at 8:19 PM, Andy Lutomirski <l...@amacapital.net> wrote:
>>> On Feb 26, 2018, at 3:20 PM, Kees Cook <keesc...@chromium.org> wrote:
>>> 
>>> On Mon, Feb 26, 2018 at 3:04 PM, Alexei Starovoitov
>>> <alexei.starovoi...@gmail.com> wrote:
>>>>> On Mon, Feb 26, 2018 at 07:26:54AM +, Sargun Dhillon wrote:
>>>>> This patchset enables seccomp filters to be written in eBPF. Although, 
>>>>> this
>>>>> [...]
>>>> The main statement I want to hear from seccomp maintainers before
>>>> proceeding any further on this that enabling eBPF in seccomp won't lead
>>>> to seccomp folks arguing against changes in bpf core (like verifier)
>>>> just because it's used by seccomp.
>>>> It must be spelled out in the commit log with explicit Ack.
>>> 
>>> The primary thing I'm concerned about with eBPF and seccomp is
>>> side-effects from eBPF programs running at syscall time. This is an
>>> extremely sensitive area, and I want to be sure there won't be
>>> feature-creep here that leads to seccomp getting into a bad state.
>>> 
>>> As long as seccomp can continue have its own verifier, I *think* this
>>> will be fine, though, again I remain concerned about maps, etc. I'm
>>> still reviewing these patches and how they might provide overlap with
>>> Tycho's needs too, etc.
>> 
>> I'm not sure I see this as a huge problem.  As far as I can see, there
>> are three ways that a verifier change could be problematic:
>> 
>> 1. Addition of a new type of map.  But seccomp would just not allow
>> new map types by default, right?
>> 
>> 2. Addition of a new BPF_CALLable helper.  Seccomp wants a way to
>> whitelist BPF_CALL targets.  That should be straightforward.
> 
> Yup, agreed on 1 and 2.
> 
>> 3. Straight-up bugs.  Those are exactly as problematic as verifier
>> bugs in any other unprivileged eBPF program type, right?  I don't see
>> why seccomp is special here.
> 
> My concern is more about unintended design mistakes or other feature
> creep with side-effects, especially when it comes to privileges and
> synchronization. Getting no-new-privs done correctly, for example,
> took some careful thought and discussion, and I'm shy from how painful
> TSYNC was on the process locking side, and eBPF has had some rather
> ugly flaws in the past (and recently: it was nice to be able to say
> for Spectre that seccomp filters couldn't be constructed to make
> attacks but eBPF could). Adding the complexity needs to be worth the
> gain. I'm on board for doing it, I just want to be careful. :)
> 

I agree.  I think that, if we do this right, we get a clean version of Tycho's 
notifiers.  We can also very easily build on that to send a non-blocking 
message to the notifier fd, which gets us a version of seccomp logging that 
works for things like Chromium and even strace.  I think this is worth it.

I also think this sort of argument is why Mickaël's privileged-first Landlock 
is the wrong approach.  By getting the unprivileged parts right from day one, 
we can carefully extend the mechanism and keep it usable by unprivileged apps.  
But, if we'd started as root-only, fixing up everything needed to make it safe 
for unprivileged users after the fact would have been quite messy.

And the considerations for making eBPF safe for use by unprivileged tasks to 
filter their descendents are more or less the same for seccomp and Landlock.  
Can we please arrange things so we solve this problem only once?
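
For a sense of scale, the unprivileged filters under discussion are
tiny. Under the proposed BPF_PROG_TYPE_SECCOMP they would look roughly
like the sketch below (section name and loading details are assumed
from the patch discussion; the context is the same struct seccomp_data
as for classic filters):

#include <linux/seccomp.h>
#include <sys/syscall.h>
#include <errno.h>

#define SEC(name) __attribute__((section(name), used))

SEC("seccomp")
int deny_getpid(struct seccomp_data *ctx)
{
	if (ctx->nr == __NR_getpid)
		return SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA);
	return SECCOMP_RET_ALLOW;
}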

Re: [PATCH bpf-next v8 05/11] seccomp,landlock: Enforce Landlock programs per process hierarchy

2018-02-26 Thread Andy Lutomirski
On Tue, Feb 27, 2018 at 2:08 AM, Alexei Starovoitov
 wrote:
> On Tue, Feb 27, 2018 at 01:41:15AM +0100, Mickaël Salaün wrote:
>> The seccomp(2) syscall can be used by a task to apply a Landlock program
>> to itself. As a seccomp filter, a Landlock program is enforced for the
>> current task and all its future children. A program is immutable and a
>> task can only add new restricting programs to itself, forming a list of
>> programs.
>>
>> A Landlock program is tied to a Landlock hook. If the action on a kernel
>> object is allowed by the other Linux security mechanisms (e.g. DAC,
>> capabilities, other LSM), then a Landlock hook related to this kind of
>> object is triggered. The list of programs for this hook is then
>> evaluated. Each program returns a 32-bit value which can deny the action
>> on a kernel object with a non-zero value. If every program of the list
>> returns zero, then the action on the object is allowed.
>>
>> Multiple Landlock programs can be chained to share a 64-bit value for a
>> call chain (e.g. evaluating multiple elements of a file path).  This
>> chaining is restricted when a process constructs this chain by loading a
>> program, but additional checks are performed when it requests to apply
>> this chain of programs to itself.  The restrictions ensure that it is
>> not possible to call multiple programs in a way that would imply
>> handling multiple shared values (i.e. cookies) for one chain.  For now,
>> only a fs_pick program can be chained to the same type of program,
>> because it may make sense if they have different triggers (cf. next
>> commits).  These restrictions still allow reusing Landlock programs in
>> a safe way (e.g. use the same loaded fs_walk program with multiple
>> chains of fs_pick programs).
>>
>> Signed-off-by: Mickaël Salaün 
>
> ...
>
>> +struct landlock_prog_set *landlock_prepend_prog(
>> + struct landlock_prog_set *current_prog_set,
>> + struct bpf_prog *prog)
>> +{
>> + struct landlock_prog_set *new_prog_set = current_prog_set;
>> + unsigned long pages;
>> + int err;
>> + size_t i;
>> + struct landlock_prog_set tmp_prog_set = {};
>> +
>> + if (prog->type != BPF_PROG_TYPE_LANDLOCK_HOOK)
>> + return ERR_PTR(-EINVAL);
>> +
>> + /* validate memory size allocation */
>> + pages = prog->pages;
>> + if (current_prog_set) {
>> + size_t i;
>> +
>> + for (i = 0; i < ARRAY_SIZE(current_prog_set->programs); i++) {
>> + struct landlock_prog_list *walker_p;
>> +
>> + for (walker_p = current_prog_set->programs[i];
>> + walker_p; walker_p = walker_p->prev)
>> + pages += walker_p->prog->pages;
>> + }
>> + /* count a struct landlock_prog_set if we need to allocate one 
>> */
>> + if (refcount_read(&current_prog_set->usage) != 1)
>> + pages += round_up(sizeof(*current_prog_set), PAGE_SIZE)
>> + / PAGE_SIZE;
>> + }
>> + if (pages > LANDLOCK_PROGRAMS_MAX_PAGES)
>> + return ERR_PTR(-E2BIG);
>> +
>> + /* ensure early that we can allocate enough memory for the new
>> +  * prog_lists */
>> + err = store_landlock_prog(&tmp_prog_set, current_prog_set, prog);
>> + if (err)
>> + return ERR_PTR(err);
>> +
>> + /*
>> +  * Each task_struct points to an array of prog list pointers.  These
>> +  * tables are duplicated when additions are made (which means each
>> +  * table needs to be refcounted for the processes using it). When a new
>> + * table is created, all the refcounters on the prog_list are bumped (to
>> +  * track each table that references the prog). When a new prog is
>> +  * added, it's just prepended to the list for the new table to point
>> +  * at.
>> +  *
>> +  * Manage all the possible errors before this step to not uselessly
>> +  * duplicate current_prog_set and avoid a rollback.
>> +  */
>> + if (!new_prog_set) {
>> + /*
>> +  * If there is no Landlock program set used by the current task,
>> +  * then create a new one.
>> +  */
>> + new_prog_set = new_landlock_prog_set();
>> + if (IS_ERR(new_prog_set))
>> + goto put_tmp_lists;
>> + } else if (refcount_read(&current_prog_set->usage) > 1) {
>> + /*
>> +  * If the current task is not the sole user of its Landlock
>> +  * program set, then duplicate them.
>> +  */
>> + new_prog_set = new_landlock_prog_set();
>> + if (IS_ERR(new_prog_set))
>> + goto put_tmp_lists;
>> + for (i = 0; i < ARRAY_SIZE(new_prog_set->programs); i++) {
>> + new_prog_set->programs[i] =
>> +   

Re: [PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing

2018-02-26 Thread Andy Lutomirski
On Tue, Feb 27, 2018 at 12:41 AM, Mickaël Salaün  wrote:
> Hi,
>
> This eighth series is a major revamp of the Landlock design compared to
> the previous series [1]. This enables more flexibility and granularity
> of access control with file paths. It is now possible to enforce an
> access control according to a file hierarchy. Landlock uses the concept
> of inode and path to identify such hierarchy. In a way, it brings tools
> to program what is a file hierarchy.
>
> There are now three types of Landlock hooks: FS_WALK, FS_PICK and FS_GET.
> Each of them accepts a dedicated eBPF program, called a Landlock
> program.  They can be chained to enforce a full access control according
> to a list of directories or files. The set of actions on a file is well
> defined (e.g. read, write, ioctl, append, lock, mount...) taking
> inspiration from the major Linux LSMs and some other access-controls
> like Capsicum.  These program types are designed to be cache-friendly,
> which give room for optimizations in the future.
>
> The documentation patch contains some kernel documentation and
> explanations on how to use Landlock.  The compiled documentation and
> a talk I gave at FOSDEM can be found here: https://landlock.io
> This patch series can be found in the branch landlock-v8 in this repo:
> https://github.com/landlock-lsm/linux
>
> There are still some minor issues with this patch series but it should
> demonstrate how powerful this design may be. One of these issues is that
> it is not a stackable LSM anymore, but the infrastructure management of
> security blobs should allow stacking it with other LSMs [4].
>
> This is the first step of the roadmap discussed at LPC [2].  While the
> intended final goal is to allow unprivileged users to use Landlock, this
> series allows only a process with global CAP_SYS_ADMIN to load and
> enforce a rule.  This may help to get feedback and avoid unexpected
> behaviors.
>
> This series can be applied on top of bpf-next, commit 7d72637eb39f
> ("Merge branch 'x86-jit'").  This can be tested with
> CONFIG_SECCOMP_FILTER and CONFIG_SECURITY_LANDLOCK.  I would really
> appreciate constructive comments on the design and the code.
>
>
> # Landlock LSM
>
> The goal of this new Linux Security Module (LSM) called Landlock is to
> allow any process, including unprivileged ones, to create powerful
> security sandboxes comparable to XNU Sandbox or OpenBSD Pledge. This
> kind of sandbox is expected to help mitigate the security impact of bugs
> or unexpected/malicious behaviors in user-space applications.
>
> The approach taken is to add the minimum amount of code while still
> allowing the user-space application to create quite complex access
> rules.  A dedicated security policy language such as the one used by
> SELinux, AppArmor and other major LSMs involves a lot of code and is
> usually permitted to only a trusted user (i.e. root).  On the contrary,
> eBPF programs already exist and are designed to be safely loaded by
> unprivileged user-space.
>
> This design does not seem too intrusive but is flexible enough to allow
> a powerful sandbox mechanism accessible by any process on Linux. The use
> of seccomp and Landlock is more suitable with the help of a user-space
> library (e.g.  libseccomp) that could help to specify a high-level
> language to express a security policy instead of raw eBPF programs.
> Moreover, thanks to the LLVM front-end, it is quite easy to write an
> eBPF program with a subset of the C language.
>
>
> # Frequently asked questions
>
> ## Why is seccomp-bpf not enough?
>
> A seccomp filter can access only raw syscall arguments (i.e. the
> register values) which means that it is not possible to filter according
> to the value pointed to by an argument, such as a file pathname. As an
> embryonic Landlock version demonstrated, filtering at the syscall level
> is complicated (e.g. need to take care of race conditions). This is
> mainly because the access control checkpoints of the kernel are not at
> this high-level but more underneath, at the LSM-hook level. The LSM
> hooks are designed to handle this kind of checks.  Landlock abstracts
> this approach to leverage the ability of unprivileged users to limit
> themselves.
>
> Cf. section "What it isn't?" in Documentation/prctl/seccomp_filter.txt
>
>
> ## Why use the seccomp(2) syscall?
>
> Landlock uses the same semantics as seccomp to apply access rule
> restrictions. It adds a new layer of security for the current process
> which is inherited by its children. It makes sense to use a unique
> access-restricting syscall (that should be allowed by seccomp filters)
> which can only drop privileges. Moreover, a Landlock rule could come
> from outside a process (e.g.  passed through a UNIX socket). It is then
> useful to differentiate the creation/load of Landlock eBPF programs via
> bpf(2), from rule enforcement via seccomp(2).

This seems like a weak argument to me.  Sure, this is a bit different
from seccomp(), 

Re: [net-next v3 0/2] eBPF seccomp filters

2018-02-26 Thread Andy Lutomirski
> On Feb 26, 2018, at 3:20 PM, Kees Cook  wrote:
>
> On Mon, Feb 26, 2018 at 3:04 PM, Alexei Starovoitov
>  wrote:
>>> On Mon, Feb 26, 2018 at 07:26:54AM +, Sargun Dhillon wrote:
>>> This patchset enables seccomp filters to be written in eBPF. Although this
>>> [...]
>> The main statement I want to hear from seccomp maintainers before
>> proceeding any further on this that enabling eBPF in seccomp won't lead
>> to seccomp folks arguing against changes in bpf core (like verifier)
>> just because it's used by seccomp.
>> It must be spelled out in the commit log with explicit Ack.
>
> The primary thing I'm concerned about with eBPF and seccomp is
> side-effects from eBPF programs running at syscall time. This is an
> extremely sensitive area, and I want to be sure there won't be
> feature-creep here that leads to seccomp getting into a bad state.
>
> As long as seccomp can continue to have its own verifier, I *think* this
> will be fine, though, again I remain concerned about maps, etc. I'm
> still reviewing these patches and how they might provide overlap with
> Tycho's needs too, etc.

I'm not sure I see this as a huge problem.  As far as I can see, there
are three ways that a verifier change could be problematic:

1. Addition of a new type of map.  But seccomp would just not allow
new map types by default, right?

2. Addition of a new BPF_CALLable helper.  Seccomp wants a way to
whitelist BPF_CALL targets.  That should be straightforward.

3. Straight-up bugs.  Those are exactly as problematic as verifier
bugs in any other unprivileged eBPF program type, right?  I don't see
why seccomp is special here.
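
For (2), concretely, I'd expect the whitelist to look like the
get_func_proto hooks the other program types already have -- a sketch only,
seccomp_func_proto is a made-up name:

static const struct bpf_func_proto *
seccomp_func_proto(enum bpf_func_id func_id)
{
	switch (func_id) {
	case BPF_FUNC_map_lookup_elem:
		/* and only for map types seccomp chose to allow, per (1) */
		return &bpf_map_lookup_elem_proto;
	default:
		/* any helper not whitelisted fails verification */
		return NULL;
	}
}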


Re: [PATCH bpf-next v8 08/11] landlock: Add ptrace restrictions

2018-02-26 Thread Andy Lutomirski
On Tue, Feb 27, 2018 at 12:41 AM, Mickaël Salaün <m...@digikod.net> wrote:
> A landlocked process has fewer privileges than a non-landlocked process
> and must then be subject to additional restrictions when manipulating
> processes. To be allowed to use ptrace(2) and related syscalls on a
> target process, a landlocked process must have a subset of the target
> process' rules.
>
> Signed-off-by: Mickaël Salaün <m...@digikod.net>
> Cc: Alexei Starovoitov <a...@kernel.org>
> Cc: Andy Lutomirski <l...@amacapital.net>
> Cc: Daniel Borkmann <dan...@iogearbox.net>
> Cc: David S. Miller <da...@davemloft.net>
> Cc: James Morris <james.l.mor...@oracle.com>
> Cc: Kees Cook <keesc...@chromium.org>
> Cc: Serge E. Hallyn <se...@hallyn.com>
> ---
>
> Changes since v6:
> * factor out ptrace check
> * constify pointers
> * cleanup headers
> * use the new security_add_hooks()
> ---
>  security/landlock/Makefile   |   2 +-
>  security/landlock/hooks_ptrace.c | 124 
> +++
>  security/landlock/hooks_ptrace.h |  11 
>  security/landlock/init.c |   2 +
>  4 files changed, 138 insertions(+), 1 deletion(-)
>  create mode 100644 security/landlock/hooks_ptrace.c
>  create mode 100644 security/landlock/hooks_ptrace.h
>
> diff --git a/security/landlock/Makefile b/security/landlock/Makefile
> index d0f532a93b4e..605504d852d3 100644
> --- a/security/landlock/Makefile
> +++ b/security/landlock/Makefile
> @@ -3,4 +3,4 @@ obj-$(CONFIG_SECURITY_LANDLOCK) := landlock.o
>  landlock-y := init.o chain.o task.o \
> tag.o tag_fs.o \
> enforce.o enforce_seccomp.o \
> -   hooks.o hooks_cred.o hooks_fs.o
> +   hooks.o hooks_cred.o hooks_fs.o hooks_ptrace.o
> diff --git a/security/landlock/hooks_ptrace.c 
> b/security/landlock/hooks_ptrace.c
> new file mode 100644
> index ..f1b977b9c808
> --- /dev/null
> +++ b/security/landlock/hooks_ptrace.c
> @@ -0,0 +1,124 @@
> +/*
> + * Landlock LSM - ptrace hooks
> + *
> + * Copyright © 2017 Mickaël Salaün <m...@digikod.net>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2, as
> + * published by the Free Software Foundation.
> + */
> +
> +#include 
> +#include 
> +#include  /* ARRAY_SIZE */
> +#include 
> +#include  /* struct task_struct */
> +#include 
> +
> +#include "common.h" /* struct landlock_prog_set */
> +#include "hooks.h" /* landlocked() */
> +#include "hooks_ptrace.h"
> +
> +static bool progs_are_subset(const struct landlock_prog_set *parent,
> +   const struct landlock_prog_set *child)
> +{
> +   size_t i;
> +
> +   if (!parent || !child)
> +   return false;
> +   if (parent == child)
> +   return true;
> +
> +   for (i = 0; i < ARRAY_SIZE(child->programs); i++) {

ARRAY_SIZE(child->programs) seems misleading.  Is there no define
NUM_LANDLOCK_PROG_TYPES or similar?

> +   struct landlock_prog_list *walker;
> +   bool found_parent = false;
> +
> +   if (!parent->programs[i])
> +   continue;
> +   for (walker = child->programs[i]; walker;
> +   walker = walker->prev) {
> +   if (walker == parent->programs[i]) {
> +   found_parent = true;
> +   break;
> +   }
> +   }
> +   if (!found_parent)
> +   return false;
> +   }
> +   return true;
> +}

If you used seccomp, you'd get this type of check for free, and it
would be a lot easier to comprehend.  AFAICT the only extra leniency
you're granting is that you're agnostic to the order in which the
rules associated with different program types were applied, which
could easily be added to seccomp.


Re: [PATCH net-next 0/3] eBPF Seccomp filters

2018-02-15 Thread Andy Lutomirski

> On Feb 14, 2018, at 8:30 PM, Alexei Starovoitov 
>  wrote:
> 
> On Wed, Feb 14, 2018 at 10:32:22AM -0700, Tycho Andersen wrote:
 
>>>> What's the reason for adding eBPF support? seccomp shouldn't need it,
>>>> and it only makes the code more complex. I'd rather stick with cBPF
>>>> until we have an overwhelmingly good reason to use eBPF as a "native"
>>>> seccomp filter language.
 
>>> 
>>> I can think of two fairly strong use cases for eBPF's ability to call
>>> functions: logging and Tycho's user notifier thing.
>> 
>> Worth noting that there is one additional thing that I didn't
>> implement, but which would be nice and is probably not possible with
>> eBPF (at least, not without a bunch of additional infrastructure):
>> passing fds back to the tracee from the manager if you intercept
>> socket(), or accept() or something.
>> 
>> This could again be accomplished via other means, though it would be a
>> lot nicer to have a primitive for it.
> 
> there is the bpf_perf_event_output() interface that allows streaming
> arbitrary data from the kernel into user space via a perf ring buffer.
> User space can epoll on it. We use this in both tracing and networking
> for notifications and streaming data transfers.
> I suspect this can be used for 'logging' too, since it's cheap and fast.

I think this is the right idea but we'd want to tweak it.  We don't want the 
log messages to go to some systemwide buffer (seccomp can already do this and
it's annoying) -- we want them to go to the filter's creator.  In fact, the
seccomp listener fd concept could easily be extended to do exactly this.

> 
> Also I think the argument that seccomp+eBPF will be faster than
> seccomp+cBPF is a weak one. I bet kpti on/off makes no difference
> under seccomp, since _all_ syscalls are already slow for a sandboxed app.

It's been a while since I benchmarked it, but I suspect that a simple seccomp 
filter is quite a bit faster than a PTI transition.
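
For anyone who wants to redo the numbers, the harness is only a few lines
(my sketch, not from any tree; it times a no-op syscall with and without a
trivial allow-everything filter):

#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Average cost of one getpid() over a million calls, in ns. */
static long ns_per_call(void)
{
	struct timespec a, b;
	long i;

	clock_gettime(CLOCK_MONOTONIC, &a);
	for (i = 0; i < 1000000; i++)
		syscall(SYS_getpid);
	clock_gettime(CLOCK_MONOTONIC, &b);
	return ((b.tv_sec - a.tv_sec) * 1000000000L +
		(b.tv_nsec - a.tv_nsec)) / 1000000;
}

int main(void)
{
	struct sock_filter allow =
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW);
	struct sock_fprog prog = { .len = 1, .filter = &allow };

	printf("bare:     %ld ns/syscall\n", ns_per_call());
	prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
	prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
	printf("filtered: %ld ns/syscall\n", ns_per_call());
	return 0;
}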



Re: [PATCH net-next 0/3] eBPF Seccomp filters

2018-02-14 Thread Andy Lutomirski
On Tue, Feb 13, 2018 at 3:47 PM, Kees Cook  wrote:
> On Tue, Feb 13, 2018 at 7:42 AM, Sargun Dhillon  wrote:
>> This patchset enables seccomp filters to be written in eBPF. Although
>> this patchset doesn't introduce much of the functionality enabled by
>> eBPF, it lays the groundwork for it.
>>
>> It also introduces the capability to dump eBPF filters via the PTRACE
>> API in order to make it so that CHECKPOINT_RESTORE will be satisfied.
>> In the attached samples, there's an example of this. One can then use
>> BPF_OBJ_GET_INFO_BY_FD in order to get the actual code of the program,
>> and use that at reload time.
>>
>> The primary reason for not adding maps support in this patchset is
>> to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS.
>> If we have a map that the BPF program can read, it can potentially
>> "change" privileges after running. It seems like doing writes only
>> is safe, because it can be pure, and side effect free, and therefore
>> not negatively affect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come
>> to an agreement, this can be in a follow-up patchset.
>
> What's the reason for adding eBPF support? seccomp shouldn't need it,
> and it only makes the code more complex. I'd rather stick with cBPF
> until we have an overwhelmingly good reason to use eBPF as a "native"
> seccomp filter language.
>

I can think of two fairly strong use cases for eBPF's ability to call
functions: logging and Tycho's user notifier thing.  They let seccomp
filters *do* something synchronously, which is a better match for both
use cases than the current approach of "hey, I'd like to log this
syscall, but it's really awkward to attach other information or to
track exactly *which* filter logged what or to stack any of it".

Also, eBPF's stronger arithmetic support would allow bitops (I think),
which would make "is the nr in this list" quite a bit faster in some
cases.
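
For instance, a syscall-set membership test collapses to a shift and a mask
(plain C sketch of the logic, not actual eBPF bytecode; the table is
hypothetical):

/* 512 syscall numbers, one bit each; filled in by some generator. */
static const unsigned long long denied[8];

static int nr_is_denied(unsigned int nr)
{
	if (nr >= 512)
		return 0;
	/* One shift, one mask -- cost independent of how many bits are
	 * set, unlike a cBPF compare-and-jump chain. */
	return (denied[nr / 64] >> (nr % 64)) & 1;
}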


Re: [PATCH V2] Fix a sleep-in-atomic bug in shash_setkey_unaligned

2017-10-03 Thread Andy Lutomirski
On Mon, Oct 2, 2017 at 10:26 PM, Herbert Xu <herb...@gondor.apana.org.au> wrote:
> On Mon, Oct 02, 2017 at 09:18:24PM -0700, Andy Lutomirski wrote:
>> > On Oct 2, 2017, at 7:25 PM, Jia-Ju Bai <baijiaju1...@163.com> wrote:
>> >
>> > The SCTP program may sleep under a spinlock, and the function call path is:
>> > sctp_generate_t3_rtx_event (acquire the spinlock)
>> >  sctp_do_sm
>> >sctp_side_effects
>> >  sctp_cmd_interpreter
>> >sctp_make_init_ack
>> >  sctp_pack_cookie
>> >crypto_shash_setkey
>> >  shash_setkey_unaligned
>> >kmalloc(GFP_KERNEL)
>>
>> I'm going to go out on a limb here: why on Earth is our crypto API so
>> full of indirection that we allocate memory at all here?
>
> The crypto API operates on a one key per-tfm basis.  So normally
> tfm allocation and key setting is done once only and not done on
> the data path.
>
> I have looked at the SCTP code and it appears to fit this paradigm.
> That is, we should be able to allocate the tfm and set the key when
> the key is actually generated via get_random_bytes, rather than every
> time the key is used, which is not only a waste but, as you see, runs
> into API issues.

It's a waste because it loses a pre-computation advantage.

The fact that it has memory allocation issues is crypto API's fault,
full stop.  There is no legit reason to need to allocate anything.


Re: [PATCH V2] Fix a sleep-in-atomic bug in shash_setkey_unaligned

2017-10-02 Thread Andy Lutomirski
> On Oct 2, 2017, at 7:25 PM, Jia-Ju Bai  wrote:
>
> The SCTP program may sleep under a spinlock, and the function call path is:
> sctp_generate_t3_rtx_event (acquire the spinlock)
>  sctp_do_sm
>sctp_side_effects
>  sctp_cmd_interpreter
>sctp_make_init_ack
>  sctp_pack_cookie
>crypto_shash_setkey
>  shash_setkey_unaligned
>kmalloc(GFP_KERNEL)
>

I'm going to go out on a limb here: why on Earth is our crypto API so
full of indirection that we allocate memory at all here?

We're synchronously computing a hash of a small amount of data using
either HMAC-SHA1 or HMAC-SHA256 (determined at runtime) if I read it
right.  There's a sane way to do this that doesn't need kmalloc,
alloca, or fancy indirection.  And then there's crypto_shash_xyz().

--Andy, who is sick of seeing stupid bugs caused by the fact that it's
basically impossible to use the crypto API in a sane way.

P.S. gnulib has:

int hmac_sha256 (const void *key, size_t keylen, const void *in,
size_t inlen, void *resbuf);

An init/update/final-style API would be nice, but something like this
would be a phenomenal improvement over what we have.
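
To illustrate, the SCTP path above would then be purely computational -- no
tfm, no kmalloc, nothing that can sleep (sketch against a gnulib-style
prototype; pack_cookie_mac is a made-up name):

#include <stddef.h>

int hmac_sha256(const void *key, size_t keylen, const void *in,
		size_t inlen, void *resbuf);

static int pack_cookie_mac(const unsigned char key[32],
			   const void *cookie, size_t cookie_len,
			   unsigned char mac_out[32])
{
	/* Safe under a spinlock: no allocation, no indirection. */
	return hmac_sha256(key, 32, cookie, cookie_len, mac_out);
}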


Re: [kernel-hardening] [PATCH v4 next 0/3] modules: automatic module loading restrictions

2017-05-23 Thread Andy Lutomirski
On Tue, May 23, 2017 at 11:36 AM, Kees Cook  wrote:
> On Tue, May 23, 2017 at 12:48 AM, Solar Designer  wrote:
>> For modules_autoload_mode=2, we already seem to have the equivalent of
>> modprobe=/bin/true (or does it differ subtly, maybe in return values?),
>> which I already use at startup on a GPU box like this (preloading
>> modules so that the OpenCL backends wouldn't need the autoloading):
>>
>> nvidia-smi
>> nvidia-modprobe -u -c=0
>> #modprobe nvidia_uvm
>> #modprobe fglrx
>>
>> sysctl -w kernel.modprobe=/bin/true
>> sysctl -w kernel.hotplug=/bin/true
>>
>> but it's good to also have this supported more explicitly and more
>> consistently through modules_autoload_mode=2 while we're at it.  So I
>> support having this mode as well.  I just question the need to have it
>> non-resettable.
>
> I agree it's useful to have the explicit =2 state just to avoid
> confusion when more systems start implementing
> CONFIG_STATIC_USERMODEHELPER and kernel.modprobe becomes read-only
> (though the userspace implementation may allow for some way to disable
> it, etc). I just like avoiding the upcall to modprobe at all.

I fully support =2 to mean "no automatic loading at all".  I dislike
making it non-resettable.  If you can write to sysctls, then most
likely you can either call init_module() directly or the system has
module loading disabled entirely.


Re: [kernel-hardening] [PATCH v4 next 0/3] modules: automatic module loading restrictions

2017-05-22 Thread Andy Lutomirski
On Mon, May 22, 2017 at 4:07 PM, Kees Cook  wrote:
> On Mon, May 22, 2017 at 12:55 PM, Djalal Harouni  wrote:
>> On Mon, May 22, 2017 at 6:43 PM, Solar Designer  wrote:
>>> On Mon, May 22, 2017 at 03:49:15PM +0200, Djalal Harouni wrote:
>>>> On Mon, May 22, 2017 at 2:08 PM, Solar Designer  wrote:
>>>>> On Mon, May 22, 2017 at 01:57:03PM +0200, Djalal Harouni wrote:
>>>>>> *) When modules_autoload_mode is set to (2), automatic module loading is
>>>>>> disabled for all. Once set, this value can not be changed.
>>>>>
>>>>> What purpose does this securelevel-like property ("Once set, this value
>>>>> can not be changed.") serve here?  I think this mode 2 is needed, but
>>>>> without this extra property, which is bypassable by e.g. explicitly
>>>>> loaded kernel modules anyway (and that's OK).

>>>> My reasoning about "Once set, this value can not be changed" is mainly this:

>>>> If you have some systems where modules are not updated for any given
>>>> reason, then the only one who will be able to load a module is an
>>>> administrator; basically this is a shortcut for:

>>>> * Apps/services can run with CAP_NET_ADMIN but they are not allowed to
>>>> auto-load 'netdev' modules.

>>>> * Explicitly loading modules can be guarded by seccomp filters *per*
>>>> app, so even if these apps have CAP_SYS_MODULE they won't be able to
>>>> explicitly load modules; one has to remount some sysctl /proc/ entries
>>>> read-only here and remove CAP_SYS_ADMIN for all apps anyway.

>>>> This mainly serves the purpose of those systems that do not receive
>>>> updates: if I don't want to expose those kernel interfaces, what should
>>>> I do? Then if I want to unload old versions and replace them with new
>>>> ones, what operation should be allowed? And only the real root of the
>>>> system can do it. Hence, the "Once set, this value can not be changed"
>>>> is more of a shortcut; also the idea was put in my mind based on how
>>>> "modules_disabled" is disabled forever, and some other interfaces. I
>>>> would say: it is easy to handle a transition from 1) "hey, this system
>>>> is still up to date, some features should be exposed" to 2) "this
>>>> system is not up to date anymore, only root should expose some
>>>> features..."

>>>> Hmm, I am not sure if this answers your question? :-)
>>>
>>> This answers my question, but in a way that I summarize as "there's no
>>> good reason to include this securelevel-like property".
>>>
>>
>> Hmm, sorry I did forget to add in my previous comment that with such
>> systems, CAP_SYS_MODULE can be used to reset the
>> "modules_autoload_mode" sysctl back from mode 2 to mode 1, even if we
>> disable it privileged tasks can be triggered to overwrite the sysctl
>> flag and get it back unless /proc is read-only... that's one of the
>> points, it should not be so easy to relax it.
>
> I'm on the fence. For modules_disabled and Yama, it was tied to
> CAP_SYS_ADMIN, basically designed to be a at-boot setting that could
> not later be undone by an attacker gaining that privilege, keeping
> them out of either kernel memory or existing user process memory.
> Here, it's CAP_SYS_MODULE... it's hard to imagine the situation where
> a CAP_SYS_MODULE-capable process could write to this sysctl but NOT
> issue direct modprobe requests, but it's _possible_ via crazy symlink
> games to trick capable processes into writing to sysctls. We've seen
> this multiple times before, and it's a way for attackers to turn a
> single privileged write into a privileged exec.
>
> I might turn the question around, though: why would we want to have it
> changeable at this setting?
>
> I'm fine leaving that piece off, either way.

I think that having the un-resettable mode is unnecessary.  We should
have an option that disables loading modules entirely and cannot be
unset.  (That means no explicit loads and no implicit loads.)  Maybe
we already have this.  Otherwise, tightening caps needed for implicit
loads should just be a normal yes/no setting IMO.


Re: [net-next PATCH v2 8/8] net: Introduce SO_INCOMING_NAPI_ID

2017-03-23 Thread Andy Lutomirski
On Thu, Mar 23, 2017 at 5:58 PM, Alexander Duyck
<alexander.du...@gmail.com> wrote:
> On Thu, Mar 23, 2017 at 3:43 PM, Andy Lutomirski <l...@kernel.org> wrote:
>> On Thu, Mar 23, 2017 at 2:38 PM, Alexander Duyck
>> <alexander.du...@gmail.com> wrote:
>>> From: Sridhar Samudrala <sridhar.samudr...@intel.com>
>>>
>>> This socket option returns the NAPI ID associated with the queue on which
>>> the last frame is received. This information can be used by the apps to
>>> split the incoming flows among the threads based on the Rx queue on which
>>> they are received.
>>>
>>> If the NAPI ID actually represents a sender_cpu then the value is ignored
>>> and 0 is returned.
>>
>> This may be more of a naming / documentation issue than a
>> functionality issue, but to me this reads as:
>>
>> "This socket option returns an internal implementation detail that, if
>> you are sufficiently clueful about the current performance heuristics
>> used by the Linux networking stack, just might give you a hint as to
>> which epoll set to put the socket in."  I've done some digging into
>> Linux networking stuff, but not nearly enough to have the slightest
>> clue what you're supposed to do with the NAPI ID.
>
> Really the NAPI ID is an arbitrary number that will be unique per
> device queue, though multiple Rx queues can share a NAPI ID if they
> are meant to be processed in the same call to poll.
>
> If we wanted we could probably rename it to something like Device Poll
> Identifier or Device Queue Identifier, DPID or DQID, if that would
> work for you.  Essentially it is just a unique u32 value that should
> not identify any other queue in the system while this device queue is
> active.  Really the number itself is mostly arbitrary, the main thing
> is that it doesn't change and uniquely identifies the queue in the
> system.

That seems reasonably sane to me.

>
>> It would be nice to make this a bit more concrete and a bit less tied
>> in Linux innards.  Perhaps a socket option could instead return a hint
>> saying "for best results, put this socket in an epoll set that's on
>> cpu N"?  After all, you're unlikely to do anything productive with
>> busy polling at all, even on a totally different kernel
>> implementation, if you have more than one epoll set per CPU.  I can
>> see cases where you could plausibly poll with fewer than one set per
>> CPU, I suppose.
>
> Really we kind of already have an option that does what you are
> implying called SO_INCOMING_CPU.  The problem is it requires pinning
> the interrupts to the CPUs in order to keep the values consistent,

Some day the kernel should just solve this problem once and for all.
Have root give a basic policy for mapping queues to CPUs (one per
physical core / one per logical core / use this subset of cores) and
perhaps forcibly prevent irqbalanced from even seeing it.  I'm sure
other solutions are possible.

> and
> even then busy polling can mess that up if the busy poll thread is
> running on a different CPU.  With the NAPI ID we have to do a bit of
> work on the application end, but we can uniquely identify each
> incoming queue and interrupt migration and busy polling don't have any
> effect on it.  So for example we could stack all the interrupts on CPU
> 0, and have our main thread located there doing the sorting of
> incoming requests and handing them out to epoll listener threads on
> other CPUs.  When those epoll listener threads start doing busy
> polling the NAPI ID won't change even though the packet is being
> processed on a different CPU.
>
>> Again, though, from the description, it's totally unclear what a user
>> is supposed to do.
>
> What you end up having to do is essentially create a hash of sorts so
> that you can go from NAPI IDs to threads.  In an ideal setup what you
> end up with multiple threads, each one running one epoll, and each
> epoll polling on one specific queue.
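
So on the application side I gather it boils down to something like this
(sketch; assumes the SO_INCOMING_NAPI_ID option from this series is in the
headers, and the modulo is a stand-in for a real ID-to-thread map):

#include <sys/socket.h>

static int pick_worker(int fd, int nworkers)
{
	unsigned int napi_id = 0;
	socklen_t len = sizeof(napi_id);

	if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_NAPI_ID,
		       &napi_id, &len) < 0 || napi_id == 0)
		return 0;		/* sender_cpu case: no queue info */
	return napi_id % nworkers;	/* hand off to that worker's epoll set */
}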

So don't we want queue id, not NAPI id?  Or am I still missing something?

But I'm also a bit confused as to the overall performance effect.
Suppose I have an rx queue that has its interrupt bound to cpu 0.  For
whatever reason (random chance if I'm hashing, for example), I end up
with the epoll caller on cpu 1.  Suppose further that cpus 0 and 1 are
on different NUMA nodes.

Now, let's suppose that I get lucky and *all* the packets are pulled
off the queue by epoll busy polling.  Life is great [1].  But suppose
that, due to a tiny hiccup or simply user code spending some cycles
processing those packets, an rx interrupt fires.  Now cpu 0 starts
pulling packets off the queue via NAPI, right?  So both NUMA nodes are
fi

Re: [net-next PATCH v2 8/8] net: Introduce SO_INCOMING_NAPI_ID

2017-03-23 Thread Andy Lutomirski
On Thu, Mar 23, 2017 at 2:38 PM, Alexander Duyck
 wrote:
> From: Sridhar Samudrala 
>
> This socket option returns the NAPI ID associated with the queue on which
> the last frame is received. This information can be used by the apps to
> split the incoming flows among the threads based on the Rx queue on which
> they are received.
>
> If the NAPI ID actually represents a sender_cpu then the value is ignored
> and 0 is returned.

This may be more of a naming / documentation issue than a
functionality issue, but to me this reads as:

"This socket option returns an internal implementation detail that, if
you are sufficiently clueful about the current performance heuristics
used by the Linux networking stack, just might give you a hint as to
which epoll set to put the socket in."  I've done some digging into
Linux networking stuff, but not nearly enough to have the slightest
clue what you're supposed to do with the NAPI ID.

It would be nice to make this a bit more concrete and a bit less tied
in Linux innards.  Perhaps a socket option could instead return a hint
saying "for best results, put this socket in an epoll set that's on
cpu N"?  After all, you're unlikely to do anything productive with
busy polling at all, even on a totally different kernel
implementation, if you have more than one epoll set per CPU.  I can
see cases where you could plausibly poll with fewer than one set per
CPU, I suppose.

Again, though, from the description, it's totally unclear what a user
is supposed to do.

--Andy


Re: [PATCH v5 06/10] seccomp,landlock: Handle Landlock events per process hierarchy

2017-03-02 Thread Andy Lutomirski
On Thu, Mar 2, 2017 at 4:48 PM, Mickaël Salaün <m...@digikod.net> wrote:
>
> On 02/03/2017 17:36, Andy Lutomirski wrote:
>> On Wed, Mar 1, 2017 at 3:28 PM, Mickaël Salaün <m...@digikod.net> wrote:
>>>
>>>
>>> On 01/03/2017 23:20, Andy Lutomirski wrote:
>>>> On Wed, Mar 1, 2017 at 2:14 PM, Mickaël Salaün <m...@digikod.net> wrote:
>>>>>
>>>>> On 28/02/2017 21:01, Andy Lutomirski wrote:
>>>>>> On Tue, Feb 21, 2017 at 5:26 PM, Mickaël Salaün <m...@digikod.net> wrote:
>>>>> This design makes it possible for a process to add more constraints to
>>>>> its children on the fly. I think it is a good feature to have and a
>>>>> safer default inheritance mechanism, but it could be guarded by an
>>>>> option flag if we want both mechanism to be available. The same design
>>>>> could be used by seccomp filter too.
>>>>>
>>>>
>>>> Then let's do it right.
>>>>
>>>> Currently each task has an array of seccomp filter layers.  When a
>>>> task forks, the child inherits the layers.  All the layers are
>>>> presently immutable.  With Landlock, a layer can logically be a
>>>> syscall filter layer or a Landlock layer.  This fits into the
>>>> existing model just fine.
>>>>
>>>> If we want to have an interface to allow modification of an existing
>>>> layer, let's make it so that, when a layer is added, you have to
>>>> specify a flag to make the layer modifiable (by current, presumably,
>>>> although I can imagine other policies down the road).  Then have a
>>>> separate API that modifies a layer.
>>>>
>>>> IOW, I think your patch is bad for three reasons, all fixable:
>>>>
>>>> 1. The default is wrong.  A layer should be immutable to avoid an easy
>>>> attack in which you try to sandbox *yourself* and then you just modify
>>>> the layer to weaken it.
>>>
>>> This is not possible, there is only one operation for now:
>>> SECCOMP_ADD_LANDLOCK_RULE. You can only add more rules to the list (as
>>> for seccomp filter). There is no way to weaken a sandbox. The question
>>> is: how do we want to handle the rules *tree* (from the kernel point of
>>> view)?
>>>
>>
>> Fair enough.  But I still think that immutability (like regular
>> seccomp) should be the default.  For security, simplicity is
>> important.  I guess there could be two ways to relax immutability:
>> allowing making the layer stricter and allowing any change at all.
>>
>> As a default, though, programs should be able to expect that:
>>
>> seccomp(SECCOMP_ADD_WHATEVER, ...);
>> fork();
>>
>> [parent gets compromised]
>> [in parent]seccomp(anything whatsoever);
>>
>> will not affect the child.  If the parent wants to relax that, that's
>> fine, but I think it should be explicit.
>
> Good point. However the term "immutability" doesn't fit right because
> the process is still allowed to add more rules to itself (as for
> seccomp). The difference lies in the way a rule may be "appended" (by
> the current process) or "inserted" (by a parent process).
>
> I think three or four kinds of operations (through the seccomp syscall)
> make sense:
> * append a rule (for the current process and its future children)

Sure, but this operation should *never* affect existing children,
existing seccomp layers, existing nodes, etc.  It should affect
current and future children only.  Or it could simply not exist for
Landlock and instead you'd have to add a layer (see below) and then
program that layer.

> * add a node (insert point), to which the inserted rules will be tied
> * insert a rule in the node, which will be inherited by futures children

I would advocate calling this a "seccomp layer" and making creation
and manipulation of them generic.

> * (maybe a "lock" command to make a layer immutable for the current
> process and its children)

Hmm, maybe.

>
> Doing so, a process is only allowed to insert a rule if a node was
> previously added. To forbid itself from inserting new rules into one of its
> children, a process just needs to not add a node before forking. Like
> this, there is no need for special rule flags nor default behavior,
> everything is explicit.

This is still slightly too complicated.  If you really want an
operation that adds a layer (please don't call it a node in the ABI)
and adds a rule to that layer in a single operation, it should be a
separate operation.

Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY

2017-03-02 Thread Andy Lutomirski
On Tue, Feb 28, 2017 at 7:28 PM, David Miller <da...@davemloft.net> wrote:
> From: Andy Lutomirski <l...@amacapital.net>
> Date: Tue, 28 Feb 2017 13:06:49 -0800
>
>> On Tue, Feb 28, 2017 at 12:43 PM, Willem de Bruijn
>> <willemdebruijn.ker...@gmail.com> wrote:
>>> On Tue, Feb 28, 2017 at 2:46 PM, Andy Lutomirski <l...@amacapital.net> 
>>> wrote:
>>>> On Mon, Feb 27, 2017 at 10:57 AM, Michael Kerrisk
>>>> <mtk.manpa...@gmail.com> wrote:
>>>>> [CC += linux-...@vger.kernel.org]
>>>>>
>>>>> Hi Willem
>>>>>
>>>>
>>>>>> On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
>>>>>> creates skbuff fragments directly from these pages. On tx completion,
>>>>>> it notifies the socket owner that it is safe to modify memory by
>>>>>> queuing a completion notification onto the socket error queue.
>>>>
>>>> What happens if the user writes to the pages while it's not safe?
>>>>
>>>> How about if you're writing to an interface or a route that has crypto
>>>> involved and a malicious user can make the data change in the middle
>>>> of a crypto operation, thus perhaps leaking the entire key?  (I
>>>> wouldn't be at all surprised if a lot of provably secure AEAD
>>>> constructions are entirely compromised if an attacker can get the
>>>> ciphertext and tag computed from a message that changed during the
>>>> computation.
>>>
>>> Operations that read or write payload, such as this crypto example,
>>> but also ebpf in tc or iptables, for instance, demand a deep copy using
>>> skb_copy_ubufs before the operation.
>>>
>>> This blacklist approach requires caution, but these paths should be
>>> few and countable. It is not possible to predict at the socket layer
>>> whether a packet will encounter any such operation, so white-listing
>>> a subset of end-to-end paths is not practical.
>>
>> How about hardware that malfunctions if the packet changes out from
>> under it?  A whitelist seems quite a bit safer.
>
> These devices are already choking, because as I stated this can already
> be done via sendfile().
>
> Networking card wise this isn't an issue, chips bring the entire packet
> into their FIFO, compute checksums on the fly mid-stream, and then write
> the 16-bit checksum field before starting to write the packet onto the
> wire.
>
> I think this is completely a non-issue, and we thought about this right
> from the start when sendfile() support was added nearly two decades ago.
> If network cards from back then didn't crap out in this situation I
> think the ones out there now are probably ok.

Fair enough.


Re: [PATCH v5 06/10] seccomp,landlock: Handle Landlock events per process hierarchy

2017-03-02 Thread Andy Lutomirski
On Wed, Mar 1, 2017 at 3:28 PM, Mickaël Salaün <m...@digikod.net> wrote:
>
>
> On 01/03/2017 23:20, Andy Lutomirski wrote:
>> On Wed, Mar 1, 2017 at 2:14 PM, Mickaël Salaün <m...@digikod.net> wrote:
>>>
>>> On 28/02/2017 21:01, Andy Lutomirski wrote:
>>>> On Tue, Feb 21, 2017 at 5:26 PM, Mickaël Salaün <m...@digikod.net> wrote:
>>> This design makes it possible for a process to add more constraints to
>>> its children on the fly. I think it is a good feature to have and a
>>> safer default inheritance mechanism, but it could be guarded by an
>>> option flag if we want both mechanism to be available. The same design
>>> could be used by seccomp filter too.
>>>
>>
>> Then let's do it right.
>>
>> Currently each task has an array of seccomp filter layers.  When a
>> task forks, the child inherits the layers.  All the layers are
>> presently immutable.  With Landlock, a layer can logically be a
>> syscall filter layer or a Landlock layer.  This fits into the
>> existing model just fine.
>>
>> If we want to have an interface to allow modification of an existing
>> layer, let's make it so that, when a layer is added, you have to
>> specify a flag to make the layer modifiable (by current, presumably,
>> although I can imagine other policies down the road).  Then have a
>> separate API that modifies a layer.
>>
>> IOW, I think your patch is bad for three reasons, all fixable:
>>
>> 1. The default is wrong.  A layer should be immutable to avoid an easy
>> attack in which you try to sandbox *yourself* and then you just modify
>> the layer to weaken it.
>
> This is not possible, there is only one operation for now:
> SECCOMP_ADD_LANDLOCK_RULE. You can only add more rules to the list (as
> for seccomp filter). There is no way to weaken a sandbox. The question
> is: how do we want to handle the rules *tree* (from the kernel point of
> view)?
>

Fair enough.  But I still think that immutability (like regular
seccomp) should be the default.  For security, simplicity is
important.  I guess there could be two ways to relax immutability:
allowing making the layer stricter and allowing any change at all.

As a default, though, programs should be able to expect that:

seccomp(SECCOMP_ADD_WHATEVER, ...);
fork();

[parent gets compromised]
[in parent]seccomp(anything whatsoever);

will not affect the child.  If the parent wants to relax that, that's
fine, but I think it should be explicit.

>>
>> 2. The API that adds a layer should be different from the API that
>> modifies a layer.
>
> Right, but it doesn't apply now because we can only add rules.

That's not what the code appears to do, though.  Sometimes it makes a
new layer without modifying tasks that share the layer and sometimes
it modifies the layer.

Both operations are probably okay, but they're not the same operation
and they shouldn't pretend to be.


>
>>
>> 3. The whole modification mechanism should be a separate patch to be
>> reviewed on its own merits.
>
> For a rule *replacement*, sure!

And for modification of policy for non-current tasks.  That's a big
departure from normal seccomp and should be reviewed as such.


Re: [PATCH v5 06/10] seccomp,landlock: Handle Landlock events per process hierarchy

2017-03-01 Thread Andy Lutomirski
On Wed, Mar 1, 2017 at 2:14 PM, Mickaël Salaün <m...@digikod.net> wrote:
>
> On 28/02/2017 21:01, Andy Lutomirski wrote:
>> On Tue, Feb 21, 2017 at 5:26 PM, Mickaël Salaün <m...@digikod.net> wrote:
>>> The seccomp(2) syscall can be used to apply a Landlock rule to the
>>> current process. As with a seccomp filter, the Landlock rule is enforced
>>> for all its future children. An inherited rule tree can be updated
>>> (append-only) by the owner of inherited Landlock nodes (e.g. a parent
>>> process that creates a new rule)
>>
>> Can you clarify exactly what this type of update does?  Is it
>> something that should be supported by normal seccomp rules as well?
>
> There is two main structures involved here: struct landlock_node and
> struct landlock_rule, both defined in include/linux/landlock.h [02/10].
>
> Let's take an example with seccomp filter and then Landlock:
> * seccomp filter: Process P1 creates and applies a seccomp filter F1 to
> itself. Then it forks and creates a child P2, which inherits P1's
> filters, hence F1. Now, if P1 add a new seccomp filter F2 to itself, P2
> *won't get it*. The P2's filter list will still only contain F1 but not
> F2. If P2 sets up and applies a new filter F3 to itself, its filter list
> will contains F1 and F3.
> * Landlock: Process P1 creates and applies a Landlock rule R1 to itself.
> Underneath the kernel creates a new node N1 dedicated to P1, which
> contains all its rules. Then P1 forks and creates a child P2, which
> inherits P1's rules, hence R1. Underneath P2 inherited N1. Now, if P1
> add a new Landlock rule R2 to itself, P2 *will get it* as well (because
> R2 is part of N1). If P2 creates and applies a new rule R3 to itself,
> its rules will contain R1, R2 and R3. Underneath the kernel created a
> new node N2 for P2, which only contains R3 but inherits/links to N1.
>
> This design makes it possible for a process to add more constraints to
> its children on the fly. I think it is a good feature to have and a
> safer default inheritance mechanism, but it could be guarded by an
> option flag if we want both mechanism to be available. The same design
> could be used by seccomp filter too.
>

Then let's do it right.

Currently each task has an array of seccomp filter layers.  When a
task forks, the child inherits the layers.  All the layers are
presently immutable.  With Landlock, a layer can logically be a
syscall filter layer or a Landlock layer.  This fits into the
existing model just fine.

If we want to have an interface to allow modification of an existing
layer, let's make it so that, when a layer is added, you have to
specify a flag to make the layer modifiable (by current, presumably,
although I can imagine other policies down the road).  Then have a
separate API that modifies a layer.
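
To make that concrete, I'm picturing roughly the following -- every name
below is made up, it's just the shape of the interface I mean (assuming a
thin seccomp() wrapper around the syscall):

/* Hypothetical seccomp(2) commands and flags; none of these exist. */
#define SECCOMP_ADD_LAYER	4
#define SECCOMP_MODIFY_LAYER	5
#define SECCOMP_LAYER_MUTABLE	0x1	/* opt-in; immutable is the default */

extern int seccomp(unsigned int op, unsigned int flags, void *args);

static int sandbox_self(void *landlock_prog, void *extra_rule)
{
	/* Adding a layer is one operation; it returns a handle and is
	 * immutable unless explicitly flagged otherwise. */
	int layer = seccomp(SECCOMP_ADD_LAYER, SECCOMP_LAYER_MUTABLE,
			    landlock_prog);
	if (layer < 0)
		return layer;
	/* Modifying a layer is a different operation with its own checks. */
	return seccomp(SECCOMP_MODIFY_LAYER, layer, extra_rule);
}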

IOW, I think your patch is bad for three reasons, all fixable:

1. The default is wrong.  A layer should be immutable to avoid an easy
attack in which you try to sandbox *yourself* and then you just modify
the layer to weaken it.

2. The API that adds a layer should be different from the API that
modifies a layer.

3. The whole modification mechanism should be a separate patch to be
reviewed on its own merits.

> The current inheritance mechanism doesn't allow adding a rule only to
> the current process. The rule will be inherited by its children
> (starting from the children created after the first applied rule). An
> option flag NEW_RULE_HIERARCHY (or maybe another seccomp operation)
> could enable creating a new node for the current process, and then
> make it not inherited by the previous children.

I like my proposal above much better.  "Add a layer" and "change a
layer" should be different operations.

--Andy


Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY

2017-02-28 Thread Andy Lutomirski
On Tue, Feb 28, 2017 at 2:40 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
> On Tue, 2017-02-28 at 14:25 -0800, Andy Lutomirski wrote:
>> On Tue, Feb 28, 2017 at 1:47 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
>> > On Tue, 2017-02-28 at 13:09 -0800, Andy Lutomirski wrote:
>> >
>> >> Does this mean that a user program that does a zerocopy send can cause
>> >> a retransmitted segment to contain different data than the original
>> >> segment?  If so, is that okay?
>> >
>> > Same remark applies to sendfile() already
>>
>> True.
>>
>> >, or other zero copy modes
>> > (vmsplice() + splice() )
>>
>> I hate vmsplice().  I thought I remembered it being essentially
>> disabled at some point due to security problems.
>
> Right, zero copy is hard ;)
>
> vmsplice() is not disabled in current kernels, unless I missed
> something.
>

I think you're right.  That being said, from the man page:

The user pages are a gift to the kernel.  The application  may  not
modify this memory ever, otherwise the page cache and on-disk data may
differ.

This is just not okay IMO.


Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY

2017-02-28 Thread Andy Lutomirski
On Tue, Feb 28, 2017 at 1:47 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
> On Tue, 2017-02-28 at 13:09 -0800, Andy Lutomirski wrote:
>
>> Does this mean that a user program that does a zerocopy send can cause
>> a retransmitted segment to contain different data than the original
>> segment?  If so, is that okay?
>
> Same remark applies to sendfile() already

True.

>, or other zero copy modes
> (vmsplice() + splice() )

I hate vmsplice().  I thought I remembered it being essentially
disabled at some point due to security problems.

>


Re: [PATCH v5 06/10] seccomp,landlock: Handle Landlock events per process hierarchy

2017-02-28 Thread Andy Lutomirski
On Tue, Feb 21, 2017 at 5:26 PM, Mickaël Salaün  wrote:
> The seccomp(2) syscall can be used to apply a Landlock rule to the
> current process. As with a seccomp filter, the Landlock rule is enforced
> for all its future children. An inherited rule tree can be updated
> (append-only) by the owner of inherited Landlock nodes (e.g. a parent
> process that creates a new rule)

Can you clarify exactly what this type of update does?  Is it
something that should be supported by normal seccomp rules as well?

> +/**
> + * landlock_run_prog - run Landlock program for a syscall

Unless this is actually specific to syscalls, s/for a syscall//, perhaps?

> +   if (new_events->nodes[event_idx]->owner ==
> +   &new_events->nodes[event_idx]) {
> +   /* We are the owner, we can then update the node. */
> +   add_landlock_rule(new_events, rule);

This is the part I don't get.  Adding a rule if you're the owner (BTW,
why is ownership visible to userspace at all?) for just yourself and
future children is very different from adding it so it applies to
preexisting children too.


> +   } else if (atomic_read(&new_events->usage) == 1) {
> +   WARN_ON(new_events->nodes[event_idx]->owner);
> +   /*
> +* We can become the new owner if no other task uses it.
> +* This avoids an unnecessary allocation.
> +*/
> +   new_events->nodes[event_idx]->owner =
> +   &new_events->nodes[event_idx];
> +   add_landlock_rule(new_events, rule);
> +   } else {
> +   /*
> +* We are not the owner, we need to fork 
> current_events
> +* and then add a new node.
> +*/
> +   struct landlock_node *node;
> +   size_t i;
> +
> +   node = kmalloc(sizeof(*node), GFP_KERNEL);
> +   if (!node) {
> +   new_events = ERR_PTR(-ENOMEM);
> +   goto put_rule;
> +   }
> +   atomic_set(&node->usage, 1);
> +   /* set the previous node after the new_events
> +* allocation */
> +   node->prev = NULL;
> +   /* do not increment the previous node usage */
> +   node->owner = &new_events->nodes[event_idx];
> +   /* rule->prev is already NULL */
> +   atomic_set(>usage, 1);
> +   node->rule = rule;
> +
> +   new_events = new_raw_landlock_events();
> +   if (IS_ERR(new_events)) {
> +   /* put the rule as well */
> +   put_landlock_node(node);
> +   return ERR_PTR(-ENOMEM);
> +   }
> +   for (i = 0; i < ARRAY_SIZE(new_events->nodes); i++) {
> +   new_events->nodes[i] =
> +   lockless_dereference(current_events->nodes[i]);
> +   if (i == event_idx)
> +   node->prev = new_events->nodes[i];
> +   if (!WARN_ON(!new_events->nodes[i]))
> +   atomic_inc(&new_events->nodes[i]->usage);
> +   }
> +   new_events->nodes[event_idx] = node;
> +
> +   /*
> +* @current_events will not be freed here because its usage
> +* field is > 1. It is only prevented from being freed by another
> +* subject thanks to the caller of landlock_append_prog(), which
> +* should be locked if needed.
> +*/
> +   put_landlock_events(current_events);
> +   }
> +   }
> +   return new_events;
> +
> +put_prog:
> +   bpf_prog_put(prog);
> +   return new_events;
> +
> +put_rule:
> +   put_landlock_rule(rule);
> +   return new_events;
> +}
> +
> +/**
> + * landlock_seccomp_append_prog - attach a Landlock rule to the current process
> + *
> + * current->seccomp.landlock_events is lazily allocated. When a process forks,
> + * only a pointer is copied. When a new event is added by a process, if there
> + * are other references to this process' landlock_events, then a new allocation
> + * is made to contain an array pointing to Landlock rule lists. This design
> + * has low-performance impact and is memory efficient while keeping the
> + * property of append-only rules.
> + *
> 

Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY

2017-02-28 Thread Andy Lutomirski
On Tue, Feb 28, 2017 at 12:43 PM, Willem de Bruijn
 wrote:
>
>> I can see this working if you have a special type of skb that
>> indicates that the data might be concurrently written and have all the
>> normal skb APIs (including, especially, anything that clones it) make
>> a copy first.
>
> Support for cloned skbs is required for TCP, both at tcp_transmit_skb
> and segmentation offload. Patch 4 especially adds reference counting
> of shared pages across clones and other sk_buff operations like
> pskb_expand_head. This still allows for deep copy (skb_copy_ubufs)
> on clones in specific datapaths like the above.

Does this mean that a user program that does a zerocopy send can cause
a retransmitted segment to contain different data than the original
segment?  If so, is that okay?


Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY

2017-02-28 Thread Andy Lutomirski
On Tue, Feb 28, 2017 at 12:43 PM, Willem de Bruijn
<willemdebruijn.ker...@gmail.com> wrote:
> On Tue, Feb 28, 2017 at 2:46 PM, Andy Lutomirski <l...@amacapital.net> wrote:
>> On Mon, Feb 27, 2017 at 10:57 AM, Michael Kerrisk
>> <mtk.manpa...@gmail.com> wrote:
>>> [CC += linux-...@vger.kernel.org]
>>>
>>> Hi Willem
>>>
>>
>>>> On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
>>>> creates skbuff fragments directly from these pages. On tx completion,
>>>> it notifies the socket owner that it is safe to modify memory by
>>>> queuing a completion notification onto the socket error queue.
>>
>> What happens if the user writes to the pages while it's not safe?
>>
>> How about if you're writing to an interface or a route that has crypto
>> involved and a malicious user can make the data change in the middle
>> of a crypto operation, thus perhaps leaking the entire key?  (I
>> wouldn't be at all surprised if a lot of provably secure AEAD
>> constructions are entirely compromised if an attacker can get the
>> ciphertext and tag computed from a message that changed during the
>> computation.
>
> Operations that read or write payload, such as this crypto example,
> but also ebpf in tc or iptables, for instance, demand a deep copy using
> skb_copy_ubufs before the operation.
>
> This blacklist approach requires caution, but these paths should be
> few and countable. It is not possible to predict at the socket layer
> whether a packet will encounter any such operation, so white-listing
> a subset of end-to-end paths is not practical.

How about hardware that malfunctions if the packet changes out from
under it?  A whitelist seems quite a bit safer.

--Andy


Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY

2017-02-28 Thread Andy Lutomirski
On Mon, Feb 27, 2017 at 10:57 AM, Michael Kerrisk
 wrote:
> [CC += linux-...@vger.kernel.org]
>
> Hi Willem
>

>> On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
>> creates skbuff fragments directly from these pages. On tx completion,
>> it notifies the socket owner that it is safe to modify memory by
>> queuing a completion notification onto the socket error queue.
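
As I read it, the intended flow is roughly this (sketch; MSG_ZEROCOPY and
the error-queue completion come from this series, not a released kernel):

#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000	/* value used by the series */
#endif

static int send_zerocopy(int fd, const void *buf, size_t len)
{
	char control[128];
	struct msghdr msg;

	if (send(fd, buf, len, MSG_ZEROCOPY) < 0)
		return -errno;

	/* buf must stay untouched until the completion arrives on the
	 * error queue -- that window is exactly what I'm asking about. */
	memset(&msg, 0, sizeof(msg));
	msg.msg_control = control;
	msg.msg_controllen = sizeof(control);
	if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
		return -errno;
	return 0;
}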

What happens if the user writes to the pages while it's not safe?

How about if you're writing to an interface or a route that has crypto
involved and a malicious user can make the data change in the middle
of a crypto operation, thus perhaps leaking the entire key?  (I
wouldn't be at all surprised if a lot of provably secure AEAD
constructions are entirely compromised if an attacker can get the
ciphertext and tag computed from a message that changed during the
computation.

I can see this working if you have a special type of skb that
indicates that the data might be concurrently written and have all the
normal skb APIs (including, especially, anything that clones it) make
a copy first.

--Andy


Re: [PATCH v5 10/10] landlock: Add user and kernel documentation for Landlock

2017-02-21 Thread Andy Lutomirski
On Tue, Feb 21, 2017 at 5:26 PM, Mickaël Salaün <m...@digikod.net> wrote:
> This documentation can be built with the Sphinx framework.
>
> Signed-off-by: Mickaël Salaün <m...@digikod.net>
> Cc: Alexei Starovoitov <a...@kernel.org>
> Cc: Andy Lutomirski <l...@amacapital.net>
> Cc: Daniel Borkmann <dan...@iogearbox.net>
> Cc: David S. Miller <da...@davemloft.net>
> Cc: James Morris <james.l.mor...@oracle.com>
> Cc: Jonathan Corbet <cor...@lwn.net>
> Cc: Kees Cook <keesc...@chromium.org>
> Cc: Serge E. Hallyn <se...@hallyn.com>


> +
> +Writing a rule
> +--
> +
> +To enforce a security policy, a thread first needs to create a Landlock rule.
> +The easiest way to write an eBPF program depicting a security rule is to write
> +it in the C language.  As described in *samples/bpf/README.rst*, LLVM can
> +compile such programs.  Files *samples/bpf/landlock1_kern.c* and those in
> +*tools/testing/selftests/landlock/rules/* can be used as examples.  The
> +following example is a simple rule to forbid file creation, whatever syscall
> +may be used (e.g. open, mkdir, link...).
> +
> +.. code-block:: c
> +
> +static int deny_file_creation(struct landlock_context *ctx)
> +{
> +    if (ctx->arg2 & LANDLOCK_ACTION_FS_NEW)
> +        return 1;
> +    return 0;
> +}
> +
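
(For instance, I assume composing flags is meant to read like this -- my
sketch, reusing the doc's identifiers:

static int deny_write_and_creation(struct landlock_context *ctx)
{
    if (ctx->arg2 & (LANDLOCK_ACTION_FS_NEW | LANDLOCK_ACTION_FS_WRITE))
        return 1;	/* denied */
    return 0;		/* allowed */
}

which makes the opaque "arg2" name stand out even more.)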

Would it make sense to define landlock_context (or at least a prefix
thereof) in here?  Also, can't "arg2" have a better name?

Can you specify what the return value means?  Are 0 and 1 the only
choices?  Would "KILL" be useful?  How about "COREDUMP"?

> +File system action types
> +
> +
> +Flags are used to express actions.  This makes it possible to compose actions
> +and leaves room for future improvements to add more fine-grained action types.
> +
> +.. kernel-doc:: include/uapi/linux/bpf.h
> +:doc: landlock_action_fs
> +
> +.. flat-table:: FS action types availability
> +
> +* - flags
> +  - since
> +
> +* - LANDLOCK_ACTION_FS_EXEC
> +  - v1
> +
> +* - LANDLOCK_ACTION_FS_WRITE
> +  - v1
> +
> +* - LANDLOCK_ACTION_FS_READ
> +  - v1
> +
> +* - LANDLOCK_ACTION_FS_NEW
> +  - v1
> +
> +* - LANDLOCK_ACTION_FS_GET
> +  - v1
> +
> +* - LANDLOCK_ACTION_FS_REMOVE
> +  - v1
> +
> +* - LANDLOCK_ACTION_FS_IOCTL
> +  - v1
> +
> +* - LANDLOCK_ACTION_FS_LOCK
> +  - v1
> +
> +* - LANDLOCK_ACTION_FS_FCNTL
> +  - v1

What happens if you run an old program on a new kernel?  Can you get
unexpected action types?

> +
> +
> +Ability types
> +-
> +
> +The ability of a Landlock rule describes the available features (i.e. context
> +fields and helpers).  This is useful to abstract user-space privileges for
> +Landlock rules, which may not need all abilities (e.g. debug).  Only the
> +minimal set of abilities should be used (e.g. disable debug once in
> +production).
> +
> +
> +.. kernel-doc:: include/uapi/linux/bpf.h
> +:doc: landlock_subtype_ability
> +
> +.. flat-table:: Ability types availability
> +
> +* - flags
> +  - since
> +  - capability
> +
> +* - LANDLOCK_SUBTYPE_ABILITY_WRITE
> +  - v1
> +  - CAP_SYS_ADMIN
> +
> +* - LANDLOCK_SUBTYPE_ABILITY_DEBUG
> +  - v1
> +  - CAP_SYS_ADMIN
> +

What do "WRITE" and "DEBUG" mean in this context?  I'm totally lost.

Hmm.  Reading below, "WRITE" seems to mean "modify state".  Would that
be accurate?

> +
> +Helper functions
> +
> +
> +See *include/uapi/linux/bpf.h* for functions documentation.
> +
> +.. flat-table:: Generic functions availability
> +

> +
> +* - bpf_get_current_comm
> +  - v1
> +  - LANDLOCK_SUBTYPE_ABILITY_DEBUG

What would this be used for?

> +* - bpf_get_trace_printk
> +  - v1
> +  - LANDLOCK_SUBTYPE_ABILITY_DEBUG
> +

This is different from the other DEBUG stuff in that it has side
effects.  I wonder if it should have a different flag.

--Andy


Re: [PATCH v4 net] bpf: add bpf_sk_netns_id() helper

2017-02-15 Thread Andy Lutomirski
On Wed, Feb 15, 2017 at 7:18 PM, David Ahern <d...@cumulusnetworks.com> wrote:
> On 2/15/17 8:08 PM, Eric W. Biederman wrote:
>> David Ahern <d...@cumulusnetworks.com> writes:
>>
>>> On 2/14/17 12:21 AM, Eric W. Biederman wrote:
>>>>> in cases where bpf programs are looking at sockets and packets
>>>>> that belong to different netns, it could be useful to get an id
>>>>> that uniquely identify a netns within the whole system.
>>>> It could be useful but there is no unique namespace id.
>>>>
>>>
>>> Have you given thought to a unique namespace id? Networking tracepoints
>>> for example could really benefit from a unique id.
>>
>> An id from the perspective of a process in the initial instance of every
>> namespace is certainly possible.
>>
>> A truly unique id is just not maintainable.  Think of the question how
>> do you assign every device in the world a guaranteed unique IP address
>> without coordination, that is routable.  It is essentially the same
>> problem.
>>
>> AKA it is theoretically possible and very expensive.  It is much easier
>> and much more maintainable for identifiers to have scope and only be
>> unique within that scope.
>
>
> I don't mean unique in the entire world, I mean unique within a single
> system.
>
> Tracepoints are code based and have global scope. I would like to be
> able to correlate, for example, FIB lookups within a single network
> namespace. Having an id that I could filter on when collecting or match
> when dumping them goes a long way.

Why wouldn't an id relative to your logging program work?  Global ids
are problematic because they are incompatible with tools like CRIU.

-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: [PATCH v2 net] bpf: introduce BPF_F_ALLOW_OVERRIDE flag

2017-02-12 Thread Andy Lutomirski
>> +++ b/samples/bpf/test_cgrp2_attach2.c
>> @@ -79,11 +79,12 @@ int main(int argc, char **argv)
>>   if (join_cgroup(FOO))
>>   goto err;
>>
>> - if (bpf_prog_attach(drop_prog, foo, BPF_CGROUP_INET_EGRESS)) {
>> + if (bpf_prog_attach(drop_prog, foo, BPF_CGROUP_INET_EGRESS, 1)) {
>>   log_err("Attaching prog to /foo");
>>   goto err;
>>   }
>>
>> + printf("Attached DROP prog. This ping in cgroup /foo should 
>> fail...\n");
>>   assert(system(PING_CMD) != 0);
>>
>>   /* Create cgroup /foo/bar, get fd, and join it */
>> @@ -94,24 +95,27 @@ int main(int argc, char **argv)
>>   if (join_cgroup(BAR))
>>   goto err;
>>
>> + printf("Attached DROP prog. This ping in cgroup /foo/bar should 
>> fail...\n");
>>   assert(system(PING_CMD) != 0);
>>
>> - if (bpf_prog_attach(allow_prog, bar, BPF_CGROUP_INET_EGRESS)) {
>> + if (bpf_prog_attach(allow_prog, bar, BPF_CGROUP_INET_EGRESS, 1)) {
>>   log_err("Attaching prog to /foo/bar");
>>   goto err;
>>   }
>>
>> + printf("Attached PASS prog. This ping in cgroup /foo/bar should 
>> pass...\n");
>>   assert(system(PING_CMD) == 0);
>>
>> -
>>   if (bpf_prog_detach(bar, BPF_CGROUP_INET_EGRESS)) {
>>   log_err("Detaching program from /foo/bar");
>>   goto err;
>>   }
>>
>> + printf("Detached PASS from /foo/bar while DROP is attached to /foo.\n"
>> +"This ping in cgroup /foo/bar should fail...\n");
>>   assert(system(PING_CMD) != 0);
>>
>> - if (bpf_prog_attach(allow_prog, bar, BPF_CGROUP_INET_EGRESS)) {
>> + if (bpf_prog_attach(allow_prog, bar, BPF_CGROUP_INET_EGRESS, 1)) {
>>   log_err("Attaching prog to /foo/bar");
>>   goto err;
>>   }
>> @@ -121,8 +125,60 @@ int main(int argc, char **argv)
>>   goto err;
>>   }
>>
>> + printf("Attached PASS from /foo/bar and detached DROP from /foo.\n"
>> +"This ping in cgroup /foo/bar should pass...\n");
>>   assert(system(PING_CMD) == 0);
>>
>> + if (bpf_prog_attach(allow_prog, bar, BPF_CGROUP_INET_EGRESS, 1)) {
>> + log_err("Attaching prog to /foo/bar");
>> + goto err;
>> + }
>> +
>> + if (!bpf_prog_attach(allow_prog, bar, BPF_CGROUP_INET_EGRESS, 0)) {
>> + errno = 0;
>> + log_err("Unexpected success attaching prog to /foo/bar");
>> + goto err;
>> + }
>> +
>> + if (bpf_prog_detach(bar, BPF_CGROUP_INET_EGRESS)) {
>> + log_err("Detaching program from /foo/bar");
>> + goto err;
>> + }
>> +
>> + if (!bpf_prog_detach(foo, BPF_CGROUP_INET_EGRESS)) {
>> + errno = 0;
>> + log_err("Unexpected success in double detach from /foo");
>> + goto err;
>> + }
>> +
>> + if (bpf_prog_attach(allow_prog, foo, BPF_CGROUP_INET_EGRESS, 0)) {
>> + log_err("Attaching non-overridable prog to /foo");
>> + goto err;
>> + }
>> +
>> + if (!bpf_prog_attach(allow_prog, bar, BPF_CGROUP_INET_EGRESS, 0)) {
>> + errno = 0;
>> + log_err("Unexpected success attaching non-overridable prog to 
>> /foo/bar");
>> + goto err;
>> + }
>> +
>> + if (!bpf_prog_attach(allow_prog, bar, BPF_CGROUP_INET_EGRESS, 1)) {
>> + errno = 0;
>> + log_err("Unexpected success attaching overridable prog to 
>> /foo/bar");
>> + goto err;
>> + }
>> +
>> + if (!bpf_prog_attach(allow_prog, foo, BPF_CGROUP_INET_EGRESS, 1)) {
>> + errno = 0;
>> + log_err("Unexpected success attaching overridable prog to 
>> /foo");
>> + goto err;
>> + }
>> +
>> + if (bpf_prog_attach(drop_prog, foo, BPF_CGROUP_INET_EGRESS, 0)) {
>> + log_err("Attaching different non-overridable prog to /foo");
>> + goto err;
>> + }
>> +
>>   goto out;
>>
>>  err:
>> @@ -132,5 +188,9 @@ int main(int argc, char **argv)
>>   close(foo);
>>   close(bar);
>>   cleanup_cgroup_environment();
>> + if (!rc)
>> + printf("PASS\n");
>> + else
>> + printf("FAIL\n");
>>   return rc;
>>  }
>> diff --git a/samples/bpf/test_cgrp2_sock.c b/samples/bpf/test_cgrp2_sock.c
>> index 0791b949cbe4..c3cfb23e23b5 100644
>> --- a/samples/bpf/test_cgrp2_sock.c
>> +++ b/samples/bpf/test_cgrp2_sock.c
>> @@ -75,7 +75,7 @@ int main(int argc, char **argv)
>>   return EXIT_FAILURE;
>>   }
>>
>> - ret = bpf_prog_attach(prog_fd, cg_fd, BPF_CGROUP_INET_SOCK_CREATE);
>> + ret = bpf_prog_attach(prog_fd, cg_fd, BPF_CGROUP_INET_SOCK_CREATE, 0);
>>   if (ret < 0) {
>>   printf("Failed to attach prog to cgroup: '%s'\n",
>>  strerror(errno));
>> diff --git a/samples/bpf/test_cgrp2_sock2.c b/samples/bpf/test_cgrp2_sock2.c
>> index 455ef0d06e93..db036077b644 100644
>> --- a/samples/bpf/test_cgrp2_sock2.c
>> +++ b/samples/bpf/test_cgrp2_sock2.c
>> @@ -55,7 +55,7 @@ int main(int argc, char **argv)
>>   }
>>
>>   ret = bpf_prog_attach(prog_fd[filter_id], cg_fd,
>> -   BPF_CGROUP_INET_SOCK_CREATE);
>> +   BPF_CGROUP_INET_SOCK_CREATE, 0);
>>   if (ret < 0) {
>>   printf("Failed to attach prog to cgroup: '%s'\n",
>>  strerror(errno));
>> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
>> index 3ddb58a36d3c..ae752fa4eaa7 100644
>> --- a/tools/lib/bpf/bpf.c
>> +++ b/tools/lib/bpf/bpf.c
>> @@ -168,7 +168,8 @@ int bpf_obj_get(const char *pathname)
>>   return sys_bpf(BPF_OBJ_GET, &attr, sizeof(attr));
>>  }
>>
>> -int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type)
>> +int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type,
>> + unsigned int flags)
>>  {
>>   union bpf_attr attr;
>>
>> @@ -176,6 +177,7 @@ int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type)
>>   attr.target_fd = target_fd;
>>   attr.attach_bpf_fd = prog_fd;
>>   attr.attach_type   = type;
>> + attr.attach_flags  = flags;
>>
>>   return sys_bpf(BPF_PROG_ATTACH, &attr, sizeof(attr));
>>  }
>> diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
>> index a2f9853dd882..4ac6c4b84100 100644
>> --- a/tools/lib/bpf/bpf.h
>> +++ b/tools/lib/bpf/bpf.h
>> @@ -41,7 +41,8 @@ int bpf_map_delete_elem(int fd, void *key);
>>  int bpf_map_get_next_key(int fd, void *key, void *next_key);
>>  int bpf_obj_pin(int fd, const char *pathname);
>>  int bpf_obj_get(const char *pathname);
>> -int bpf_prog_attach(int prog_fd, int attachable_fd, enum bpf_attach_type type);
>> +int bpf_prog_attach(int prog_fd, int attachable_fd, enum bpf_attach_type type,
>> + unsigned int flags);
>>  int bpf_prog_detach(int attachable_fd, enum bpf_attach_type type);
>>
>>
>>
>
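
For what it's worth, a minimal sketch of a caller using the updated
helper (assuming the patched uapi <linux/bpf.h> exporting
BPF_F_ALLOW_OVERRIDE, per this series):

    #include <fcntl.h>
    #include <unistd.h>
    #include <linux/bpf.h>
    #include "bpf.h"        /* tools/lib/bpf */

    /* Attach an egress program to a cgroup, letting descendants
     * override it.  Returns 0 on success, negative on error. */
    int attach_overridable(int prog_fd, const char *cgroup_path)
    {
            int cg_fd = open(cgroup_path, O_RDONLY | O_DIRECTORY);
            int ret;

            if (cg_fd < 0)
                    return -1;
            ret = bpf_prog_attach(prog_fd, cg_fd,
                                  BPF_CGROUP_INET_EGRESS,
                                  BPF_F_ALLOW_OVERRIDE);
            close(cg_fd);
            return ret;
    }

Passing 0 instead of BPF_F_ALLOW_OVERRIDE gives the non-overridable
behavior the new test cases above exercise.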



-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: [RFC PATCH net] bpf: introduce BPF_F_ALLOW_OVERRIDE flag

2017-02-10 Thread Andy Lutomirski
> +
> +   if (!overridable && prog)
> +   return -EPERM;
> +
> +   if (prog)
> +   overridable = new_overridable;
>
> old_prog = xchg(cgrp->bpf.prog + type, prog);
>
> @@ -101,11 +111,13 @@ void __cgroup_bpf_update(struct cgroup *cgrp,
> struct cgroup *desc = container_of(pos, struct cgroup, self);
>
> /* skip the subtree if the descendant has its own program */
> -   if (desc->bpf.prog[type] && desc != cgrp)
> +   if (desc->bpf.prog[type] && desc != cgrp) {
> pos = css_rightmost_descendant(pos);
> -   else
> +   } else {
> rcu_assign_pointer(desc->bpf.effective[type],
>effective);
> +   desc->bpf.disallow_override[type] = !overridable;
> +   }
> }
>
> if (prog)
> @@ -115,6 +127,7 @@ void __cgroup_bpf_update(struct cgroup *cgrp,
> bpf_prog_put(old_prog);
> static_branch_dec(&cgroup_bpf_enabled_key);
> }
> +   return 0;
>  }
>
>  /**
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 19b6129eab23..bbb016adbaeb 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -920,13 +920,14 @@ static int bpf_obj_get(const union bpf_attr *attr)
>
>  #ifdef CONFIG_CGROUP_BPF
>
> -#define BPF_PROG_ATTACH_LAST_FIELD attach_type
> +#define BPF_PROG_ATTACH_LAST_FIELD attach_flags
>
>  static int bpf_prog_attach(const union bpf_attr *attr)
>  {
> +   enum bpf_prog_type ptype;
> struct bpf_prog *prog;
> struct cgroup *cgrp;
> -   enum bpf_prog_type ptype;
> +   int ret;
>
> if (!capable(CAP_NET_ADMIN))
> return -EPERM;
> @@ -934,6 +935,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
> if (CHECK_ATTR(BPF_PROG_ATTACH))
> return -EINVAL;
>
> +   if (attr->attach_flags & ~BPF_F_ALLOW_OVERRIDE)
> +   return -EINVAL;
> +
> switch (attr->attach_type) {
> case BPF_CGROUP_INET_INGRESS:
> case BPF_CGROUP_INET_EGRESS:
> @@ -956,10 +960,13 @@ static int bpf_prog_attach(const union bpf_attr *attr)
> return PTR_ERR(cgrp);
> }
>
> -   cgroup_bpf_update(cgrp, prog, attr->attach_type);
> +   ret = cgroup_bpf_update(cgrp, prog, attr->attach_type,
> +   attr->attach_flags & BPF_F_ALLOW_OVERRIDE);
> +   if (ret)
> +   bpf_prog_put(prog);
> cgroup_put(cgrp);
>
> -   return 0;
> +   return ret;
>  }
>
>  #define BPF_PROG_DETACH_LAST_FIELD attach_type
> @@ -967,6 +974,7 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>  static int bpf_prog_detach(const union bpf_attr *attr)
>  {
> struct cgroup *cgrp;
> +   int ret;
>
> if (!capable(CAP_NET_ADMIN))
> return -EPERM;
> @@ -982,7 +990,7 @@ static int bpf_prog_detach(const union bpf_attr *attr)
> if (IS_ERR(cgrp))
> return PTR_ERR(cgrp);
>
> -   cgroup_bpf_update(cgrp, NULL, attr->attach_type);
> +   ret = cgroup_bpf_update(cgrp, NULL, attr->attach_type, false);
> cgroup_put(cgrp);
> break;
>
> @@ -990,7 +998,7 @@ static int bpf_prog_detach(const union bpf_attr *attr)
> return -EINVAL;
> }
>
> -   return 0;
> +   return ret;
>  }
>  #endif /* CONFIG_CGROUP_BPF */
>
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 688dd02af985..53bbca7c4859 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -6498,15 +6498,16 @@ static __init int cgroup_namespaces_init(void)
>  subsys_initcall(cgroup_namespaces_init);
>
>  #ifdef CONFIG_CGROUP_BPF
> -void cgroup_bpf_update(struct cgroup *cgrp,
> -  struct bpf_prog *prog,
> -  enum bpf_attach_type type)
> +int cgroup_bpf_update(struct cgroup *cgrp, struct bpf_prog *prog,
> + enum bpf_attach_type type, bool overridable)
>  {
> struct cgroup *parent = cgroup_parent(cgrp);
> +   int ret;
>
> mutex_lock(&cgroup_mutex);
> -   __cgroup_bpf_update(cgrp, parent, prog, type);
> +   ret = __cgroup_bpf_update(cgrp, parent, prog, type, overridable);
> mutex_unlock(&cgroup_mutex);
> +   return ret;
>  }
>  #endif /* CONFIG_CGROUP_BPF */
>
> --
> 2.8.0
>



-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: [PATCH v2 net] bpf: add bpf_sk_netns_id() helper

2017-02-06 Thread Andy Lutomirski
On Mon, Feb 6, 2017 at 5:42 PM, Alexei Starovoitov
<alexei.starovoi...@gmail.com> wrote:
> On Sat, Feb 04, 2017 at 08:17:57PM -0800, Andy Lutomirski wrote:
>> On Sat, Feb 4, 2017 at 8:05 PM, Alexei Starovoitov
>> <alexei.starovoi...@gmail.com> wrote:
>> > On Sat, Feb 04, 2017 at 07:33:14PM -0800, Andy Lutomirski wrote:
>> >> What does "bpf programs are global" mean?  I am genuinely unable to
>> >> figure out what you mean.  Here are some example guesses of what you
>> >> might mean:
>> >>
>> >>  - BPF programs are compiled independently of a namespace.  This is
>> >> certainly true, but I don't think it matters.
>> >>
>> >>  - You want BPF programs to affect everything on the system.  But this
>> doesn't seem right to me -- they only affect things in the relevant
>> >> cgroup, so they're not global in that sense.
>> >
>> > All bpf program types are global in the sense that you can
>> > make all of them to operate across all possible scopes and namespaces.
>>
>> I still don't understand what you mean here.  A seccomp program runs
>> in the process that installs it and children -- it does not run in
>> "all possible scopes".
>
> seccomp and classic bpf is different, since there is no concept
> of the program there. cbpf is a stateless set of instructions
> that belong to some entity like seccomp or socket. ebpf is stateful
> and starts with the program, then hook and then scope.

So... are you saying that a loaded eBPF object is global in the sense
that if you attach the same object to more than one thing (two
sockets, say), the *same* program runs and shares its state?  If so, I
agree, but that's still not an argument that the *same* attachment of
an eBPF program to a cgroup should run in multiple network namespaces.
You could also attach the (same) program once per netns and its state
would be shared.

I'm pretty sure I've never suggested that an eBPF program be bound to
a namespace.  I just think that a lot of things get more
straightforward if an *attachment* of an eBPF program to a cgroup is
bound to a single namespace.

>
>> A socket filter runs on a single socket and
>> therefore runs in a single netns.  So presumably I'm still
>> misunderstanding you
>
> in classic - yes. ebpf can have the same program attached to
> multiple sockets in different netns.
> For classic - the object is the socket and the user can only
> manipulate that socket. For extended - the object is the program
> and it can exist on its own. Like the program can be pinned in bpffs
> while it's not attached to any hook.
> For classic bpf the ideas of CRIU naturally apply, since
> it checkpoints the socket and it happens that socket has
> a set of stateless cbpf instructions within. So it's
> expected to save/restore cbpf as part of socket save/restore.
> In case of ebpf the program exists independently of the socket.

True.

> Doing save/restore of the ebpf program attached to a socket
> is meaningless, since it could be pinned in bpffs, attached
> to other sockets, has state in bpf maps, some other process
> might be actively talking to that program and so on.
> ebpf is a graph of interconnected pieces. To CRIU such a thing
> one really needs to freeze the whole system, all of its processes,
> and stop the kernel. I don't see CRIU ever being applied to ebpf.

Not true, I think.  CRIU could figure out which eBPF program is
attached to a socket and snapshot the attachment and (separately) the
eBPF object.  If the eBPF object were to be attached to two sockets,
CRIU would notice and only snapshot the eBPF program once.  I don't
see why the whole system would need to be snapshotted.

Obviously this code isn't written yet.  But if eBPF were to be widely
used for, say, seccomp, I bet the CRIU code would show up in short
order.

>
> Potentially we can also allow a combination of scopes where
> single bpf program is attached to a hook while scoped by
> both netns and cgroup at the same time.
> I see it as an extension to prog_attach command where
> user will specify two fds (one for cgroup and one for netns) to make
> given prog_fd program scoped by both.
> Nothing in the current api prevents such future extensions.

I think this might be sufficient for all usecases.  Facebook's use
case of monitoring everything (if I understood correctly) could be
addressed by just attaching the program to the cgroup once for each
namespace.  The benefit would be a good deal of simplicity: you could
relax the capable() call to ns_capable() in the future, you wouldn't
need this patch, and you wouldn't be creating the first major
non-netns-specific network hook in the kernel.

>
> And from the other thread:
>
>

Re: [PATCH v3 net] bpf: add bpf_sk_netns_id() helper

2017-02-05 Thread Andy Lutomirski
On Sun, Feb 5, 2017 at 11:24 AM, David Ahern  wrote:
> On 2/3/17 8:34 PM, Alexei Starovoitov wrote:
>> Therefore introduce 'u64 bpf_sk_netns_id(sk);' helper. It returns
>> unique value that identifies netns of given socket or dev_net(skb->dev)
>> The upper 32-bits of the return value contain device id where namespace
>> filesystem resides and lower 32-bits contain inode number within that filesystem.
>> It's the same as
>>  struct stat st;
>>  stat("/proc/pid/ns/net", );
>>  return (st->st_dev << 32)  | st->st_ino;
>>
> ...
>
>> can be considered a new feature, whereas for cgroup_sock
>> programs that modify sk->bound_dev_if (like 'ip vrf' does)
>> it's a bug fix, since 'ip vrf' needs to be netns aware.
>>
>
>
> LGTM.
>
> Reviewed-by: David Ahern 
> Tested-by: David Ahern 
>
> Updated patches for vrf in network namespaces are here:
> https://github.com/dsahern/iproute2 vrf/ip-vrf
>
>
> root@kenny-jessie2:~# ip vrf exec red bash
>
> root@kenny-jessie2:red:~# ping -c1 -w1 10.100.1.254
> PING 10.100.1.254 (10.100.1.254) 56(84) bytes of data.
> 64 bytes from 10.100.1.254: icmp_seq=1 ttl=64 time=0.230 ms
>
> --- 10.100.1.254 ping statistics ---
> 1 packets transmitted, 1 received, 0% packet loss, time 0ms
> rtt min/avg/max/mdev = 0.230/0.230/0.230/0.000 ms
>
> root@kenny-jessie2:red:~# unshare -n
>
> root@kenny-jessie2:red:~# ping -c1 -w1 10.100.1.254
> connect: Network is unreachable
>
>
> Andy: thank you for the feedback on the 'ip vrf' use case. I believe this 
> kernel patch + my iproute2 patches address the issues mentioned so far. 
> Specifically, the transcript above shows your concern about 'unshare -n' case 
> is handled. In one of the many responses last night, you mentioned I have 'a 
> somewhat kludgey fix that gets the "ip netns" case'. If you can elaborate on 
> 'somewhat kludgey', I can fix those this week as well.

What I meant was: fixing it in iproute2 without kernel changes would
be kludgey and incomplete.  You seem to have fixed it by depending on
this kernel patch.
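
For reference, a userspace mirror of that recipe -- the same packing
as the proposed helper, per the commit log quoted above (dev in the
high 32 bits, inode in the low 32):

    #include <stdint.h>
    #include <sys/stat.h>

    /* Returns 0 on stat() failure; call with e.g. "/proc/self/ns/net". */
    static uint64_t netns_id_of(const char *ns_path)
    {
            struct stat st;

            if (stat(ns_path, &st))
                    return 0;
            return ((uint64_t)st.st_dev << 32) | (uint32_t)st.st_ino;
    }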

--Andy


Re: [PATCH net] bpf: expose netns inode to bpf programs

2017-02-04 Thread Andy Lutomirski
On Sat, Feb 4, 2017 at 8:37 PM, Alexei Starovoitov
<alexei.starovoi...@gmail.com> wrote:
> On Sat, Feb 04, 2017 at 07:54:20PM -0800, Andy Lutomirski wrote:
>>
>> I've repeatedly asked how you plan to make a "don't override" flag
>> have sensible semantics when someone tries to add a new flag or change
>> the behavior to "don't override but, rather then rejecting programs
>> down the hierarchy, just run them all".  You haven't answered that
>> question.
>
> I explained already that I need to do combining on the bpf side instead
> of running the list, since running several programs where 90% of
> the logic is the same is overhead that is not acceptable
> for a production server. It may be fine for a desktop, but not
> when every cycle matters. With one program per cgroup I can
> combine multiple of them on bpf side. In networking most of
> the prep work like packet parsing and lookups are common,
> but the actions are different, so for one part of the hierarchy
> I can have program A monitoring tcp and in other
> part I can have program B monitoring tcp and udp.
> What you're saying is that for tcp and udp monitoring
> run two programs. One for udp and one for tcp.
> That is not efficient. Hence the need to combine
> the programs on bpf side and attach only one with override.

I'm not saying that at all.  I'm saying that this use case sounds
valid, but maybe it could be solved differently.  Here are some ideas:

 - Expose the actual cgroup (relative to the hooked cgroup) to the BPF
program.  Then you could parse the headers, check what cgroup you're
in, and proceed accordingly.  This could potentially be even faster
for your use case if done carefully because it will stress the
instruction cache less.

 - Have a (non-default) flag that says "overridable".  The effect is
that, if /A and /A/B both have overridable programs attached, then
sockets in /A/B don't run /A's program.  If, however, /A has a
non-overridable program and /A/B has an overridable program, then
sockets in /A/B run both programs.  IOW overridable programs override
other overridable programs, but non-overridable programs never
override anything and are never overridden by anything.
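
To sketch the second idea's resolution rule (the cgroup fields and
the invocation below are hypothetical, not the 4.10 code):

    static void run_for_socket(struct cgroup *leaf, int type,
                               struct sk_buff *skb)
    {
            bool overridden = false;
            struct cgroup *c;

            /* Walk leaf to root: non-overridable programs always
             * run; only the deepest overridable program runs. */
            for (c = leaf; c; c = cgroup_parent(c)) {
                    struct bpf_prog *prog = c->bpf.prog[type];

                    if (!prog)
                            continue;
                    if (c->bpf.overridable[type]) {
                            if (overridden)
                                    continue;
                            overridden = true;
                    }
                    bpf_prog_run_save_cb(prog, skb);
            }
    }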

>
> The "dont override flag" was also explained before. Here it is again:
> Without breaking above "override" scenario we can add a flag
> that when the program is attached with that flag to one part of
> the hierarchy the override of it will not be allowed in the descendant.
> This extension can be done at any time in the future.
> The only question is what is the default when such flag
> is not present. The current default is override allowed.
> You insist on changing the default right now.
> I don't mind, therefore I'm working on such "dont override flag",
> since if we need to change the default it needs to be done now,
> but if it doesn't happen for 4.10, it's absolutely fine too.
> For security use cases the flag will be added later
> and sandboxing use cases will use that flag too.
> There are no expectations that if cgroup is there the program
> attach command must always succeed. That's why we have error codes
> and we can disallow attach based on this flag or any future
> restrictions. All of it is root now anyway and sandboxing/security
> use cases need to wait until bpf side can be made unprivileged.
> I see current api to have a ton of room for future extensions.
>

Suppose someone wants to make CGROUP_BPF work right when a container
and a container manager both use it.  It'll have to run both programs.
What would the semantics be if this were to be added?  This is really
a question, not an indictment of your approach.  For all I know,
you're proposing exactly what I suggested above.

And sandboxing needn't, and won't, wait until unprivileged bpf
programs are added.  Plenty of sandboxes run as root.


Re: [PATCH v2 net] bpf: add bpf_sk_netns_id() helper

2017-02-04 Thread Andy Lutomirski
On Sat, Feb 4, 2017 at 8:05 PM, Alexei Starovoitov
<alexei.starovoi...@gmail.com> wrote:
> On Sat, Feb 04, 2017 at 07:33:14PM -0800, Andy Lutomirski wrote:
>> On Sat, Feb 4, 2017 at 7:25 PM, Alexei Starovoitov
>> <alexei.starovoi...@gmail.com> wrote:
>> > On Sat, Feb 04, 2017 at 09:15:10AM -0800, Andy Lutomirski wrote:
>> >> On Fri, Feb 3, 2017 at 5:22 PM, Alexei Starovoitov <a...@fb.com> wrote:
>> >> > Note that all bpf programs types are global.
>> >>
>> >> I don't think this has a clear enough meaning to work with.  In
>> >
>> > Please clarify what you mean. The quoted part says
>> > "bpf programs are global". What is not "clear enough" there?
>>
>> What does "bpf programs are global" mean?  I am genuinely unable to
>> figure out what you mean.  Here are some example guesses of what you
>> might mean:
>>
>>  - BPF programs are compiled independently of a namespace.  This is
>> certainly true, but I don't think it matters.
>>
>>  - You want BPF programs to affect everything on the system.  But this
>> doesn't seem right to me -- they only affect things in the relevant
>> cgroup, so they're not global in that sense.
>
> All bpf program types are global in the sense that you can
> make all of them to operate across all possible scopes and namespaces.

I still don't understand what you mean here.  A seccomp program runs
in the process that installs it and children -- it does not run in
"all possible scopes".  A socket filter runs on a single socket and
therefore runs in a single netns.  So presumably I'm still
misunderstanding you

> cgroup only gives a scope for the program to run, but it's
> not limited by it. The user can have the same program
> attached to two or more different cgroups, so one program
> will run across multiple cgroups.

Does this mean "BPF programs are compiled independently of a
namespace?"  If so, I don't see why it's relevant at all.  Sure, you
could compile a BPF program once and install it in two different
scopes, but that doesn't mean that the kernel should *run* it globally
in any sense.  Can you clarify?

>
>>  - The set of BPF program types and the verification rules are
>> independent of cgroup and namespace.  This is true, but I don't think
>> it matters.
>
> It matters. That's actually the key to understand. The loading part
> verifies correctness for particular program type.
> Afterwards the same program can be attached to any place.
> Including different cgroups and different namespaces.
> The 'attach' part is like 'switch on' that enables program
> on particular hook. The scope (whether it's socket or netdev or cgroup)
> is a scope that program author uses to narrow down the hook,
> but it's not an ultimate restriction.
> For example the socket program can be attached to sockets and
> share information with cls_bpf program attached to netdev.
> The kprobe tracing program can peek into kernel internal data
> and share it with cls_bpf or any other type as long as
> everything is root. The information flow is global to the whole system.

Why does any of this imply that a cgroup+bpf program that is attached
once should run for all network namespaces?

>
>> Because we're one week or so from 4.10 final, the 4.10-rc code is
>> problematic even for ip vrf, and there isn't a clear solution yet.
>> There are a bunch of requirements that seem to conflict, and something
>> has to give.
>
> let's go back to the beginning:
> - you've identified a 'malfunction' in ip vrf. It's a valid one. Thank you.
> - can it be fixed without kernel changes? Yes. David offered to do so.

He has (I think) a somewhat kludgey fix that gets the "ip netns" case
right but not the "unshare -n" case.  I think the latter can't be
fixed without kernel changes.


Re: [PATCH net] bpf: expose netns inode to bpf programs

2017-02-04 Thread Andy Lutomirski
On Sat, Feb 4, 2017 at 7:48 PM, Alexei Starovoitov
<alexei.starovoi...@gmail.com> wrote:
> On Sat, Feb 04, 2017 at 07:27:01PM -0800, Andy Lutomirski wrote:
>> On Sat, Feb 4, 2017 at 7:10 PM, Alexei Starovoitov
>> <alexei.starovoi...@gmail.com> wrote:
>> > On Sat, Feb 04, 2017 at 09:07:19AM -0800, Andy Lutomirski wrote:
>> >> >> can see a namespaced view of the world.  For this to work, presumably
>> >> >> we need to make sure that eBPF programs that are installed by programs
>> >> >> that are in a container don't see traffic that isn't in that
>> >> >> container.
>> >> >
>> >> > such approach will break existing users.
>> >>
>> >> What existing users?  The whole feature has never been in a released 
>> >> kernel.
>> >
>> > the v3 patch commit log explains the use cases that work across netns.
>> > And they would break if netns boundary is suddenly started to be
>> > enforced for networking program types.
>>
>> This is why these issues should be addressed carefully before this
>> feature is stabilized in 4.10, and I'm increasingly unconvinced that
>> this will happen.  There may be a single week left, and the cgroup
>> override issue is still entirely unaddressed, for example.
>
> imo cgroup override issue is not an issue and the api can be extended
> at any time. I offered to change the default to disable override,
> but I need override to be there, since it's the fastest and most
> flexible way. So the sooner we can land bpf_sk_netns_id() patch
> the faster I can post override disable.

Override is barely the fastest way.  Long ago in one of these threads,
I showed how the full run-all-the-in-scope-programs behavior could be
implemented with a single correctly-predicted branch worth of overhead
compared to the current behavior.  The run-them-all behavior also
seems like the least surprising, most secure, and most generally
useful behavior.
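
For concreteness, the fast path I mean looks like this -- the slow
path helper is made up, but cgroup_bpf_enabled_key is the existing
static key:

    static inline int cgroup_bpf_run_filter(struct sock *sk,
                                            struct sk_buff *skb,
                                            int type)
    {
            /* One statically-patched, correctly-predicted branch
             * when no program is attached anywhere. */
            if (!static_branch_unlikely(&cgroup_bpf_enabled_key))
                    return 0;
            /* Otherwise walk the socket's cgroup ancestors and run
             * every program in scope. */
            return __cgroup_bpf_run_all(sk, skb, type);
    }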

I've repeatedly asked how you plan to make a "don't override" flag
have sensible semantics when someone tries to add a new flag or change
the behavior to "don't override but, rather then rejecting programs
down the hierarchy, just run them all".  You haven't answered that
question.

Given that we're doing API design and it's way past the merge
window, I just don't see how any of this is ready for 4.10.

>
>> Also, the v3 commit log's example seems a bit weak, since seccomp can
>> handle that case (block non-init-netns raw sockets) and it's not clear
>> to me that a filter like that serves a purpose regardless besides
>> kernel attack surface reduction.
>
> seccomp doesn't even support extended bpf yet and that makes it
> quite inflexible.

You can trivially write a seccomp filter that blocks raw sockets.
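
A minimal sketch with libseccomp, matching the commit log's example
(and assuming an arch that uses socket(2) rather than socketcall(2)):

    #include <errno.h>
    #include <sys/socket.h>
    #include <seccomp.h>

    int deny_raw_sockets(void)
    {
            scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
            int rc;

            if (!ctx)
                    return -1;
            /* Compare only the low type bits (0xf), so OR'd flags
             * like SOCK_NONBLOCK/SOCK_CLOEXEC can't sneak past. */
            rc = seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM),
                                  SCMP_SYS(socket), 1,
                                  SCMP_A1(SCMP_CMP_MASKED_EQ,
                                          0xf, SOCK_RAW));
            if (!rc)
                    rc = seccomp_load(ctx);
            seccomp_release(ctx);
            return rc;
    }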

>
>> > If you're suggesting to limit root netns for cgroup_sock program type only
>> > then David explained weeks ago why it's not acceptable for vrf use
>> > case which was the main reason to add that program type in the first place.
>> > Also it would make one specific bpf program type to be different
>> > from all others which will be confusing.
>> >
>>
>> I'm suggesting limiting it to the init netns for *all* cases.
>> (Specifically, limiting installing programs to the init netns.  But
>> perhaps limiting running of programs to the netns in which they're
>> installed would be the best solution after all.)  If there isn't a
>> well-understood solution that plays well with netns and that satisfies
>> ip vrf, then it's worth seriously thinking about whether ip vrf is
>> really ready for 4.10.
>
> The proposed v3 imo solves the ip vrf 'malfunction' that you found.
> v1 was also solving it and iproute2 patch was ready, but as Eric pointed
> out just ns.inum is not enough.
> The 'malfunction' can also be addressed _without_ kernel changes,
> but the kernel can make it easier, hence v3 patch.

I don't know whether Eric will be okay with your v3.  I pointed out a
possible problem.

>
> Andy, can you please reply in one thread? In particular in v3 patch?
> I'm not sure why you making more or less the same points in
> different threads. I'm sure it's hard for everyone to follow.
>

I can try.  Can you try the same thing so I can reply once?


Re: [PATCH net] bpf: expose netns inode to bpf programs

2017-02-04 Thread Andy Lutomirski
On Sat, Feb 4, 2017 at 7:35 PM, Alexei Starovoitov
<alexei.starovoi...@gmail.com> wrote:
> On Sat, Feb 04, 2017 at 07:22:03PM -0800, Andy Lutomirski wrote:
>> On Sat, Feb 4, 2017 at 7:18 PM, Alexei Starovoitov
>> <alexei.starovoi...@gmail.com> wrote:
>> > On Sat, Feb 04, 2017 at 09:08:38AM -0800, Andy Lutomirski wrote:
>> >> > So use-case would be that someone wants to attach the very same
>> >> > prog via tc to various netdevs sitting in different netns, and
>> >> > that prog looks up a map, controlled by initns, with skb->netns_inum
>> >> > as key and the resulting value could contain allowed feature bits
>> >> > for that specific netns prog the skbs goes through? That would be
>> >> > a feature, not "concern", no? At the same time, it's up to the
>> >> > user or mgmt app what gets loaded so f.e. it might just as well
>> >> > tailor/optimize the progs individually for the devs sitting in
>> >> > netns-es to avoid such map lookup.
>> >>
>> >> Agreed.  I don't see why you would install the exact same program on
>> >> two sockets in different netnses if the program contains, say, an
>> >> ifindex.  Why not just install a variant with the right ifindex into
>> >> each socket?
>> >
>> > In other cases people prefer to have one program compiled once
>> > and thoroughly tested offline, since some folks are worried
>> > that on-the-fly compilation may cause generated code to
>> > be rejected by the verifier.
>>
>> I would be rather surprised if someone wrote a program that had an
>> expression like (ifindex == 17), tested it offline, and refused to
>> update the 17 based on the actual ifindex in use.
>
> All programs have bugs. What bpf subsytem is trying to do is
> to be flexible and satisfy as many use cases as possible.
> Boxing users into "one way to do things" isn't a successful
> strategy in my opinion. perl vs python, if you like :)
> bpf is perl like. We don't enforce spaces vs tabs :)
>

Daniel's asking what the netns query is good for in programs that are
attached to sockets.  I think your example isn't relevant here.  In
fact, I think your example of pre-tested programs is even less
relevant, since a there's no way to know what the netns id will be
prior to the creation of a netns, so you can't usefully hard-code a
netns id into a precompiled BPF program.


Re: [PATCH v2 net] bpf: add bpf_sk_netns_id() helper

2017-02-04 Thread Andy Lutomirski
On Sat, Feb 4, 2017 at 7:25 PM, Alexei Starovoitov
<alexei.starovoi...@gmail.com> wrote:
> On Sat, Feb 04, 2017 at 09:15:10AM -0800, Andy Lutomirski wrote:
>> On Fri, Feb 3, 2017 at 5:22 PM, Alexei Starovoitov <a...@fb.com> wrote:
>> > Note that all bpf programs types are global.
>>
>> I don't think this has a clear enough meaning to work with.  In
>
> Please clarify what you mean. The quoted part says
> "bpf programs are global". What is not "clear enough" there?

What does "bpf programs are global" mean?  I am genuinely unable to
figure out what you mean.  Here are some example guesses of what you
might mean:

 - BPF programs are compiled independently of a namespace.  This is
certainly true, but I don't think it matters.

 - You want BPF programs to affect everything on the system.  But this
doesn't seem right to me -- they only affect things in the relevant
cgroup, so they're not global in that sense.

 - You want BPF programs to affect everything in their cgroup
regardless of namespace.  This does seem to be what you think, but it
doesn't say *why*, which is the relevant bit.

 - The set of BPF program types and the verification rules are
independent of cgroup and namespace.  This is true, but I don't think
it matters.

That's all I came up with.  So, I'll repeat: what does "bpf programs
are global" mean?

>
>> I think that this patch plus a minor change to prevent installing
>> cgroup+bpf programs if the installer isn't in the init netns + fs ns
>> would work because it would allow new, migratable semantics to be
>> added down the road to relax the restriction.
>
> Forcing installer to be in init netns is not acceptable to David
> who added cgroup_sock in the first place. I'm not sure why
> we have to discuss that bit in circles.
>

Because we're one week or so from 4.10 final, the 4.10-rc code is
problematic even for ip vrf, and there isn't a clear solution yet.
There are a bunch of requirements that seem to conflict, and something
has to give.

--Andy


Re: [PATCH net] bpf: expose netns inode to bpf programs

2017-02-04 Thread Andy Lutomirski
On Sat, Feb 4, 2017 at 7:10 PM, Alexei Starovoitov
<alexei.starovoi...@gmail.com> wrote:
> On Sat, Feb 04, 2017 at 09:07:19AM -0800, Andy Lutomirski wrote:
>> >> can see a namespaced view of the world.  For this to work, presumably
>> >> we need to make sure that eBPF programs that are installed by programs
>> >> that are in a container don't see traffic that isn't in that
>> >> container.
>> >
>> > such approach will break existing users.
>>
>> What existing users?  The whole feature has never been in a released kernel.
>
> the v3 patch commit log explains the use cases that work across netns.
> And they would break if netns boundary is suddenly started to be
> enforced for networking program types.

This is why these issues should be addressed carefully before this
feature is stabilized in 4.10, and I'm increasingly unconvinced that
this will happen.  There may be a single week left, and the cgroup
override issue is still entirely unaddressed, for example.

Also, the v3 commit log's example seems a bit weak, since seccomp can
handle that case (block non-init-netns raw sockets) and it's not clear
to me that a filter like that serves a purpose regardless besides
kernel attack surface reduction.

> If you're suggesting to limit root netns for cgroup_sock program type only
> then David explained weeks ago why it's not acceptable for vrf use
> case which was the main reason to add that program type in the first place.
> Also it would make one specific bpf program type to be different
> from all others which will be confusing.
>

I'm suggesting limiting it to the init netns for *all* cases.
(Specifically, limiting installing programs to the init netns.  But
perhaps limiting running of programs to the netns in which they're
installed would be the best solution after all.)  If there isn't a
well-understood solution that plays well with netns and that satisfies
ip vrf, then it's worth seriously thinking about whether ip vrf is
really ready for 4.10.

--Andy


Re: [PATCH net] bpf: expose netns inode to bpf programs

2017-02-04 Thread Andy Lutomirski
On Sat, Feb 4, 2017 at 7:18 PM, Alexei Starovoitov
<alexei.starovoi...@gmail.com> wrote:
> On Sat, Feb 04, 2017 at 09:08:38AM -0800, Andy Lutomirski wrote:
>> > So use-case would be that someone wants to attach the very same
>> > prog via tc to various netdevs sitting in different netns, and
>> > that prog looks up a map, controlled by initns, with skb->netns_inum
>> > as key and the resulting value could contain allowed feature bits
>> > for that specific netns prog the skbs goes through? That would be
>> > a feature, not "concern", no? At the same time, it's up to the
>> > user or mgmt app what gets loaded so f.e. it might just as well
>> > tailor/optimize the progs individually for the devs sitting in
>> > netns-es to avoid such map lookup.
>>
>> Agreed.  I don't see why you would install the exact same program on
>> two sockets in different netnses if the program contains, say, an
>> ifindex.  Why not just install a variant with the right ifindex into
>> each socket?
>
> In other cases people prefer to have one program compiled once
> and thoroughly tested offline, since some folks are worried
> that on-the-fly compilation may cause generated code to
> be rejected by the verifier.

I would be rather surprised if someone wrote a program that had an
expression like (ifindex == 17), tested it offline, and refused to
update the 17 based on the actual ifindex in use.

--Andy


Re: [PATCH v2 net] bpf: add bpf_sk_netns_id() helper

2017-02-04 Thread Andy Lutomirski
On Fri, Feb 3, 2017 at 5:22 PM, Alexei Starovoitov  wrote:
> Note that all bpf programs types are global.

I don't think this has a clear enough meaning to work with.  In
particular, I think that, if you have some software that installs
cgroup+bpf programs and you run it in a container, then I have no idea
what "global" means in this context, and you may run into trouble with
this patch once namespace ids become migratable because the cgroup+bpf
program in the container would potentially see dev+ino numbers from
*outside* the container.  What happens when you migrate it?

I think that this patch plus a minor change to prevent installing
cgroup+bpf programs if the installer isn't in the init netns + fs ns
would work because it would allow new, migratable semantics to be
added down the road to relax the restriction.

Eric, what do you think?


Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2017-02-04 Thread Andy Lutomirski
On Fri, Feb 3, 2017 at 3:21 PM, Alexei Starovoitov
<alexei.starovoi...@gmail.com> wrote:
> On Fri, Feb 03, 2017 at 01:07:39PM -0800, Andy Lutomirski wrote:
>>
>> Is there any plan to address this?  If not, I'll try to write that
>> patch this weekend.
>
> yes. I'm working on 'disallow program override' flag.
> It got stalled, because netns discussion got stalled.
> Later today will send a patch for dev_id+inode and
> will continue on the flag patch.
>

Would it make sense to try to document what your proposal does before
writing the code?  I don't yet see how to get semantics that are both
simple and sensible with a "disallow override" flag.

I *do* see how to get simple, sensible semantics with an approach
where all the programs in scope for the cgroup in question get called.
If needed, I can imagine a special "overridable" program that would
not be run if the socket in question is bound to a descendent cgroup
that also has an "overridable" program but would still let all the
normal hierarchical programs in scope get called.


Re: [PATCH net] bpf: expose netns inode to bpf programs

2017-02-04 Thread Andy Lutomirski
On Fri, Feb 3, 2017 at 3:42 PM, Daniel Borkmann  wrote:
> On 02/04/2017 12:06 AM, Alexei Starovoitov wrote:
>>
>> On Fri, Feb 03, 2017 at 10:56:43PM +0100, Daniel Borkmann wrote:
>>>
>>> On 01/26/2017 04:27 AM, Alexei Starovoitov wrote:

 in cases where bpf programs are looking at sockets and packets
 that belong to different netns, it could be useful to read netns inode,
 so that programs can make intelligent decisions.
 For example to disallow raw sockets in all non-init netns the program
 can do:
 if (sk->type == SOCK_RAW && sk->netns_inum != 0xf075)
return 0;
 where 0xf075 inode comes from /proc/pid/ns/net

 Similarly TC cls_bpf/act_bpf and socket filters can do
 if (skb->netns_inum == expected_inode)

 The lack of netns awareness was a concern even for socket filters,
 since the application can attach the same bpf program to sockets
 in a different netns. Just like tc cls_bpf program can work in
 different netns as well, so it has to be addressed uniformly
 across all types of bpf programs.
>>>
>>>
>>> Sorry for jumping in late, but my question is, isn't this helper
>>> really only relevant for BPF_PROG_TYPE_CGROUP_* typed programs?
>>> Thus other prog types making use of bpf_convert_ctx_access()
>>> should probably reject that in .is_valid_access() callback?
>>>
>>> Reason why I'm asking is that for sockets or tc progs, you
>>> already have a netns context where you're attached to, and f.e.
>>> skbs leaving that netns context will be orphaned. Thus, why
>>> would tc or sock filter tailor a program with such a check,
>>> if it can only match/mismatch its own netns inum eventually?
>>
>>
>> Please see the example I provided earlier.
>
>
> That example for both socket filter and tc progs specifically
> wasn't quite clear to me, hence my question wrt why it's right
> now a "concern" for these ones. (Again, clear to me for cgroups
> progs.)
>
>> We can have the same cls_bpf attached to all netns-es.
>> Same for socket filters and everything else.
>
>
> So use-case would be that someone wants to attach the very same
> prog via tc to various netdevs sitting in different netns, and
> that prog looks up a map, controlled by initns, with skb->netns_inum
> as key and the resulting value could contain allowed feature bits
> for that specific netns prog the skbs goes through? That would be
> a feature, not "concern", no? At the same time, it's up to the
> user or mgmt app what gets loaded so f.e. it might just as well
> tailor/optimize the progs individually for the devs sitting in
> netns-es to avoid such map lookup.

Agreed.  I don't see why you would install the exact same program on
two sockets in different netnses if the program contains, say, an
ifindex.  Why not just install a variant with the right ifindex into
each socket?
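
A sketch of what I mean -- specialize a template at load time instead
of baking one ifindex into a shared object.  The placeholder constant
is an assumption for illustration, not an API:

    #include <stddef.h>
    #include <linux/bpf.h>

    #define IFINDEX_PLACEHOLDER 0x7fffffff  /* sentinel in template */

    /* Rewrite the marker immediate before BPF_PROG_LOAD so each
     * netns/socket gets a variant with the right ifindex. */
    static void patch_ifindex(struct bpf_insn *insns, size_t cnt,
                              __u32 ifindex)
    {
            size_t i;

            for (i = 0; i < cnt; i++)
                    if (insns[i].imm == IFINDEX_PLACEHOLDER)
                            insns[i].imm = (__s32)ifindex;
    }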


Re: [PATCH net] bpf: expose netns inode to bpf programs

2017-02-04 Thread Andy Lutomirski
On Fri, Feb 3, 2017 at 3:08 PM, Alexei Starovoitov
<alexei.starovoi...@gmail.com> wrote:
> On Fri, Feb 03, 2017 at 01:00:47PM -0800, Andy Lutomirski wrote:
>>
>> ISTM any ability to migrate namespaces and to migrate eBPF programs
>> that know about namespaces needs to have the eBPF program firmly
>> rooted in some namespace (or perhaps cgroup in this case) so that it
>
> programs are already global. We cannot break that.

I don't know what you mean here.  It ought to be possible to have a
(privileged) program that installs and uses cgroup+bpf programs run
under CRIU and get migrated.  Maybe not yet, but some day.  This
should be doable without the program noticing, and it would be
unfortunate if the API makes this harder than necessary.

>
>> can see a namespaced view of the world.  For this to work, presumably
>> we need to make sure that eBPF programs that are installed by programs
>> that are in a container don't see traffic that isn't in that
>> container.
>
> such approach will break existing users.

What existing users?  The whole feature has never been in a released kernel.


Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2017-02-03 Thread Andy Lutomirski
On Mon, Jan 23, 2017 at 12:20 PM, Andy Lutomirski <l...@amacapital.net> wrote:
> On Sun, Jan 22, 2017 at 8:31 PM, Alexei Starovoitov
> <alexei.starovoi...@gmail.com> wrote:
>> On Thu, Jan 19, 2017 at 08:04:59PM -0800, Andy Lutomirski wrote:
>>> restricting the types of sockets that can be created, then you do want
>>> the filter to work across namespaces, but seccomp can do that too and
>>> the current code doesn't handle netns correctly.
>>
>> are you saying that seccomp supports netns filtering? please show the proof.
>
> It can trivially restrict the types of sockets that are created by
> filtering on socket(2) syscall parameters, at least on sane
> architectures that don't use socketcall().

I think this is actually wrong -- the socket creation filter appears
to be called only on inet sockets.  Is there a good reason for that?

>
>> To summarize, I think the 'prog override is not allowed' flag would be
>> ok feature to have and I wouldn't mind making it the default when no 'flag'
>> field is passed to bpf syscall, but it's not acceptable to remove current
>> 'prog override is allowed' behavior.
>> So if you insist on changing the default, please implement the flag asap.
>> Though from security point of view and ABI point of view there is absolutely
>> no difference whether this flag is added today or a year later while
>> the default is kept as-is.
>
> It's too late and I have too little time.  I'll try to write a patch
> to change the default to just deny attempts to override.  Better
> behavior can be added later.
>
> IMO your suggestions about priorities are overcomplicated.  For your
> instrumentation needs, I can imagine that a simple "make this hook not
> run if a descendent has a hook" would do it.  For everything else, run
> them all in tree order (depending on filter type) and let the eBPF
> code do whatever it wants to do.

Is there any plan to address this?  If not, I'll try to write that
patch this weekend.


Re: [PATCH net] bpf: expose netns inode to bpf programs

2017-02-03 Thread Andy Lutomirski
On Thu, Feb 2, 2017 at 8:33 PM, Eric W. Biederman <ebied...@xmission.com> wrote:
> Alexei Starovoitov <a...@fb.com> writes:
>
>> On 1/26/17 11:07 AM, Andy Lutomirski wrote:
>>> On Thu, Jan 26, 2017 at 10:32 AM, Alexei Starovoitov <a...@fb.com> wrote:
>>>> On 1/26/17 10:12 AM, Andy Lutomirski wrote:
>>>>>
>>>>> On Thu, Jan 26, 2017 at 9:46 AM, Alexei Starovoitov <a...@fb.com> wrote:
>>>>>>
>>>>>> On 1/26/17 8:37 AM, Andy Lutomirski wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Think of bpf programs as safe kernel modules. They don't have
>>>>>>>> confined boundaries and program authors, if not careful, can shoot
>>>>>>>> themselves in the foot. We're not trying to prevent that because
>>>>>>>> it's impossible to check that the program is sane. Just like
>>>>>>>> it's impossible to check that kernel module is sane.
>>>>>>>> But in case of bpf we check that bpf program is _safe_ from the kernel
>>>>>>>> point of view. If it's doing some garbage, it's program's business.
>>>>>>>> Does it make more sense now?
>>>>>>>>
>>>>>>>
>>>>>>> With all due respect, I think this is not an acceptable way to think
>>>>>>> about BPF at all.  If you think of BPF this way, I think there needs
>>>>>>> to be a real discussion at KS or similar as to whether this is okay.
>>>>>>> The reason is simple: the kernel promises a stable ABI to userspace
>>>>>>> but not to kernel modules.  By thinking of BPF as more like a module,
>>>>>>> you're taking a big shortcut that will either result in ABI breakage
>>>>>>> down the road or in committing to a problematic stable ABI.
>>>>>>
>>>>>>
>>>>>>
>>>>>> you misunderstood the analogy.
>>>>>> bpf abi is certainly stable. that's why we were careful of not
>>>>>> exposing anything to it that is not already stable.
>>>>>>
>>>>>
>>>>> In that case I don't understand what you're trying to say.  Eric
>>>>> thinks your patch exposes a bad interface.  A bad interface for
>>>>> userspace is a very different thing from a bad interface available to
>>>>> kernel modules.  Are you saying that BPF is kernel-module-like in that
>>>>> the ABI exposed to BPF programs doesn't need to meet the same quality
>>>>> standards as userspace ABIs?
>>>>
>>>>
>>>> of course not.
>>>> ns.inum is already exposed to user space as a value.
>>>> This patch exposes it to bpf program in a convenient and stable way,
>>>
>>> Here's what I'm imaging Eric is thinking:
>>>
>>> ns.inum is currently exposed to userspace via procfs.  In principle,
>>> the value could be local to a namespace, though, which would enable
>>> CRIU to be able to preserve namespace inode numbers across a
>>> checkpoint+restore operation.  If this happened, the contained and
>>> restored procfs would see a different inode number than the outermost
>>> procfs.
>>
>> sure. there are many different ways for the program to see inode
>> that either was already reused or disappeared.
>> What I'm saying that it is expected. We cannot prevent that from
>> bpf side. Just like ifindex value read by the program can be bogus
>> as in the example I just provided.
>
> The point is that we can make the inode number stable across migration
> and the user space API for namespaces has been designed with that
> possibility in mind.

How does it help if BPF starts exposing both inode number and device number?

ISTM any ability to migrate namespaces and to migrate eBPF programs
that know about namespaces needs to have the eBPF program firmly
rooted in some namespace (or perhaps cgroup in this case) so that it
can see a namespaced view of the world.  For this to work, presumably
we need to make sure that eBPF programs that are installed by programs
that are in a container don't see traffic that isn't in that
container.  This is part of why I think that we should consider
preventing programs that aren't in the root namespace (perhaps *all*
the root namespaces) from installing bpf+cgroup programs in the first
place until there's a clearer understanding of how this all fits
together.


Re: [PATCH net] bpf: expose netns inode to bpf programs

2017-01-26 Thread Andy Lutomirski
On Thu, Jan 26, 2017 at 10:32 AM, Alexei Starovoitov <a...@fb.com> wrote:
> On 1/26/17 10:12 AM, Andy Lutomirski wrote:
>>
>> On Thu, Jan 26, 2017 at 9:46 AM, Alexei Starovoitov <a...@fb.com> wrote:
>>>
>>> On 1/26/17 8:37 AM, Andy Lutomirski wrote:
>>>>>
>>>>>
>>>>> Think of bpf programs as safe kernel modules. They don't have
>>>>> confined boundaries and program authors, if not careful, can shoot
>>>>> themselves in the foot. We're not trying to prevent that because
>>>>> it's impossible to check that the program is sane. Just like
>>>>> it's impossible to check that kernel module is sane.
>>>>> But in case of bpf we check that bpf program is _safe_ from the kernel
>>>>> point of view. If it's doing some garbage, it's program's business.
>>>>> Does it make more sense now?
>>>>>
>>>>
>>>> With all due respect, I think this is not an acceptable way to think
>>>> about BPF at all.  If you think of BPF this way, I think there needs
>>>> to be a real discussion at KS or similar as to whether this is okay.
>>>> The reason is simple: the kernel promises a stable ABI to userspace
>>>> but not to kernel modules.  By thinking of BPF as more like a module,
>>>> you're taking a big shortcut that will either result in ABI breakage
>>>> down the road or in committing to a problematic stable ABI.
>>>
>>>
>>>
>>> you misunderstood the analogy.
>>> bpf abi is certainly stable. that's why we were careful of not
>>> exposing anything to it that is not already stable.
>>>
>>
>> In that case I don't understand what you're trying to say.  Eric
>> thinks your patch exposes a bad interface.  A bad interface for
>> userspace is a very different thing from a bad interface available to
>> kernel modules.  Are you saying that BPF is kernel-module-like in that
>> the ABI exposed to BPF programs doesn't need to meet the same quality
>> standards as userspace ABIs?
>
>
> of course not.
> ns.inum is already exposed to user space as a value.
> This patch exposes it to bpf program in a convenient and stable way,

Here's what I'm imaging Eric is thinking:

ns.inum is currently exposed to userspace via procfs.  In principle,
the value could be local to a namespace, though, which would enable
CRIU to be able to preserve namespace inode numbers across a
checkpoint+restore operation.  If this happened, the contained and
restored procfs would see a different inode number than the outermost
procfs.

If you start exposing the raw ns.inum field to BPF programs and those
programs are not themselves scoped to a namespace, then this could
create a problem for CRIU.

But you told Eric that his nack doesn't matter, and maybe it would be
nice to ask him to clarify instead.


Re: [PATCH net] bpf: expose netns inode to bpf programs

2017-01-26 Thread Andy Lutomirski
On Thu, Jan 26, 2017 at 9:46 AM, Alexei Starovoitov <a...@fb.com> wrote:
> On 1/26/17 8:37 AM, Andy Lutomirski wrote:
>>>
>>> Think of bpf programs as safe kernel modules. They don't have
>>> confined boundaries and program authors, if not careful, can shoot
>>> themselves in the foot. We're not trying to prevent that because
>>> it's impossible to check that the program is sane. Just like
>>> it's impossible to check that kernel module is sane.
>>> But in case of bpf we check that bpf program is _safe_ from the kernel
>>> point of view. If it's doing some garbage, it's program's business.
>>> Does it make more sense now?
>>>
>>
>> With all due respect, I think this is not an acceptable way to think
>> about BPF at all.  If you think of BPF this way, I think there needs
>> to be a real discussion at KS or similar as to whether this is okay.
>> The reason is simple: the kernel promises a stable ABI to userspace
>> but not to kernel modules.  By thinking of BPF as more like a module,
>> you're taking a big shortcut that will either result in ABI breakage
>> down the road or in committing to a problematic stable ABI.
>
>
> you misunderstood the analogy.
> bpf abi is certainly stable. that's why we were careful of not
> exposing anything to it that is not already stable.
>

In that case I don't understand what you're trying to say.  Eric
thinks your patch exposes a bad interface.  A bad interface for
userspace is a very different thing from a bad interface available to
kernel modules.  Are you saying that BPF is kernel-module-like in that
the ABI exposed to BPF programs doesn't need to meet the same quality
standards as userspace ABIs?

--Andy


Re: [PATCH net] bpf: expose netns inode to bpf programs

2017-01-26 Thread Andy Lutomirski
Hi Linus-

Can you weigh in here before we get stuck in a potentially unfortunate place?

On Wed, Jan 25, 2017 at 10:23 PM, Alexei Starovoitov  wrote:
> On 1/25/17 9:46 PM, Eric W. Biederman wrote:
>>
>> Alexei Starovoitov  writes:
>>

[...]

>>> Similarly TC cls_bpf/act_bpf and socket filters can do
>>> if (skb->netns_inum == expected_inode)
>>
>>
>> Nacked-by: "Eric W. Biederman" 
>
>
> I very much value your opinion, but your ack or nack doesn't apply here.

I'm confused.  Eric, the namespace maintainer, says nack.  This is
namespace code regardless of what file it lives in.

>> The only sane thing is to interpret whatever your bpf program in the
>> context of the program which installs it.
>
>
> that's impossible. The programs are operating in a context that
> is disjoint from the app that installs it.
>
>> If you can't do that you have a very broken piece of userspace
>> interface.  Which appears to be the case here.
>
>
> Call it rubbish, but this is how it is.
> cls_bpf is operating on packets. xdp_bpf is operating on raw dma buffers
> and there we might need eventually lookup sockets and net namespaces.
> Think of bpf programs as safe kernel modules. They don't have
> confined boundaries and program authors, if not careful, can shoot
> themselves in the foot. We're not trying to prevent that because
> it's impossible to check that the program is sane. Just like
> it's impossible to check that kernel module is sane.
> But in case of bpf we check that bpf program is _safe_ from the kernel
> point of view. If it's doing some garbage, it's program's business.
> Does it make more sense now?
>

With all due respect, I think this is not an acceptable way to think
about BPF at all.  If you think of BPF this way, I think there needs
to be a real discussion at KS or similar as to whether this is okay.
The reason is simple: the kernel promises a stable ABI to userspace
but not to kernel modules.  By thinking of BPF as more like a module,
you're taking a big shortcut that will either result in ABI breakage
down the road or in committing to a problematic stable ABI.

In fact, this was discussed a bit at KS in the context of tracepoints.
The general consensus seemed to be that tracepoints should not be
considered fully stable in large part because they're a debugging
feature.  But these BPF hooks are very much not a debugging feature.

Concretely, iproute2 git contains an eBPF program.  If that program
were just a "safe kernel module", then it would be totally fine
(expected, even) for future kernel versions to break it outright.  I
don't think this is how it works.

--Andy


Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns

2017-01-25 Thread Andy Lutomirski
On Tue, Jan 24, 2017 at 4:11 PM, Alexei Starovoitov
<alexei.starovoi...@gmail.com> wrote:
> On Tue, Jan 24, 2017 at 01:24:54PM -0800, Andy Lutomirski wrote:
>> On Tue, Jan 24, 2017 at 12:29 PM, David Ahern <d...@cumulusnetworks.com> 
>> wrote:
>> >
>> > Users do not run around exec'ing commands in random network contexts 
>> > (namespace, vrf, device, whatever) and expect them to just work.
>>

>> setns() is a bit more complicated, but it should still be fine.
>> netns_install() requires CAP_NET_ADMIN over the target netns, so you
>> can only switch in to a netns if you already have privilege in that
>> netns.
>
> it's not fine. Consider our main use case which is cgroup-scoped traffic
> monitoring and enforcement for well behaving apps.

[...]

That does indeed sound like a sensible use case.  Thanks!

>
>> But perhaps they should be less disjoint.  As far as I know,
>> cgroup+bpf is the *only* network configuration that is being set up to
>> run across all network namespaces.  No one has said why this is okay,
>> let alone why it's preferable to making it work per-netns just like
>> basically all other Linux network configuration.
>
> Single cls_bpf program attached to all netdevs in tc egress
> hook can call bpf_skb_under_cgroup() to check whether given skb
> is under given cgroup and afterwards decide to drop/redirect.
> In terms of 'run across all netns' it's exactly the same.
> Our monitoring use case came out of it.
> Unfortunately bpf_skb_under_cgroup() in cls_bpf is not scalable.
> It works fine for quick on the box monitoring where user logs in
> to the box and can see what particular job is doing,
> but it's not usable when there are thousands of cgroups and
> we need to monitor them all.
>
>> I was hoping for an actual likely use case for the bpf hooks to be run
>> in all namespaces.  You're arguing that iproute2 can be made to work
>> mostly okay if bpf hooks can run in all namespaces, but the use case
>> of intentionally making sk_bound_dev_if invalid across all namespaces
>> seems dubious.
>
> I think what Tejun is proposing regarding adding global_ifindex
> is a great idea and it will make ifindex interaction cleaner not only
> for cgroup+bpf, but for socket and cls_bpf programs too.
> I think it would be ok to disallow unprivileged programs (whenever
> they come) to use of bpf_sock->bound_dev_if and only
> allow bpf_sock->global_bound_dev_if and that should solve
> your security concern for future unpriv programs.
>
> I think bpf_get_sock_netns_id() helper or sk->netns_id field would
> be good addition as well, since it will allow 'ip vrf' to be smarter today.
> It's also more flexible, since bpf_type_cgroup_sock program can make
> its own decision when netns_id != expected_id instead of hardcoded.
> Today the ip vrf use case does: 'sk->bound_dev_if = idx; return OK;'
> it will be able to:
>   if (sk->netns_id != expected_id)
> return DROP;
>   sk->bound_dev_if = idx;
>   return OK;
> or
>   if (sk->netns_id != expected_id)
> return OK;
>   sk->bound_dev_if = idx;
>   return OK;
> or it will be able to bpf_trace_printk() or bpf_perf_event_output()
> to send notification to vrf user space daemon and so on.
> Such checks will run at the same speed as checks inside
> __cgroup_bpf_run_filter_sk(), but all other users won't pay
> for them when not in use. imo it's a win-win.

Eric, does this sound okay to you?  You're the authority on exposing
things like namespace ids to users.

>
> In parallel I'm planning to work on 'disallow override' flag
> for bpf_prog_attach. That should solve remaining concern.
> I still disagree about urgency of implementing it, but
> it's simple enough, so why not now.
>

Can you describe the exact semantics?  I really think there should be
room for adding a mode where all the relevant programs are invoked and
that this should eventually be the default.  It's the obvious behavior
and it has many fewer footguns.

I would propose the following:

 - By default, socket creation and egress filters call all registered
programs, innermost cgroup first.  If any of them return a reject
code, stop processing further filters and reject.  By default, ingress
filters call all registered programs, innermost cgroup first.  If any
of them return a reject code, stop processing further filters and
reject.  This is intended to Do The Right Thing (TM) for both
monitoring systems and for sandboxing.  For monitoring, if you stick
your hooks in the /a cgroup, you presumably want to see all traffic
that actually goes in and out.  For ingress, you want to see what's
really there before it gets further modified by hooks set up by the
monitored application (whic

Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns

2017-01-24 Thread Andy Lutomirski
On Tue, Jan 24, 2017 at 12:29 PM, David Ahern  wrote:
>
> Users do not run around exec'ing commands in random network contexts 
> (namespace, vrf, device, whatever) and expect them to just work.

I worked on some code once (deployed in production, even!) that calls
unshare() to make a netns and creates an interface and runs code in
there and expects it to just work.  It wouldn't work if the outer
program were run under the current ip vrf.

>
>>
>> Maybe you can argue that this is a missing feature in cgroup+bpf (no
>> API to query which netns is in use) and a corresponding bug in 'ip
>> vrf', but I see this as evidence that cgroup+bpf as it exists in 4.10
>> is not carefully enough throught through.  The only non-example user
>> of it that I can find (ip vrf) is buggy and can't really be fixed
>> using mechanisms that exist in 4.10-rc.
>
> The argument is that cgroups and namespaces are completely disjoint 
> infrastructure and that users need to know what they are doing.

But perhaps they should be less disjoint.  As far as I know,
cgroup+bpf is the *only* network configuration that is being set up to
run across all network namespaces.  No one has said why this is okay,
let alone why it's preferable to making it work per-netns just like
basically all other Linux network configuration.

>
>>
>>>
 things up so that unshare will malfunction.  It should avoid
 malfunctioning when running Linux programs that are unaware of it.
>>>
>>> I agree that for VRF use case it will help to make programs netns
>>> aware by adding new bpf_get_current_netns_id() or something helper,
>>> but it's up to the program to function properly or be broken.
>>
>> This will cause David's code to run slower, and I think he wants very
>> high performance.
>
> This is a socket create path not a packet path. While overhead should be 
> contained, a few extra cycles should be fine.
>
> Adding the capability to allow users to check the netns id would offer a 
> solution to the namespace problem, but there is nothing that *requires* a bpf 
> program to do it.
>
> Who's to say an admin does not *want* all processes in a group to have 
> sockets bound to a non-existent device? 'ip vrf' restricts the device index 
> to a VRF device because as a management tool I want it to be user friendly, 
> but generically the BPF code does not have any restrictions. ifindex can be 
> any u32 value.

I was hoping for an actual likely use case for the bpf hooks to be run
in all namespaces.  You're arguing that iproute2 can be made to work
mostly okay if bpf hooks can run in all namespaces, but the use case
of intentionally making sk_bound_dev_if invalid across all namespaces
seems dubious.

But all of what you're suggesting doing would still work fine and
would result in less kernel code *and* less eBPF code if the hooks
were per netns.

--Andy


Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns

2017-01-24 Thread Andy Lutomirski
On Tue, Jan 24, 2017 at 9:48 AM, Alexei Starovoitov
<alexei.starovoi...@gmail.com> wrote:
> On Mon, Jan 23, 2017 at 08:32:02PM -0800, Andy Lutomirski wrote:
>> On Mon, Jan 23, 2017 at 8:05 PM, David Ahern <d...@cumulusnetworks.com> 
>> wrote:
>> > On 1/23/17 8:37 PM, Andy Lutomirski wrote:
>> >> Yes, it is a bug because cgroup+bpf causes unwitting programs to be
>> >> subject to BPF code installed by a different, potentially unrelated
>> >> process.  That's a new situation.  The failure can happen when a
>> >> privileged supervisor (whoever runs ip vrf) runs a clueless or
>> >> unprivileged program (the thing calling unshare()).
>> >
>> > There are many, many ways to misconfigure networking and to run programs 
>> > in a context or with an input argument that causes the program to not work 
>> > at all, not work as expected or stop working. This situation is no 
>> > different.
>> >
>> > For example, the only aspect of BPF_PROG_TYPE_CGROUP_SOCK filters that are 
>> > namespace based is the ifindex. You brought up the example of changing 
>> > namespaces where the ifindex is not defined. Alexei mentioned an example 
>> > where interfaces can be moved to another namespace breaking any ifindex 
>> > based programs. Another example is the interface can be deleted. Deleting 
>> > an interface with sockets bound to it does not impact the program in any 
>> > way - no notification, no wakeup, nothing. The sockets just don't work.
>>
>> And if you use 'ip vrf' to bind to a vrf with ifindex 4 and a program
>> unshares netns and creates an interface with ifindex 4, then that
>> program will end up with its sockets magically bound to ifindex 4 and
>> will silently malfunction.
>>
>> I can think of multiple ways to address this problem.  You could scope
>> the hooks to a netns (which is sort of what my patch does).  You could
>> find a way to force programs in a given cgroup to only execute in a
>> single netns, although that would probably cause other breakage.  You
>> could improve the BPF hook API to be netns-aware, which could plausibly
>> address issues related to unshare() but might get very tricky when
>> setns() is involved.
>
> scoping cgroup to netns will create weird combination of cgroup and netns
> which was never done before. cgroup and netns scopes have to be able
> to overlap in arbitrary way. Application shouldn't not be able to escape
> cgroup scope by changing netns.

I had assumed that too, but I'm no longer at all convinced that this
is a problem.  It's certainly the case that, if you put an application
into a restrictive cgroup, it shouldn't be able to bypass those
restrictions using unshare() or setns().  But I don't think this is
really possible regardless.  The easy way to try to escape is using
unshare(), but this doesn't actually buy you anything.

$ unshare -Urn ip link
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group
default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

Maybe I just escaped the cgroup restrictions on ingress and egress,
but that doesn't seem to matter because I also no longer have any
interfaces that have any traffic.  And, if I created a socket before
unsharing, it's still safe because that socket is bound to the old
netns and would still be subject to the cgroup rules with my patch
applied.

setns() is a bit more complicated, but it should still be fine.
netns_install() requires CAP_NET_ADMIN over the target netns, so you
can only switch in to a netns if you already have privilege in that
netns.
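
For reference, the check in question (paraphrasing netns_install() in
net/core/net_namespace.c as of this era; treat this as a sketch, not a
verbatim quote) looks roughly like:

static int netns_install(struct nsproxy *nsproxy, struct ns_common *ns)
{
	struct net *net = to_net_ns(ns);

	/* Switching into a netns via setns() requires CAP_NET_ADMIN in
	 * the *target* namespace's owning user_ns, plus CAP_SYS_ADMIN
	 * in the caller's own user_ns. */
	if (!ns_capable(net->user_ns, CAP_NET_ADMIN) ||
	    !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
		return -EPERM;

	put_net(nsproxy->net_ns);
	nsproxy->net_ns = get_net(net);
	return 0;
}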

>> My point is that, in 4.10-rc, it doesn't work right, and I doubt this
>> problem is restricted to just 'ip vrf'.  Without some kind of change
>> to the way that netns and cgroup+bpf interact, anything that uses
>> sk_bound_dev_if or reads the ifindex on an skb will be subject to a
>> huge footgun that unprivileged programs can trigger and any future
>> attempt to make the cgroup+bpf work for unprivileged users is going to
>> be more complicated than it deserves to be.
>
> For n-th time, the current BPF_PROG_TYPE_CGROUP* is root only and
> speculation about unprivileged usage are not helping the discussion.

...which has nothing to do with my example of how it's broken.  I used
'ip vrf' the way it was intended to be used and then I ran code in it
that uses only APIs that predate eBPF.  The result was that the
program obtained a broken socket that had sk_bound_dev_if filled in
with a nonsense index.  There's no speculation -- it's just broken.

Maybe you can argue that this is a missing feature in cgroup+bpf (no
API to query which netns is in use) and a corresponding bug in 'ip

Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns

2017-01-23 Thread Andy Lutomirski
On Mon, Jan 23, 2017 at 8:05 PM, David Ahern <d...@cumulusnetworks.com> wrote:
> On 1/23/17 8:37 PM, Andy Lutomirski wrote:
>> Yes, it is a bug because cgroup+bpf causes unwitting programs to be
>> subject to BPF code installed by a different, potentially unrelated
>> process.  That's a new situation.  The failure can happen when a
>> privileged supervisor (whoever runs ip vrf) runs a clueless or
>> unprivileged program (the thing calling unshare()).
>
> There are many, many ways to misconfigure networking and to run programs in a 
> context or with an input argument that causes the program to not work at all, 
> not work as expected or stop working. This situation is no different.
>
> For example, the only aspect of BPF_PROG_TYPE_CGROUP_SOCK filters that are 
> namespace based is the ifindex. You brought up the example of changing 
> namespaces where the ifindex is not defined. Alexei mentioned an example 
> where interfaces can be moved to another namespace breaking any ifindex based 
> programs. Another example is the interface can be deleted. Deleting an 
> interface with sockets bound to it does not impact the program in any way - 
> no notification, no wakeup, nothing. The sockets just don't work.

And if you use 'ip vrf' to bind to a vrf with ifindex 4 and a program
unshares netns and creates an interface with ifindex 4, then that
program will end up with its sockets magically bound to ifindex 4 and
will silently malfunction.

I can think of multiple ways to address this problem.  You could scope
the hooks to a netns (which is sort of what my patch does).  You could
find a way to force programs in a given cgroup to only execute in a
single netns, although that would probably cause other breakage.  You
could improve the BPF hook API to be netns-aware, which could plausibly
address issues related to unshare() but might get very tricky when
setns() is involved.

My point is that, in 4.10-rc, it doesn't work right, and I doubt this
problem is restricted to just 'ip vrf'.  Without some kind of change
to the way that netns and cgroup+bpf interact, anything that uses
sk_bound_dev_if or reads the ifindex on an skb will be subject to a
huge footgun that unprivileged programs can trigger and any future
attempt to make the cgroup+bpf work for unprivileged users is going to
be more complicated than it deserves to be.

(Aside: I wonder if a similar goal to 'ip vrf' could be achieved by
moving the vrf to a different network namespace and putting programs
that you want to be subject to the VRF into that namespace.)

>
> The point of using a management tool like ip is to handle the details to make 
> things sane -- which is the consistent with the whole point of ip netns vs 
> running unshare -n.

I still don't understand how you're going to make 'ip netns' make this
problem go away.  Keep in mind that there are real programs that call
the unshare() syscall directly.

>
>>
>> If the way cgroup+bpf worked was that a program did a prctl() or
>> similar to opt in to being managed by a provided cgroup-aware BPF
>> program, then I'd agree with everything you're saying.  But that's not
>> at all how this code works.
>
> I don't want an opt-in approach. The program does not need to know or even 
> care that it has some restriction. Today, someone runs 'ip netns exec NAME 
> COMMAND' the command does not need to know or care that the network namespace 
> was changed on it. Same with 'ip vrf exec' -- it is a network policy that is 
> forced on the program by the user.

I absolutely agree.  My point is that cgroup+bpf is *not* opt-in, so
the bar should be higher.  You can't blame the evil program that
called unshare() for breaking things when 'ip vrf' unavoidably sets
things up so that unshare will malfunction.  It should avoid
malfunctioning when running Linux programs that are unaware of it.
This should include programs like unshare and 'ip netns'.


Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns

2017-01-23 Thread Andy Lutomirski
On Mon, Jan 23, 2017 at 8:10 PM, David Ahern <d...@cumulusnetworks.com> wrote:
> On 1/23/17 7:39 PM, Andy Lutomirski wrote:
>> I'm not sure I followed what you meant.  If I understood right (which
>> is a big if) you're saying that ip vrf when run in a netns works
>> correctly in that netns.  I agree, but I think that it would continue
>> to work (even more reliably) if the hooks only executed in the netns
>> in which they were created.  I think that would probably be a good
>> improvement, and it could be done on top of my patch, but I'm not
>> about to write that code for 4.10.
>
> Sounds like an efficiency on the implementation that does not require 
> limiting the code today to just init namespace.
>

It's an ABI change, not an implementation change.  If someone wants to
make the code fully netns-aware in time for 4.10, that sounds great.
I'm not going to do that.


Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns

2017-01-23 Thread Andy Lutomirski
On Mon, Jan 23, 2017 at 7:13 PM, Alexei Starovoitov
<alexei.starovoi...@gmail.com> wrote:
> On Mon, Jan 23, 2017 at 06:42:27PM -0800, Andy Lutomirski wrote:
>> Please explain how the change results in a broken ABI and how the
>> current ABI is better.  I gave a fully worked out example of how the
>> current ABI misbehaves using only standard Linux networking tools.
>
> I gave an example in the other thread that demonstrated
> cgroup escape by the appliction when it changes netns.
> If application became part of cgroup, it has to stay there,
> no matter setns() calls afterwards.

The other thread is out of control.  Can you restate your example?
Please keep in mind that unprivileged programs can unshare their netns.

>
>> >
>> >> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> >> index e89acea22ecf..c0bbc55e244d 100644
>> >> --- a/kernel/bpf/syscall.c
>> >> +++ b/kernel/bpf/syscall.c
>> >> @@ -902,6 +902,17 @@ static int bpf_prog_attach(const union bpf_attr 
>> >> *attr)
>> >>   struct cgroup *cgrp;
>> >>   enum bpf_prog_type ptype;
>> >>
>> >> + /*
>> >> +  * For now, socket bpf hooks attached to cgroups can only be
>> >> +  * installed in the init netns and only affect the init netns.
>> >> +  * This could be relaxed in the future once some semantic issues
>> >> +  * are resolved.  For example, ifindexes belonging to one netns
>> >> +  * should probably not be visible to hooks installed by programs
>> >> +  * running in a different netns.
>> >
>> > the comment is bogus and shows complete lack of understanding
>> > how bpf is working and how it's being used.
>> > See SKF_AD_IFINDEX that was in classic bpf forever.
>> >
>>
>> I think I disagree completely.
>>
>> With classic BPF, a program creates a socket and binds a bpf program
>> to it.  That program can see the ifindex, which is fine because that
>> ifindex is scoped to the socket's netns, which is (unless the caller
>> uses setns() or unshare()) the same as the caller's netns, so the
>> ifindex means exactly what the caller thinks it means.
>>
>> With cgroup+bpf, the program installing the hook can be completely
>> unrelated to the program whose sockets are subject to the hook, and,
>> if they're using different namespaces, it breaks.
>
> Please also see ingress_ifindex, ifindex, bpf_redirect(), bpf_clone_redirect()
> that all operate on the ifindex while the program can be detached from
> the application. In general bpf program and application that loaded and
> attached it are mostly disjoint in case of networking. We have tail_call
> mechanism and master bpf prog may call programs installed by other apps
> that may have exited already.

If program A acquires a BPF object from program B where program B runs
in a different netns from program A, and program A uses or tail-calls
that BPF object, then A is either doing it intentionally and is
netns-aware or it is terminally buggy and deserves what it gets.

> cls_bpf is scoped by netdev that belongs to some netns.
> If after attaching a program with hardcoded if(ifindex==3) check
> to such netdev, this netdev is moved into different netns, this 'if' check
> and the program become bogus and won't do what program author
> expected. It is expected behavior. The same thing with current 'ip vrf'
> implementation that Dave is adjusting. It's not a bug in cgroup+bpf behavior.
>

Yes, it is a bug because cgroup+bpf causes unwitting programs to be
subject to BPF code installed by a different, potentially unrelated
process.  That's a new situation.  The failure can happen when a
privileged supervisor (whoever runs ip vrf) runs a clueless or
unprivileged program (the thing calling unshare()).

If the way cgroup+bpf worked was that a program did a prctl() or
similar to opt in to being managed by a provided cgroup-aware BPF
program, then I'd agree with everything you're saying.  But that's not
at all how this code works.


Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns

2017-01-23 Thread Andy Lutomirski
On Mon, Jan 23, 2017 at 6:09 PM, Alexei Starovoitov
<alexei.starovoi...@gmail.com> wrote:
> On Mon, Jan 23, 2017 at 12:36:08PM -0800, Andy Lutomirski wrote:
>> To see how cgroup+bpf interacts with network namespaces, I wrote a
>> little program called show_bind that calls getsockopt(...,
>> SO_BINDTODEVICE, ...) and prints the result.  It did this:
>>
>>  # ./ip link add dev vrf0 type vrf table 10
>>  # ./ip vrf exec vrf0 ./show_bind
>>  Default binding is "vrf0"
>>  # ./ip vrf exec vrf0 unshare -n ./show_bind
>>  show_bind: getsockopt: No such device
>>
>> What's happening here is that "ip vrf" looks up vrf0's ifindex in
>> the init netns and installs a hook that binds sockets to that
>> ifindex.  When the hook runs in a different netns, it sets
>> sk_bound_dev_if to an ifindex from the wrong netns, resulting in
>> incorrect behavior.  In this particular example, the ifindex was 4
>> and there was no ifindex 4 in the new netns.  If there had been,
>> this test would have malfunctioned differently.
>>
>> Since it's rather late in the release cycle, let's punt.  This patch
>> makes it impossible to install cgroup+bpf hooks outside the init
>> netns and makes them not run on sockets that aren't in the init
>> netns.
>>
>> In a future release, it should be relatively straightforward to make
>> these hooks be local to a netns and, if needed, to add a flag so
>> that hooks can be made global if necessary.  Global hooks should
>> presumably be constrained so that they can't write to any ifindex
>> fields.
>>
>> Cc: Daniel Borkmann <dan...@iogearbox.net>
>> Cc: Alexei Starovoitov <a...@kernel.org>
>> Cc: David Ahern <d...@cumulusnetworks.com>
>> Signed-off-by: Andy Lutomirski <l...@kernel.org>
>> ---
>>
>> DaveM, this mitigates a bug in a feature that's new in 4.10, and the
>> bug can be hit using current iproute2 -git.  please consider this for
>> -net.
>>
>> Changes from v1:
>>  - Fix the commit message.  'git commit' was very clever and thought that
>>all the interesting bits of the test case were intended to be comments
>>and stripped them.  Whoops!
>>
>>  kernel/bpf/cgroup.c  | 21 +++++++++++++++++++++
>>  kernel/bpf/syscall.c | 11 +++++++++++
>>  2 files changed, 32 insertions(+)
>>
>> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
>> index a515f7b007c6..a824f543de69 100644
>> --- a/kernel/bpf/cgroup.c
>> +++ b/kernel/bpf/cgroup.c
>> @@ -143,6 +143,17 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
>>   if (!sk || !sk_fullsock(sk))
>>   return 0;
>>
>> + /*
>> +  * For now, socket bpf hooks attached to cgroups can only be
>> +  * installed in the init netns and only affect the init netns.
>> +  * This could be relaxed in the future once some semantic issues
>> +  * are resolved.  For example, ifindexes belonging to one netns
>> +  * should probably not be visible to hooks installed by programs
>> +  * running in a different netns.
>> +  */
>> + if (sock_net(sk) != &init_net)
>> + return 0;
>
> Nack.
> Such check will create absolutely broken abi that we won't be able to fix 
> later.

Please explain how the change results in a broken ABI and how the
current ABI is better.  I gave a fully worked out example of how the
current ABI misbehaves using only standard Linux networking tools.

>
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index e89acea22ecf..c0bbc55e244d 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -902,6 +902,17 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>>   struct cgroup *cgrp;
>>   enum bpf_prog_type ptype;
>>
>> + /*
>> +  * For now, socket bpf hooks attached to cgroups can only be
>> +  * installed in the init netns and only affect the init netns.
>> +  * This could be relaxed in the future once some semantic issues
>> +  * are resolved.  For example, ifindexes belonging to one netns
>> +  * should probably not be visible to hooks installed by programs
>> +  * running in a different netns.
>
> the comment is bogus and shows complete lack of understanding
> how bpf is working and how it's being used.
> See SKF_AD_IFINDEX that was in classic bpf forever.
>

I think I disagree completely.

With classic BPF, a program creates a socket and binds a bpf program
to it.  That program can see the ifindex, which is fine because that
ifindex is scoped to the socket's netns, which is (unless the caller
uses setns() or unshare()) the same as the caller's netns, so the
ifindex means exactly what the caller thinks it means.
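
To make that concrete, here is a sketch of the classic-BPF case
(attach_ifindex_filter is a made-up helper name): a filter that accepts
only packets arriving on a given ifindex, where the ifindex is resolved
in the netns of whoever created the socket.

#include <linux/filter.h>
#include <linux/types.h>
#include <sys/socket.h>

/* Sketch: classic BPF filter accepting packets only from `ifindex`,
 * attached via SO_ATTACH_FILTER.  The caller created the socket, so
 * the ifindex is interpreted in the caller's own netns. */
static int attach_ifindex_filter(int fd, __u32 ifindex)
{
	struct sock_filter code[] = {
		/* load skb->dev->ifindex via the BPF ancillary area */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 SKF_AD_OFF + SKF_AD_IFINDEX),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, ifindex, 0, 1),
		BPF_STMT(BPF_RET | BPF_K, 0xffff),	/* match: accept */
		BPF_STMT(BPF_RET | BPF_K, 0),		/* no match: drop */
	};
	struct sock_fprog prog = { .len = 4, .filter = code };

	return setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER,
			  &prog, sizeof(prog));
}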

With cgroup+bpf, the program installing the hook can be completely
unrelated to the program whose sockets are subject to the hook, and,
if they're using different namespaces, it breaks.

--Andy


Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns

2017-01-23 Thread Andy Lutomirski
On Mon, Jan 23, 2017 at 6:31 PM, David Ahern  wrote:
> On 1/23/17 7:09 PM, Alexei Starovoitov wrote:
>>> + */
>>> +if (current->nsproxy->net_ns != &init_net)
>>> +return -EINVAL;
>>
>> this restriction I actually don't mind, since it indeed can be
>> relaxed later, but please make it proper with net_eq()
>>
>
> I do mind. Why have different restrictions for the skb and sk filters?
>
> I have already shown that ip can alleviate the management aspects of the 
> implementation -- just like ip netns does.

I'm not sure I followed what you meant.  If I understood right (which
is a big if) you're saying that ip vrf when run in a netns works
correctly in that netns.  I agree, but I think that it would continue
to work (even more reliably) if the hooks only executed in the netns
in which they were created.  I think that would probably be a good
improvement, and it could be done on top of my patch, but I'm not
about to write that code for 4.10.

--Andy


Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns

2017-01-23 Thread Andy Lutomirski
On Mon, Jan 23, 2017 at 1:03 PM, David Ahern <d...@cumulusnetworks.com> wrote:
> On 1/23/17 1:36 PM, Andy Lutomirski wrote:
>> To see how cgroup+bpf interacts with network namespaces, I wrote a
>> little program called show_bind that calls getsockopt(...,
>> SO_BINDTODEVICE, ...) and prints the result.  It did this:
>>
>>  # ./ip link add dev vrf0 type vrf table 10
>>  # ./ip vrf exec vrf0 ./show_bind
>>  Default binding is "vrf0"
>>  # ./ip vrf exec vrf0 unshare -n ./show_bind
>>  show_bind: getsockopt: No such device
>>
>> What's happening here is that "ip vrf" looks up vrf0's ifindex in
>> the init netns and installs a hook that binds sockets to that
>
> It looks up the device name in the current namespace.
>
>> ifindex.  When the hook runs in a different netns, it sets
>> sk_bound_dev_if to an ifindex from the wrong netns, resulting in
>> incorrect behavior.  In this particular example, the ifindex was 4
>> and there was no ifindex 4 in the new netns.  If there had been,
>> this test would have malfunctioned differently.
>
> While the cgroups and network namespace interaction needs improvement, a 
> management tool can workaround the deficiencies:
>
> A shell in the default namespace, mgmt vrf (PS1 tells me the network context):
> dsa@kenny:mgmt:~$
>
> Switch to a different namespace (one that I run VMs for network testing):
> dsa@kenny:mgmt:~$ sudo ip netns exec vms su - dsa
>
> And then bind the shell to vrf2
> dsa@kenny:vms:~$ sudo ip vrf exec vrf2 su - dsa
> dsa@kenny:vms:vrf2:~$
>
> Or I can go straight to vrf2:
> dsa@kenny:mgmt:~$ sudo ip netns exec vms ip vrf exec vrf2 su - dsa
> dsa@kenny:vms:vrf2:~$

Indeed, if you're careful to set up the vrf cgroup in the same netns
that you end up using it in, it'll work.  But there's a bigger footgun
there than I think is warranted, and I'm not sure how iproute2 is
supposed to do all that much better given that the eBPF program can
neither see what namespace a socket is bound to nor act in a
way that works correctly in any namespace.

Long-term, I think the real fix is to make the hook work on a
per-netns basis and, if needed, add an interface for a cross-netns
hook to work sensibly.  But I think it's a bit late to do that for
4.10, so instead I'm proposing to limit the API to the case where it
works and the semantics are unambiguous and to leave further
improvements for later.
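
A minimal sketch of what the per-netns variant might look like on the
kernel side (all names here are invented for illustration, not real
4.10 code): record the installer's netns at attach time and skip
sockets that live elsewhere.

/* Hypothetical per-netns hook state: */
struct cgroup_bpf_hook {
	struct bpf_prog *prog;
	struct net *net;	/* netns the hook was installed from */
};

static int run_sk_hook(struct cgroup_bpf_hook *hook, struct sock *sk)
{
	if (!net_eq(sock_net(sk), hook->net))
		return 0;	/* hook is scoped to a different netns */
	return BPF_PROG_RUN(hook->prog, sk);
}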

It's a bit unfortunate that there seems to be an impedance mismatch in
that "ip vrf" acts on cgroups and that cgroups are somewhat orthogonal
to network namespaces.

>
>
> I am testing additional iproute2 cleanups which will be sent before 4.10 is 
> released.
>
> -8<-
>
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index e89acea22ecf..c0bbc55e244d 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -902,6 +902,17 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>>   struct cgroup *cgrp;
>>   enum bpf_prog_type ptype;
>>
>> + /*
>> +  * For now, socket bpf hooks attached to cgroups can only be
>> +  * installed in the init netns and only affect the init netns.
>> +  * This could be relaxed in the future once some semantic issues
>> +  * are resolved.  For example, ifindexes belonging to one netns
>> +  * should probably not be visible to hooks installed by programs
>> +  * running in a different netns.
>> +  */
>> + if (current->nsproxy->net_ns != &init_net)
>> + return -EINVAL;
>> +
>>   if (!capable(CAP_NET_ADMIN))
>>   return -EPERM;
>>
>
> But should this patch be taken, shouldn't the EPERM outrank the namespace 
> check?
>

I could see that going either way.  If the hook becomes per-netns,
then the capable() check could potentially become ns_capable() and it
would start succeeding.  I'd be happy to change it, though.
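
Concretely, the relaxation would be something like this (hypothetical,
assuming hooks had been made per-netns first):

	/* If hooks were scoped to the installer's netns, the attach-time
	 * check could move from capable() to ns_capable() against the
	 * user_ns that owns that netns. */
	struct net *net = current->nsproxy->net_ns;

	if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
		return -EPERM;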

--Andy
-- 
Andy Lutomirski
AMA Capital Management, LLC


[PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns

2017-01-23 Thread Andy Lutomirski
To see how cgroup+bpf interacts with network namespaces, I wrote a
little program called show_bind that calls getsockopt(...,
SO_BINDTODEVICE, ...) and prints the result.  It did this:

 # ./ip link add dev vrf0 type vrf table 10
 # ./ip vrf exec vrf0 ./show_bind
 Default binding is "vrf0"
 # ./ip vrf exec vrf0 unshare -n ./show_bind
 show_bind: getsockopt: No such device

What's happening here is that "ip vrf" looks up vrf0's ifindex in
the init netns and installs a hook that binds sockets to that
ifindex.  When the hook runs in a different netns, it sets
sk_bound_dev_if to an ifindex from the wrong netns, resulting in
incorrect behavior.  In this particular example, the ifindex was 4
and there was no ifindex 4 in the new netns.  If there had been,
this test would have malfunctioned differently.

Since it's rather late in the release cycle, let's punt.  This patch
makes it impossible to install cgroup+bpf hooks outside the init
netns and makes them not run on sockets that aren't in the init
netns.

In a future release, it should be relatively straightforward to make
these hooks be local to a netns and, if needed, to add a flag so
that hooks can be made global if necessary.  Global hooks should
presumably be constrained so that they can't write to any ifindex
fields.

Cc: Daniel Borkmann <dan...@iogearbox.net>
Cc: Alexei Starovoitov <a...@kernel.org>
Cc: David Ahern <d...@cumulusnetworks.com>
Signed-off-by: Andy Lutomirski <l...@kernel.org>
---

DaveM, this mitigates a bug in a feature that's new in 4.10, and the
bug can be hit using current iproute2 -git.  please consider this for
-net.

Changes from v1:
 - Fix the commit message.  'git commit' was very clever and thought that
   all the interesting bits of the test case were intended to be comments
   and stripped them.  Whoops!

 kernel/bpf/cgroup.c  | 21 +++++++++++++++++++++
 kernel/bpf/syscall.c | 11 +++++++++++
 2 files changed, 32 insertions(+)

diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index a515f7b007c6..a824f543de69 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -143,6 +143,17 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
if (!sk || !sk_fullsock(sk))
return 0;
 
+   /*
+* For now, socket bpf hooks attached to cgroups can only be
+* installed in the init netns and only affect the init netns.
+* This could be relaxed in the future once some semantic issues
+* are resolved.  For example, ifindexes belonging to one netns
+* should probably not be visible to hooks installed by programs
+* running in a different netns.
+*/
+   if (sock_net(sk) != &init_net)
+   return 0;
+
if (sk->sk_family != AF_INET &&
sk->sk_family != AF_INET6)
return 0;
@@ -186,6 +197,16 @@ int __cgroup_bpf_run_filter_sk(struct sock *sk,
struct bpf_prog *prog;
int ret = 0;
 
+   /*
+* For now, socket bpf hooks attached to cgroups can only be
+* installed in the init netns and only affect the init netns.
+* This could be relaxed in the future once some semantic issues
+* are resolved.  For example, ifindexes belonging to one netns
+* should probably not be visible to hooks installed by programs
+* running in a different netns.
+*/
+   if (sock_net(sk) != &init_net)
+   return 0;
 
rcu_read_lock();
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index e89acea22ecf..c0bbc55e244d 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -902,6 +902,17 @@ static int bpf_prog_attach(const union bpf_attr *attr)
struct cgroup *cgrp;
enum bpf_prog_type ptype;
 
+   /*
+* For now, socket bpf hooks attached to cgroups can only be
+* installed in the init netns and only affect the init netns.
+* This could be relaxed in the future once some semantic issues
+* are resolved.  For example, ifindexes belonging to one netns
+* should probably not be visible to hooks installed by programs
+* running in a different netns.
+*/
+   if (current->nsproxy->net_ns != &init_net)
+   return -EINVAL;
+
if (!capable(CAP_NET_ADMIN))
return -EPERM;
 
-- 
2.9.3



[PATCH] bpf: Restrict cgroup bpf hooks to the init netns

2017-01-23 Thread Andy Lutomirski
To see how cgroup+bpf interacts with network namespaces, I wrote a
little program called show_bind that calls getsockopt(...,
SO_BINDTODEVICE, ...) and prints the result.  It did this:

Default binding is "vrf0"
show_bind: getsockopt: No such device

What's happening here is that "ip vrf" looks up vrf0's ifindex in
the init netns and installs a hook that binds sockets to that
ifindex.  When the hook runs in a different netns, it sets
sk_bound_dev_if to an ifindex from the wrong netns, resulting in
incorrect behavior.  In this particular example, the ifindex was 4
and there was no ifindex 4 in the new netns.  If there had been,
this test would have malfunctioned differently.

Since it's rather late in the release cycle, let's punt.  This patch
makes it impossible to install cgroup+bpf hooks outside the init
netns and makes them not run on sockets that aren't in the init
netns.

In a future release, it should be relatively straightforward to make
these hooks be local to a netns and, if needed, to add a flag so
that hooks can be made global if necessary.  Global hooks should
presumably be constrained so that they can't write to any ifindex
fields.

Cc: Daniel Borkmann <dan...@iogearbox.net>
Cc: Alexei Starovoitov <a...@kernel.org>
Cc: David Ahern <d...@cumulusnetworks.com>
Signed-off-by: Andy Lutomirski <l...@kernel.org>
---

DaveM, this mitigates a bug in a feature that's new in 4.10, and the
bug can be hit using current iproute2 -git.  please consider this for
-net.

 kernel/bpf/cgroup.c  | 21 +++++++++++++++++++++
 kernel/bpf/syscall.c | 11 +++++++++++
 2 files changed, 32 insertions(+)

diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index a515f7b007c6..a824f543de69 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -143,6 +143,17 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
if (!sk || !sk_fullsock(sk))
return 0;
 
+   /*
+* For now, socket bpf hooks attached to cgroups can only be
+* installed in the init netns and only affect the init netns.
+* This could be relaxed in the future once some semantic issues
+* are resolved.  For example, ifindexes belonging to one netns
+* should probably not be visible to hooks installed by programs
+* running in a different netns.
+*/
+   if (sock_net(sk) != &init_net)
+   return 0;
+
if (sk->sk_family != AF_INET &&
sk->sk_family != AF_INET6)
return 0;
@@ -186,6 +197,16 @@ int __cgroup_bpf_run_filter_sk(struct sock *sk,
struct bpf_prog *prog;
int ret = 0;
 
+   /*
+* For now, socket bpf hooks attached to cgroups can only be
+* installed in the init netns and only affect the init netns.
+* This could be relaxed in the future once some semantic issues
+* are resolved.  For example, ifindexes belonging to one netns
+* should probably not be visible to hooks installed by programs
+* running in a different netns.
+*/
+   if (sock_net(sk) != &init_net)
+   return 0;
 
rcu_read_lock();
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index e89acea22ecf..c0bbc55e244d 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -902,6 +902,17 @@ static int bpf_prog_attach(const union bpf_attr *attr)
struct cgroup *cgrp;
enum bpf_prog_type ptype;
 
+   /*
+* For now, socket bpf hooks attached to cgroups can only be
+* installed in the init netns and only affect the init netns.
+* This could be relaxed in the future once some semantic issues
+* are resolved.  For example, ifindexes belonging to one netns
+* should probably not be visible to hooks installed by programs
+* running in a different netns.
+*/
+   if (current->nsproxy->net_ns != &init_net)
+   return -EINVAL;
+
if (!capable(CAP_NET_ADMIN))
return -EPERM;
 
-- 
2.9.3



Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2017-01-23 Thread Andy Lutomirski
On Sun, Jan 22, 2017 at 8:31 PM, Alexei Starovoitov
<alexei.starovoi...@gmail.com> wrote:
> On Thu, Jan 19, 2017 at 08:04:59PM -0800, Andy Lutomirski wrote:
>> On Thu, Jan 19, 2017 at 6:39 PM, Alexei Starovoitov
>> <alexei.starovoi...@gmail.com> wrote:
>> > On Wed, Jan 18, 2017 at 06:29:22PM -0800, Andy Lutomirski wrote:
>> >> I think it could work by making a single socket cgroup controller that
>> >> handles all cgroup things that are bound to a socket.  Using
>> >
>> > Such 'socket cgroup controller' would limit usability of the feature
>> > to sockets and force all other use cases like landlock to invent
>> > their own wheel, which is undesirable. Everyone will be
>> > inventing new 'foo cgroup controller', while all of them
>> > are really bpf features. They are different bpf program
>> > types that attach to different hooks and use cgroup for scoping.
>>
>> Can you elaborate on why that would be a problem?  In a cgroup v1
>> world, users who want different hierarchies for different types of
>> control could easily want one hierarchy for socket hooks and a
>> different hierarchy for lsm hooks.  In a cgroup v2 delegation world, I
>> could easily imagine the decision to delegate socket hooks being
>> different from the decision to delegate lsm hooks.  Almost all of the
>> code would be shared between different bpf-using cgroup controllers.
>
> how do you think it can be enforced when directory is chowned?

A combination of delegation policy (a la subtree_control in the
parent) and MAC policy.  I'm quite confident that apparmor can apply
policy to cgroupfs and I'm reasonably confident (although slightly
less so) that selinux can as well.  The docs suck :(

But if there's only one file in there to apply MAC policy to, the
ability to differentiate between subsystems is quite limited.   In the
current API, there are no files to apply policy to, so it won't work
at all.

>
> ... and open() of the directory done by the current api will preserve
> cgroup delegation when and only when bpf_prog_type_cgroup_*
> becomes unprivileged.
> I'm not proposing creating new api here.

I don't know what you mean.  open() of the directory can't check
anything useful because it has to work for programs that just want to
read the directory.

>> Meanwhile, cgroup+bpf actually appears to be buggy in this regard even
>> regardless of what semantics you think are better.  sk_bound_dev_if is
>> exposed as a u32 value, but sk_bound_dev_if only has meaning within a
>> given netns.  The "ip vrf" stuff will straight-up malfunction if a
>> process affected by its hook runs in a different netns from the netns
>> that "ip vrf" was run in.
>
> how is that any different from normal 'ip netns exec'?
> that is expected user behavior.

# ./ip link add dev vrf0 type vrf table 10
# ./ip vrf exec vrf0 ./show_bind
Default binding is "vrf0"
# ./ip vrf exec vrf0 unshare -n ./show_bind
show_bind: getsockopt: No such device

The expected behavior to me is that ip netns exec (or equivalently
unshare -n) gives a clean slate.  Actually malfunctioning (which this
example using the latest iproute2 and linux just did) is not expected.

I'm done with this part of this thread and I'm sending a patch.

>
>> restricting the types of sockets that can be created, then you do want
>> the filter to work across namespaces, but seccomp can do that too and
>> the current code doesn't handle netns correctly.
>
> are you saying that seccomp supports netns filtering? please show the proof.

It can trivially restrict the types of sockets that are created by
filtering on socket(2) syscall parameters, at least on sane
architectures that don't use socketcall().
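
For illustration, a minimal seccomp sketch (x86-64 only; a production
filter must also verify seccomp_data.arch before trusting the syscall
number) that only permits AF_INET/AF_INET6 sockets:

#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/socket.h>
#include <sys/syscall.h>

/* Sketch: allow socket(2) only for AF_INET/AF_INET6; kill the task
 * otherwise.  args[0] is loaded as its low 32 bits, which is fine for
 * the address family on little-endian x86-64. */
static int install_socket_family_filter(void)
{
	struct sock_filter f[] = {
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		/* not socket(2)? jump to ALLOW */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_socket, 0, 4),
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, args[0])),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AF_INET, 2, 0),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AF_INET6, 1, 0),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(f) / sizeof(f[0]),
		.filter = f,
	};

	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
		return -1;
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}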

> To summarize, I think the 'prog override is not allowed' flag would be
> ok feature to have and I wouldn't mind making it the default when no 'flag'
> field is passed to bpf syscall, but it's not acceptable to remove current
> 'prog override is allowed' behavior.
> So if you insist on changing the default, please implement the flag asap.
> Though from security point of view and ABI point of view there is absolutely
> no difference whether this flag is added today or a year later while
> the default is kept as-is.

It's too late and I have too little time.  I'll try to write a patch
to change the default to just deny attempts to override.  Better
behavior can be added later.

IMO your suggestions about priorities are overcomplicated.  For your
instrumentation needs, I can imagine that a simple "make this hook not
run if a descendent has a hook" would do it.  For everything else, run
them all in tree order (depending on filter type) and let the eBPF
code do whatever it wants to do.
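
In pseudo-kernel form (sock_cgroup() and the prog[] array are invented
names, just to show the shape), "run them all in tree order" amounts
to:

/* Sketch: walk from the socket's cgroup up to the root, innermost
 * hook first, and stop at the first program that rejects. */
static int run_filters_in_tree_order(struct sock *sk,
				     enum bpf_attach_type type)
{
	struct cgroup *cgrp;
	struct bpf_prog *prog;

	for (cgrp = sock_cgroup(sk); cgrp; cgrp = cgroup_parent(cgrp)) {
		prog = rcu_dereference(cgrp->bpf.prog[type]);
		if (prog && !BPF_PROG_RUN(prog, sk))
			return -EPERM;	/* innermost reject wins */
	}
	return 0;	/* every hook (if any) accepted */
}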


Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2017-01-19 Thread Andy Lutomirski
On Thu, Jan 19, 2017 at 6:39 PM, Alexei Starovoitov
<alexei.starovoi...@gmail.com> wrote:
> On Wed, Jan 18, 2017 at 06:29:22PM -0800, Andy Lutomirski wrote:
>> I think it could work by making a single socket cgroup controller that
>> handles all cgroup things that are bound to a socket.  Using
>
> Such 'socket cgroup controller' would limit usability of the feature
> to sockets and force all other use cases like landlock to invent
> their own wheel, which is undesirable. Everyone will be
> inventing new 'foo cgroup controller', while all of them
> are really bpf features. They are different bpf program
> types that attach to different hooks and use cgroup for scoping.

Can you elaborate on why that would be a problem?  In a cgroup v1
world, users who want different hierarchies for different types of
control could easily want one hierarchy for socket hooks and a
different hierarchy for lsm hooks.  In a cgroup v2 delegation world, I
could easily imagine the decision to delegate socket hooks being
different from the decision to delegate lsm hooks.  Almost all of the
code would be shared between different bpf-using cgroup controllers.

>
>> Having thought about this some more, I think that making it would
>> alleviate a bunch of my concerns, as it would make the semantics if
>> the capable() check were relaxed to ns_capable() be sane.  Here's what
>
> here we're on the same page. For any meaningful discussion about
> 'bpf cgroup controller' to happen bpf itself needs to become
> delegatable in cgroup sense. In other words BPF_PROG_TYPE_CGROUP*
> program types need to become available for unprivileged users.
> The only unprivileged prog type today is BPF_PROG_TYPE_SOCKET_FILTER.
> To make it secure we severely limited its functionality.
> All bpf advances since then (like new map types and verifier extensions)
> were done for root only. If early on the priv vs unpriv bpf features
> were 80/20. Now it's close to 95/5. No work has been done to
> make socket filter type more powerful. It still has to use
> slow-ish ld_abs skb access while tc/xdp have direct packet access.
> Things like register value tracking is root only as well and so on
> and so forth.
> We cannot just flip the switch and allow type_cgroup* to unpriv
> and I don't see any volunteers willing to do this work.
> Until that happens there is no point coming up with designs
> for 'cgroup bpf controller'... whatever that means.

Sure there is.  If delegation can be turned on without changing the
API, then the result will be easier to work with and have fewer
compatibility issues.

>
>> I currently should happen before bpf+cgroup is enabled in a release:
>>
>> 1. Make it netns-aware.  This could be as simple as making it only
>> work in the root netns because then real netns awareness can be added
>> later without breaking anything.  The current situation is bad in that
>> network namespaces are just ignored and it's plausible that people
>> will start writing user code that depends on having network namespaces
>> be ignored.
>
> nothing in bpf today is netns-aware and frankly I don't see
> how cgroup+bpf has anything to do with netns.
> For regular sockets+bpf we don't check netns.
> When tcpdump opens raw socket and attaches bpf there are no netns
> checks, since socket itself gives a scope for the program to run.
> Same thing applies to cgroup+bpf. cgroup gives a scope for the program.
> But, say, we indeed add 'if !root ns' check to BPF_CGROUP_INET_*
> hooks.


Here I completely disagree with you.  tcpdump sees packets in its
network namespace.  Regular sockets apply bpf filters to the packets
seen by that socket, and the socket itself is scoped to a netns.

Meanwhile, cgroup+bpf actually appears to be buggy in this regard even
regardless of what semantics you think are better.  sk_bound_dev_if is
exposed as a u32 value, but sk_bound_dev_if only has meaning within a
given netns.  The "ip vrf" stuff will straight-up malfunction if a
process affected by its hook runs in a different netns from the netns
that "ip vrf" was run in.

IOW, the current code is buggy.

> Then if the hooks are used for security, the process
> only needs to do setns() to escape security sandbox. Obviously
> broken semantics.

This could go both ways.  If the goal is to filter packets, then it's
not really important to have the filter keep working if the sandboxed
task unshares netns -- in the new netns, there isn't any access to the
network at all.  If the goal is to reduce attack surface by
restricting the types of sockets that can be created, then you do want
the filter to work across namespaces, but seccomp can do that too and
the current code doesn't handle netns correctly.

>
>> 2. Make it inherit properly.  Inner cgroups should not overri

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2017-01-18 Thread Andy Lutomirski
On Wed, Jan 18, 2017 at 4:59 PM, Tejun Heo <t...@kernel.org> wrote:
> Hello, Andy.
>
> On Wed, Jan 18, 2017 at 04:18:04PM -0800, Andy Lutomirski wrote:
>> To map cgroup -> hook, a simple array in each cgroup structure works.
>> To map (cgroup, netns) -> hook function, the natural approach would be
>> to have some kind of hash, and that would be slower.  This code is
>> intended to be *fast*.
>
> Updating these programs isn't a frequent operation.  We can easily
> calculate a perfect (or acceptable) hash per-cgroup and rcu swap the
> new hashtable.
>

Fair enough.

>
>> I think it's currently a broken cgroup feature.  I think it could be
>> made into a properly working cgroup feature (which I favor) or a
>> properly working net feature.  Can you articulate why you prefer
>> having it be a net feature instead of a cgroup feature?  I tend to
>
> I thought that's what I was doing in the previous message.
>
>> favor making it be a working cgroup feature (by giving it a file or
>> files in cgroupfs and maybe even a controller) because the result
>> would have very obvious semantics.
>
> I'm fine with both directions but one seems far out.

I thought you were saying why you thought it wasn't a cgroup feature,
but I'm not sure I understand why you thought it shouldn't be a cgroup
feature.

When I chatted with Alexei, I had the impression that you and he had
wanted it to be a cgroup controller but thought it wouldn't work well.
I think it could work by making a single socket cgroup controller that
handles all cgroup things that are bound to a socket.  Using
individual files would play nicer with cgroup delegation within a
single netns.

>
>> But maybe this is just a false dichotomy.  Could this feature be a
>> per-netns configuration where you can give instructions like "run this
>> hook if you're in such-and-such netns and in a given cgroup or one of
>> its descendents"?  This would prevent it from being a direct analogue
>> to systemd's RestrictAddressFamilies= option, but that may be okay.
>> This would side-step issues in the current code where a hook can't
>> rely on ifindex meaning what the hook thinks it means.
>
> How is this different from making the current code netns aware?

The descendants part is important.

>
>> Actually, I think I like that approach.  What if it we had a "socket"
>> controller and files like "socket.create_socket", "socket.ingress" and
>> "socket.egress"?  You could use the bpf() syscall to install a bpf
>> filter into "socket.egress" and that filter would filter egress for
>> the target cgroup and its descendents on the current netns.  As a
>> first pass, the netns issue could be sidestepped by making it only
>> work in the root netns (i.e. the bpf() call would fail if you're not
>> in the root netns and the hooks wouldn't run if you're not in the root
>> netns.)  It *might* even be possible to retrofit in the socket
>> controller by saying that the default hierarchy is used if the socket
>> controller isn't mounted.
>
> I don't know.  You're throwing out too many ideas too fast and it's
> difficult to keep up with what the differences are between those
> ideas.  But if we're doing cgroup controllers, shouldn't cgroup ns
> support be sufficient?  We can consider the product of cgroup and net
> namespaces but that doesn't seem necessary given that people usually
> set up these namespaces in conjunction.

Fair enough.  See way below.

>
>> What *isn't* possible to cleanly fix after the fact is the current
>> semantics that cgroup hooks override the hooks in their ancestors.
>> IMO that is simply broken.  The example you gave (perf_event) is very
>> careful not to have this problem.
>
> That's like saying installing an iptables rule for a more specific
> target is broken.  As a cgroup controller, it is not an acceptable
> behavior given how delegation works.  As something similar to
> iptables, it is completely fine.

Even the xt_cgroup code checks cgroup_is_descendant().

>
>> > * My impression with bpf is that while delegation is something
>> >   theoretically possible it is not something which is gonna take place
>> >   in any immediate time frame.  If I'm wrong on this, please feel free
>> >   to correct me.
>>
>> But the issue isn't *BPF* delegation.  It's cgroup delegation or netns
>> creation, both of which exist today.
>
> No, the issue is bpf delegation.  If bpf were fully delegatable in
> practical sense, we could just do straight forward cgroup bpf
> controller.  Well, we'll have to think about how to chain the programs
> which woul
