Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-05-02 Thread Jesper Dangaard Brouer

On Tue, 6 Mar 2018 15:42:41 -0800 Chris Mason  wrote:
> On 6 Mar 2018, at 11:12, Linus Torvalds wrote:
> 
[...]
> >
> > I do *not* want this to be a magical way to hide things.  
> 
> Especially early on, this makes a lot of sense.  But I wanted to plug 
> bps and the hopefully growing set of bpf introspection tools:
> 
> https://github.com/iovisor/bcc/blob/master/introspection/bps_example.txt
> 
> Long term these are probably a good place to tell the admin what's going 
> on.
(related to bpf itself not modprobe subject)

Hi Chris,

I just want to point out that the tool 'bpftool', is currently the
dominating tool for eBPF introspection.  And the 'bps' tool you mention
seems to have gained less (open source) traction.

The bpftool is part of the kernel git-tree: tools/bpf/bpftool
 https://github.com/torvalds/linux/blob/master/tools/bpf/bpftool/

And it even have bash-completion and man-pages in RST format so they
even render nicely when viewed via github:

 
https://github.com/torvalds/linux/blob/master/tools/bpf/bpftool/Documentation/bpftool.rst
 
https://github.com/torvalds/linux/blob/master/tools/bpf/bpftool/Documentation/bpftool-prog.rst
 
https://github.com/torvalds/linux/blob/master/tools/bpf/bpftool/Documentation/bpftool-map.rst

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-22 Thread Luis R. Rodriguez
On Thu, Mar 22, 2018 at 3:15 PM, Andy Lutomirski  wrote:
>  All we need to do is to make sure that, if this is
> distributed as a module, that it's init routine doesn't wait for a
> long time, right?

Yeap.

 Luis


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-22 Thread Alexei Starovoitov

On 3/22/18 3:15 PM, Andy Lutomirski wrote:

On Thu, Mar 22, 2018 at 8:54 PM, Luis R. Rodriguez  wrote:

If we can ensure that these usermode modules don't take *any time at all* on
their init *from the start*, it would be wonderful and we'd end up avoiding
some really odd corner case issues later.



I don't see why this issue needs to exist at all for the new stuff.
After all, the new things aren't usermode modules per se.  They're
regular kernel code (modular or otherwise) that loads a usermode
helper.  All we need to do is to make sure that, if this is
distributed as a module, that it's init routine doesn't wait for a
long time, right?


I've implemented all of the previous suggestions and
now there are zero changes to kernel/module.c
I still need to finish tracpoint stuff first and polish umh code a bit
before sending new version.
Let's hold on this thread until then.



Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-22 Thread Andy Lutomirski
On Thu, Mar 22, 2018 at 8:54 PM, Luis R. Rodriguez  wrote:
> If we can ensure that these usermode modules don't take *any time at all* on
> their init *from the start*, it would be wonderful and we'd end up avoiding
> some really odd corner case issues later.
>

I don't see why this issue needs to exist at all for the new stuff.
After all, the new things aren't usermode modules per se.  They're
regular kernel code (modular or otherwise) that loads a usermode
helper.  All we need to do is to make sure that, if this is
distributed as a module, that it's init routine doesn't wait for a
long time, right?


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-22 Thread Luis R. Rodriguez
On Sat, Mar 10, 2018 at 03:16:52PM +, Luis R. Rodriguez wrote:
> On Sat, Mar 10, 2018 at 02:08:43PM +, Luis R. Rodriguez wrote:
> > The alternative to this would be a simple equivalent of 
> > try_then_request_module()
> > for UMH modules: try_umhm_then_request_umh_module() or whatever. So just as 
> > I
> > argued earlier over UMH limitations, this is not the end of the world for 
> > umh
> > modules, and it doesn't mean you can't get *properly* add umh modules 
> > upstream,
> > it would *just mean* we'd be perpetuating today's (IMHO) horrible and loose
> > semantics.
> 
> I was about to suggest that perhaps a try_umhm_then_request_umh_module() or
> whatever should not be a macro -- but instead an actual routine, and we don't
> export say the simple form to avoid non-deterministic uses of it from the
> start... but the thing is *it'd have to be a macro* given that the *check* for
> the module *has to be loose*, just as try_then_request_module()...
> 
> *Ugh* gross.
> 
> Another reason for me to want an actual deterministic clean proper solution
> from the start.

I just thought of another consideration which should be made here for the long
term.

Some init systems have a timeout for kmod workers, that is the userspace
process which issues the modprobe call.

That was very well intentioned, however it ended up being nonsense, so at least
on SLE systemd we disable the timeout for kmod workers.  What others do... is
unclear to me.  Upstream wise the timeout was increased considerably, however,
*if* such timeout is in effect for users it has some implicit implications on
the number of devices a driver could support:

number_devices =  systemd_timeout   
  - 
  max known probe time for driver  

I've documented the logic to these conclusions [0].

It sounds like we *do* want a full sync wait mechanism, and as I noted I think
we should fix the determinism aspect of it. Since no aliases will be supported
for usermode modules this will be much easier to support, and I can volunteer
to help with that.

However given the above... if we're going to use request_module() API (or a
really fixed deterministic version of it later) for usermode kernel modules,
the limitation above still applies.

Are these usermode modules doing all the handy work on init? Or can it be
deferred once loaded? How much loading on init should a usermode module need?

If we can ensure that these usermode modules don't take *any time at all* on
their init *from the start*, it would be wonderful and we'd end up avoiding
some really odd corner case issues later.

[0] http://www.do-not-panic.com/2015/12/linux-asynchronous-probe.html

  Luis


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-13 Thread Greg Kroah-Hartman
On Mon, Mar 12, 2018 at 10:22:00AM -0700, Alexei Starovoitov wrote:
> On 3/10/18 7:34 AM, Luis R. Rodriguez wrote:
> > Also,
> > 
> > Alexei you never answered my questions out aliases with the umh modules.
> > Long term this important to consider.
> 
> aliases always felt like a crutch to me.
> I can see an argument when they're used as 'alias pci:* foo'
> but the way it's used in networking with ip_set_* and nf-* is
> something I prefer not to ever do again.
> Definitely no aliases for bpfilter umh.

I agree, let's not do that if at all possible for these types of
binaries.

greg k-h


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-12 Thread Alexei Starovoitov

On 3/12/18 5:02 AM, Edward Cree wrote:

On 09/03/18 18:58, Alexei Starovoitov wrote:

It's not waiting for the whole thing, because once bpfilter starts it
stays running/sleeping because it's stateful.

So, this has been bugging me a bit.
If bpfilter takes a signal and crashes, all that state goes away.
Does that mean your iptables/netfilter config just got forgotten and next
 time you run iptables it disappears, so you have to re-apply it all again?

It needs normal
malloc-ed memory to keep the state of iptable->bpf translation that
it will use later during subsequent translation calls.
Theoretically it can use bpf maps pinned in kernel memory to keep
this state, but then it's non-swappable. It's better to keep bpfilter
state in its own user memory.

Perhaps the state should live in swappable kernel memory (e.g. a tmpfs
 thing, which bpfilter could access through a mount).  It'd be read-only
 to userspace, listing the existing rules (in untranslated form), and be
 updated to reflect the new rule after bpfilter has supplied the updated
 translation.
Then bpfilter can cache things if it wants, but the kernel remains the
 ultimate arbiter of the state and maintains it over a bpfilter crash.


seems like overkill.
I consider crashing bpfilter same severity as kernel bug.
Whatever firewall rules already installed will continue to work,
but new ones won't be able to load and current set cannot be queried.
Control plane crashed, dataplane continues to work.
Still a ton better than whole system crash.
We have plenty of work ahead of us without worrying about restarting
that umh and reloading its state from tmpfs.
Something to consider for later phases of the project.



Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-12 Thread Alexei Starovoitov

On 3/10/18 7:34 AM, Luis R. Rodriguez wrote:

Also,

Alexei you never answered my questions out aliases with the umh modules.
Long term this important to consider.


aliases always felt like a crutch to me.
I can see an argument when they're used as 'alias pci:* foo'
but the way it's used in networking with ip_set_* and nf-* is
something I prefer not to ever do again.
Definitely no aliases for bpfilter umh.



Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-12 Thread Edward Cree
On 09/03/18 18:58, Alexei Starovoitov wrote:
> It's not waiting for the whole thing, because once bpfilter starts it
> stays running/sleeping because it's stateful.
So, this has been bugging me a bit.
If bpfilter takes a signal and crashes, all that state goes away.
Does that mean your iptables/netfilter config just got forgotten and next
 time you run iptables it disappears, so you have to re-apply it all again?
> It needs normal
> malloc-ed memory to keep the state of iptable->bpf translation that
> it will use later during subsequent translation calls.
> Theoretically it can use bpf maps pinned in kernel memory to keep
> this state, but then it's non-swappable. It's better to keep bpfilter
> state in its own user memory.
Perhaps the state should live in swappable kernel memory (e.g. a tmpfs
 thing, which bpfilter could access through a mount).  It'd be read-only
 to userspace, listing the existing rules (in untranslated form), and be
 updated to reflect the new rule after bpfilter has supplied the updated
 translation.
Then bpfilter can cache things if it wants, but the kernel remains the
 ultimate arbiter of the state and maintains it over a bpfilter crash.

Sound reasonable?

-Ed


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-11 Thread Luis R. Rodriguez
On Sat, Mar 10, 2018 at 02:08:43PM +, Luis R. Rodriguez wrote:
> The alternative to this would be a simple equivalent of 
> try_then_request_module()
> for UMH modules: try_umhm_then_request_umh_module() or whatever. So just as I
> argued earlier over UMH limitations, this is not the end of the world for umh
> modules, and it doesn't mean you can't get *properly* add umh modules 
> upstream,
> it would *just mean* we'd be perpetuating today's (IMHO) horrible and loose
> semantics.

I was about to suggest that perhaps a try_umhm_then_request_umh_module() or
whatever should not be a macro -- but instead an actual routine, and we don't
export say the simple form to avoid non-deterministic uses of it from the
start... but the thing is *it'd have to be a macro* given that the *check* for
the module *has to be loose*, just as try_then_request_module()...

*Ugh* gross.

Another reason for me to want an actual deterministic clean proper solution
from the start.

  Luis


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-11 Thread Luis R. Rodriguez
Also,

Alexei you never answered my questions out aliases with the umh modules.
Long term this important to consider.

  Luis


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-10 Thread Andy Lutomirski
On Sat, Mar 10, 2018 at 1:43 AM, Alexei Starovoitov  wrote:
> On 3/9/18 11:37 AM, Andy Lutomirski wrote:
>>
>> On Fri, Mar 9, 2018 at 6:55 PM, David Miller  wrote:
>>>
>>> From: Alexei Starovoitov 
>>> Date: Fri, 9 Mar 2018 10:50:49 -0800
>>>
 On 3/9/18 10:23 AM, Andy Lutomirski wrote:
>
> It might not be totally crazy to back it by tmpfs.


 interesting. how do you propose to do it?
 Something like:
 - create /umh_module_tempxxx dir
 - mount tmpfs there
 - copy elf into it and exec it?
>>>
>>>
>>> I think the idea is that it's an internal tmpfs mount that only
>>> the kernel has access too.
>>
>>
>> That's what I was imagining.  There's precedent.  For example, there's
>> a very short piece of code that does it in
>> drivers/gpu/drm/i915/i915_gemfs.c.
>
>
> I can do "monkey see monkey do" approach which will look like:
> type = get_fs_type("tmpfs");
> fs = kern_mount(type);
>
> /* for each request_umh("foo") */
> file = shmem_file_setup_with_mnt(fs, "umh_foo");
> do {
>   pagecache_write_begin(file,...);
>   memcpy()
>   pagecache_write_end();
> } while (umh_elf_size);
> do_execve_file(file);
> fput(file);
>
> while keeping fs mounted forever?
> is there better way?
>

Nice!  I'm definitely not a pagecache expert, but it looks generally
sane.  Once the thing is actually functional, we can bang on it, and
I'm sure that linux-mm will have some suggestions to tidy it up.

As for the actual lifetime of the filesystem, I think it should be
mounted once and never unmounted.  Whenever it gains a second user,
the whole thing can be moved to mm/ or lib/ and all the users can
share the same mount.

Minor caveat: I would arrange the code a bit differently, like this:

static (or extern) unsigned char __initdata the_blob[];
static struct file *umh_blob_file;

static int __init my_module_init_function(void)
{
 /* for each request_umh("foo") */
 umh_blob_file = shmem_file_setup_with_mnt(fs, "umh_foo");
 do {
   pagecache_write_begin(umh_file,...);
   memcpy()
   pagecache_write_end();
 } while (umh_elf_size);

 /* the_blob is implicitly freed after this returns */
}

and then actually use the struct file later on.  If and when you're
sure you're not going to spawn another copy, you can fput() it.  This
way the memory becomes swappable immediately on load.

As for request_module() vs request_module_umh(), my advice would be to
write the code and then see what interface makes sense.  I wouldn't be
surprised if it ends up making more sense to keep all of this entirely
independent from the module system.

P.S. I suspect that, before this hits a release, someone's going to
have to fiddle with the LSM hooks in do_execve() a bit to make sure
that LSM unconditionally approves this type of umh program.  Otherwise
there might be pointless failures on some more locked down
configurations.  But that can wait until it's more final and the
security folks review the code.


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-10 Thread Luis R. Rodriguez
On Fri, Mar 09, 2018 at 06:34:18PM -0800, Alexei Starovoitov wrote:
> On 3/9/18 11:38 AM, Linus Torvalds wrote:
> > On Fri, Mar 9, 2018 at 11:12 AM, Linus Torvalds
> >  wrote:
> > > 
> > > How are you going to handle five processes doing the same setup
> > > concurrently?

Let's keep in mind we don't have a formal way to specify this as well
for modules as well, other than kconfig. Ie it'd be up to the author
of the modules to pick this up and make things mutually exclusive.

> > Side note: it's not just serialization. It's also "is it actually up
> > and running".
> > 
> > The rule for "request_module()" (for a real module) has been that it
> > returns when the module is actually alive and active and have done
> > their initcalls.

Unfortunately this is not accurate though, the loose grammar around this fact
is one of the reasons why long term I think either new API should be added, or
the existing request_module() API modified.

request_module() is not deterministic today, try_then_request_module() *is* but
its IMHO its a horrible grammatical solution, and I'm not a fan of the idea of
umh modules perpetuating this nonsense *long term*. Details below.

> > The UMH_WAIT_EXEC behavior (ignore the serialization - you could do
> > that in the caller) behavior doesn't actually have any semantics AT
> > ALL. It only means that you get the error returns from execve()
> > itself, so you know that the executable file actually existed and
> > parsed right enough to get started.
> > 
> > But you don't actually have any reason to believe that it has *done*
> > anything, and started processing any requests. There's no reason
> > what-so-ever to believe that it has registered itself for any
> > asynchronous requests or anything like that.
> 
> I agree that sync approach nicely fits this use case, but waiting
> for umh brings the whole set of suspend/resume issues that
> Luis explained earlier and if I understood his points that stuff
> is still not quite working right

It works enough that suspend works well enough on Linux today, but the primary
reason is that the big kernel/umh.c API user today is the firmware API for the
old fallback firmware interface and it has in place the sync
usermodehelper_read_trylock() and async async usermodehelper_read_lock_wait().
Note that in the fallback case there is just the uevent call.

> and he's planning a set of fixes.

The move of the umh locks out of the non-fallback case was a long term goal I
had which I was planning on doing slowly, but recently got jump started via
v4.14 via commit f007cad159e9 ("Revert "firmware: add sanity check on
shutdown/suspend"). As such that goal is now complete thanks to Linus correctly
pushing it forward.

> I really don't want the unknown timeline of fixes for 'waiting umh'
> to become a blocker for the bpfilter project ...

The reason for me to bring up the suspend stuff was that there no other
*heavy* UMH API users in the kernel today, and just to highlight that
care must be taken if we want to consider in the future further
possibly heavy UMH callers which we *can* expect to be called around
suspend and resume.

*Iff* that will be the case for these new umh userspace modules, we can
evaluate a future plan. But not that this is as vague as suggesting the
same for any further kernel future UMH API user! If the only umh module
we expect for a while will be bpfilter and it we don't expect heavy use
during suspend/resume this is a non-issue and likely won't be for a while,
and all that *should be done* is become aware of the today's limitations
as we *document* any new API, even if its the simple and easy
request_umh_module*() calls.

Today's use of the UMH API to always use helper_lock() and prevent suspend
while a call is being issued should suffice, so long as these umh modules don't
become heavy users during suspend/resume.

Note that there even *further* further advanced suspend/resume considerations
with filesystems but we have a reasonable hand on what to do there [0] and
it frankly isn't *that* much work as I have most of it done already.

The only other suspend optimization I could think of left to evaluate may be
whether or not to we should generalize something like the firmware cache for
other UMH callers but that's really a long term topic.

So I would not say there is pending work left to do, it should suffice
to document the semantics and limitations if the umh modules are to be
added. That's it.

Linus has proven me right that the *concerns* I've had for these corner
cases are just that, and I do believe documenting the limitations should
suffice for new APIs.

[0] https://lkml.kernel.org/r/20180126090923.gd12...@wotan.suse.de

> Also I like Luis suggestion to use some other name than request_module()
> Something like:
> request_umh_module_sync("foo");
> request_umh_module_nowait("foo");
> 
> in both cases the kernel would create a pipe, call umh either
> with UMH_WAIT_PROC or UMH_WAIT_EXEC and make 

Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-09 Thread Alexei Starovoitov

On 3/9/18 11:38 AM, Linus Torvalds wrote:

On Fri, Mar 9, 2018 at 11:12 AM, Linus Torvalds
 wrote:


How are you going to handle five processes doing the same setup concurrently?


Side note: it's not just serialization. It's also "is it actually up
and running".

The rule for "request_module()" (for a real module) has been that it
returns when the module is actually alive and active and have done
their initcalls.

The UMH_WAIT_EXEC behavior (ignore the serialization - you could do
that in the caller) behavior doesn't actually have any semantics AT
ALL. It only means that you get the error returns from execve()
itself, so you know that the executable file actually existed and
parsed right enough to get started.

But you don't actually have any reason to believe that it has *done*
anything, and started processing any requests. There's no reason
what-so-ever to believe that it has registered itself for any
asynchronous requests or anything like that.

So in the real module case, you can do

request_module("modulename");

and just start using whatever resource you just requested. So the
netfilter code literally does

request_module("nft-chain-%u-%.*s", family,
   nla_len(nla), (const char *)nla_data(nla));
nfnl_lock(NFNL_SUBSYS_NFTABLES);
type = __nf_tables_chain_type_lookup(nla, family);
if (type != NULL)
return ERR_PTR(-EAGAIN);

and doesn't even care about error handling for request_module()
itself, because it knows that either the module got loaded and is
ready, or something failed. And it needs to look that chain type up
anyway, so the failure is indicated by _that_.

With a UMH_WAIT_EXEC? No. You have *nothing*. You know the thing
started, but it might have SIGSEGV'd immediately, and you have
absolutely no way of knowing, and absolutely no way of even figuring
it out. You can wait - forever - for something to bind to whatever
dynamic resource you're expecting. You'll just fundamentally never
know.

You can try again, of course. Add a timeout, and try again in five
seconds or something. Maybe it will work then. Maybe it won't. You
won't have any way to know the _second_ time around either. Or the
third. Or...

See why I say it has to be synchronous?

If it's synchronous, you can actually do things like

 (a) maybe you only need a one-time thing, and don't have any state
("load fixed tables, be done") and that's it. If the process returns
with no error code, you're all done, and you know it's fine.


I agree that sync approach nicely fits this use case, but waiting
for umh brings the whole set of suspend/resume issues that
Luis explained earlier and if I understood his points that stuff
is still not quite working right and he's planning a set of fixes.
I really don't want the unknown timeline of fixes for 'waiting umh'
to become a blocker for the bpfilter project ...


 (b) maybe the process wants to start a listener daemon or something
like the traditional inetd model. It can open the socket, it can start
listening on it, and it can fork off a child and check it's status. It
can then do exit(0) if everything is fine, and now request_module()
returns.


... while for bpfilter use case we need a daemon and starting umh
with UMH_WAIT_PROC which does fork() right away and parent does exit(0)
is not any different from kernel pov than UMH_WAIT_EXEC.
The kernel will be in the same situation described above. The forked
process could have sigsegv quickly (right after telling parent that
everything is ok) and kernel is waiting forever.

That situation is what I was alluding to regarding that kernel<->umh
need to have some sort of health check protocol.
Regardless of how umh is invoked.

I think what Andy proposing with kernel creating a pipe and giving
it to umh can be used as this health check and means of
'load exclusion' to make sure that only one requested umh is running.

Also I like Luis suggestion to use some other name than request_module()
Something like:
request_umh_module_sync("foo");
request_umh_module_nowait("foo");

in both cases the kernel would create a pipe, call umh either
with UMH_WAIT_PROC or UMH_WAIT_EXEC and make sure that umh
responds to first hello message.
On success they would return a handle to that pipe and umh's pid.
The further interaction between the kernel and umh will be
via that pipe. If kernel->umh request times out the kernel
side of bpfilter would sigkill the task and do request_umh() again.
If that request_umh() fails there will be no retry, since
at this point it's clear that given umh is broken.

I'll implement only _nowait() version, since that's what we need
for bpfilter and when suspend/resume issues are solved somebody
who cares about usb driver in user mode can implement the _sync()
variant.

All that on top of tmpfs->file->execve_file hack that I still
need feedback on.



Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-09 Thread Alexei Starovoitov

On 3/9/18 11:37 AM, Andy Lutomirski wrote:

On Fri, Mar 9, 2018 at 6:55 PM, David Miller  wrote:

From: Alexei Starovoitov 
Date: Fri, 9 Mar 2018 10:50:49 -0800


On 3/9/18 10:23 AM, Andy Lutomirski wrote:

It might not be totally crazy to back it by tmpfs.


interesting. how do you propose to do it?
Something like:
- create /umh_module_tempxxx dir
- mount tmpfs there
- copy elf into it and exec it?


I think the idea is that it's an internal tmpfs mount that only
the kernel has access too.


That's what I was imagining.  There's precedent.  For example, there's
a very short piece of code that does it in
drivers/gpu/drm/i915/i915_gemfs.c.


I can do "monkey see monkey do" approach which will look like:
type = get_fs_type("tmpfs");
fs = kern_mount(type);

/* for each request_umh("foo") */
file = shmem_file_setup_with_mnt(fs, "umh_foo");
do {
  pagecache_write_begin(file,...);
  memcpy()
  pagecache_write_end();
} while (umh_elf_size);
do_execve_file(file);
fput(file);

while keeping fs mounted forever?
is there better way?



Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-09 Thread Andy Lutomirski
On Fri, Mar 9, 2018 at 7:38 PM, Linus Torvalds
 wrote:
> On Fri, Mar 9, 2018 at 11:12 AM, Linus Torvalds
>  wrote:
>>
>> How are you going to handle five processes doing the same setup concurrently?
>
> Side note: it's not just serialization. It's also "is it actually up
> and running".
>

I think the right way to solve this would be to take a hint from
systemd's socket activation model.  The current patch had the module
load process kick off an ELF binary that goes an registers itself to
handle something.  We can turn that around.  Make the module init
function create the socket (or pipe or whatever) receives request and
pass it to the user program as stdin.  Then the kernel can start
queueing requests into the socket immediately, and the user program
will get to them whenever it finishes initializing.  Or it can write
some message to the socket saying "hey, I'm ready".

This also completely avoids the issue where some clever user manually
loads the "module" with exec() ("hey, I'm so clever, I can just run
the damn thing instead if using init_module()!" or writes an
out-of-tree program that uses whatever supposedly secret API the
in-kernel binary is supposed to use to register itself (and I know
people who would do exactly that!) and the kernel does
request_module() at roughly the same time.


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-09 Thread Andy Lutomirski
On Fri, Mar 9, 2018 at 6:55 PM, David Miller  wrote:
> From: Alexei Starovoitov 
> Date: Fri, 9 Mar 2018 10:50:49 -0800
>
>> On 3/9/18 10:23 AM, Andy Lutomirski wrote:
>>> It might not be totally crazy to back it by tmpfs.
>>
>> interesting. how do you propose to do it?
>> Something like:
>> - create /umh_module_tempxxx dir
>> - mount tmpfs there
>> - copy elf into it and exec it?
>
> I think the idea is that it's an internal tmpfs mount that only
> the kernel has access too.

That's what I was imagining.  There's precedent.  For example, there's
a very short piece of code that does it in
drivers/gpu/drm/i915/i915_gemfs.c.


>
> And I don't think that even hurts your debuggability concerns.  The
> user can just attach using the foo.ko file in the actual filesystem.
>

Not if the .ko is actually a shim that actually just contains a blob
and a few lines of code to kick off the umh.  But one could still
debug it using kernel debug symbols (like vDSO debugging works right
now, at least if your distro is in a good mood) or by reading the
contents from /proc/PID/exe.


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-09 Thread Linus Torvalds
On Fri, Mar 9, 2018 at 11:12 AM, Linus Torvalds
 wrote:
>
> How are you going to handle five processes doing the same setup concurrently?

Side note: it's not just serialization. It's also "is it actually up
and running".

The rule for "request_module()" (for a real module) has been that it
returns when the module is actually alive and active and have done
their initcalls.

The UMH_WAIT_EXEC behavior (ignore the serialization - you could do
that in the caller) behavior doesn't actually have any semantics AT
ALL. It only means that you get the error returns from execve()
itself, so you know that the executable file actually existed and
parsed right enough to get started.

But you don't actually have any reason to believe that it has *done*
anything, and started processing any requests. There's no reason
what-so-ever to believe that it has registered itself for any
asynchronous requests or anything like that.

So in the real module case, you can do

request_module("modulename");

and just start using whatever resource you just requested. So the
netfilter code literally does

request_module("nft-chain-%u-%.*s", family,
   nla_len(nla), (const char *)nla_data(nla));
nfnl_lock(NFNL_SUBSYS_NFTABLES);
type = __nf_tables_chain_type_lookup(nla, family);
if (type != NULL)
return ERR_PTR(-EAGAIN);

and doesn't even care about error handling for request_module()
itself, because it knows that either the module got loaded and is
ready, or something failed. And it needs to look that chain type up
anyway, so the failure is indicated by _that_.

With a UMH_WAIT_EXEC? No. You have *nothing*. You know the thing
started, but it might have SIGSEGV'd immediately, and you have
absolutely no way of knowing, and absolutely no way of even figuring
it out. You can wait - forever - for something to bind to whatever
dynamic resource you're expecting. You'll just fundamentally never
know.

You can try again, of course. Add a timeout, and try again in five
seconds or something. Maybe it will work then. Maybe it won't. You
won't have any way to know the _second_ time around either. Or the
third. Or...

See why I say it has to be synchronous?

If it's synchronous, you can actually do things like

 (a) maybe you only need a one-time thing, and don't have any state
("load fixed tables, be done") and that's it. If the process returns
with no error code, you're all done, and you know it's fine.

 (b) maybe the process wants to start a listener daemon or something
like the traditional inetd model. It can open the socket, it can start
listening on it, and it can fork off a child and check it's status. It
can then do exit(0) if everything is fine, and now request_module()
returns.

see the difference? Even if you ended up with a background process
(like in that (b) case), you did so with *error* handling, and you did
so knowing that the state has actually been set up by the time the
request_module() returns.

And if you use the proper module loading exclusion, it also means that
that (b) can know it's the only process starting up, and it's not
racing with another one.  It might still want to do the usual
lock-files in user space to protect against just the admin starting it
manually, but at least you don't have the situation that a hundred
threads just had a thundering herd where they all ended up using the
same kernel facility, and they all independently started a hundred
usermode helpers.

  Linus


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-09 Thread Linus Torvalds
On Fri, Mar 9, 2018 at 10:57 AM, David Miller  wrote:
>
> Once the helper UMH is invoked, it runs asynchronously taking eBPF
> translation requests.

How?

Really. See my comment about mutual exclusion. The current patch is
*broken* because it doesn't handle it. Really.

Think of it this way - you may have now started *five* of those things
concurrently by mistake.

The actual module loading case never does that, because the actual
module loading case has per-module serialization that got
short-circuited.

How are you going to handle five processes doing the same setup concurrently?

  Linus


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-09 Thread Alexei Starovoitov

On 3/9/18 10:50 AM, Linus Torvalds wrote:

On Fri, Mar 9, 2018 at 10:43 AM, Kees Cook  wrote:


Module loading (via kernel_read_file()) already uses
deny_write_access(), and so does do_open_execat(). As long as module
loading doesn't call allow_write_access() before the execve() has
started in the new implementation, I think we'd be covered here.


No. kernel_read_file() only does it *during* the read.

So there's a huge big honking gap between the two.

Also, the second part of my suggestion was to be entirely synchronous
with the whole execution of the process, and do it within the "we do
mutual exclusion fo rmodules with the same name" logic.

Note that Andrei's patch uses UMH_WAIT_EXEC. That's basically
"vfork+exec" - it only waits for the exec to have started, it doesn't
wait for the whole thing.


It's not waiting for the whole thing, because once bpfilter starts it
stays running/sleeping because it's stateful. It needs normal
malloc-ed memory to keep the state of iptable->bpf translation that
it will use later during subsequent translation calls.
Theoretically it can use bpf maps pinned in kernel memory to keep
this state, but then it's non-swappable. It's better to keep bpfilter
state in its own user memory.



Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-09 Thread David Miller
From: Linus Torvalds 
Date: Fri, 9 Mar 2018 10:53:45 -0800

> Anyway, see my other suggestion that makes this all irrelevant. Just
> wait synchronously (until the exit), and just use deny_write_access().

What exit?

Once the helper UMH is invoked, it runs asynchronously taking eBPF
translation requests.


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-09 Thread David Miller
From: Alexei Starovoitov 
Date: Fri, 9 Mar 2018 10:50:49 -0800

> On 3/9/18 10:23 AM, Andy Lutomirski wrote:
>> It might not be totally crazy to back it by tmpfs.
> 
> interesting. how do you propose to do it?
> Something like:
> - create /umh_module_tempxxx dir
> - mount tmpfs there
> - copy elf into it and exec it?

I think the idea is that it's an internal tmpfs mount that only
the kernel has access too.

And I don't think that even hurts your debuggability concerns.  The
user can just attach using the foo.ko file in the actual filesystem.



Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-09 Thread Kees Cook
On Fri, Mar 9, 2018 at 10:50 AM, Linus Torvalds
 wrote:
> On Fri, Mar 9, 2018 at 10:43 AM, Kees Cook  wrote:
>>
>> Module loading (via kernel_read_file()) already uses
>> deny_write_access(), and so does do_open_execat(). As long as module
>> loading doesn't call allow_write_access() before the execve() has
>> started in the new implementation, I think we'd be covered here.
>
> No. kernel_read_file() only does it *during* the read.

Ah, true. And looking at this again, shouldn't deny_write_access()
happen _before_ the LSM check in kernel_read_file()? That looks like a
problem...

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-09 Thread Linus Torvalds
On Fri, Mar 9, 2018 at 10:48 AM, Andy Lutomirski  wrote:
>> On Mar 9, 2018, at 10:17 AM, Linus Torvalds  
>> wrote:
>>
>> Hmm. I wish we had an "execute blob" model, but we really don't, and
>> it would be hard/impossible to do without pinning the pages in memory.
>>
>
> Why so hard?  We can already execute a struct file for execveat, and Alexei 
> already has this working for umh.
> Surely we can make an immutable (as in even root can’t write it) 
> kernel-internal tmpfs file, execveat it, then unlink it.

And what do you think that does? It pins the memory for the whole
time. As a *copy* of the original file.

Anyway, see my other suggestion that makes this all irrelevant. Just
wait synchronously (until the exit), and just use deny_write_access().

The "synchronous wait" means that you don't have the semantic change
(and really., it's *required* anyway for the whole mutual exclusion
against another thread racing to load the same module), and the
deny_write_access() means that we don't neeed to make another copy.

Linus


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-09 Thread Alexei Starovoitov

On 3/9/18 10:23 AM, Andy Lutomirski wrote:




On Mar 9, 2018, at 10:15 AM, Greg KH  wrote:





Oh, and for the record, I like Andy's proposal as well as dumping this
into a kernel module "blob" with the exception that this now would take
up unswapable memory, which isn't the nicest and is one big reason we
removed the in-kernel-memory firmware blobs many years ago.



It might not be totally crazy to back it by tmpfs.


interesting. how do you propose to do it?
Something like:
- create /umh_module_tempxxx dir
- mount tmpfs there
- copy elf into it and exec it?



Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-09 Thread Linus Torvalds
On Fri, Mar 9, 2018 at 10:43 AM, Kees Cook  wrote:
>
> Module loading (via kernel_read_file()) already uses
> deny_write_access(), and so does do_open_execat(). As long as module
> loading doesn't call allow_write_access() before the execve() has
> started in the new implementation, I think we'd be covered here.

No. kernel_read_file() only does it *during* the read.

So there's a huge big honking gap between the two.

Also, the second part of my suggestion was to be entirely synchronous
with the whole execution of the process, and do it within the "we do
mutual exclusion fo rmodules with the same name" logic.

Note that Andrei's patch uses UMH_WAIT_EXEC. That's basically
"vfork+exec" - it only waits for the exec to have started, it doesn't
wait for the whole thing.

So I'm saying "use UMH_WAIT_PROC, do it in a different place, and make
sure you cover the whole sequence with deny_write_access()".

  Linus


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-09 Thread Andy Lutomirski


> On Mar 9, 2018, at 10:17 AM, Linus Torvalds  
> wrote:
> 

> 
> Hmm. I wish we had an "execute blob" model, but we really don't, and
> it would be hard/impossible to do without pinning the pages in memory.
> 

Why so hard?  We can already execute a struct file for execveat, and Alexei 
already has this working for umh. Surely we can make an immutable (as in even 
root can’t write it) kernel-internal tmpfs file, execveat it, then unlink it. 
And /proc/PID/exe should be openable and readable.  The blob itself would be 
__initdata so it gets discarded after it lands in tmpfs. 

Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-09 Thread Kees Cook
On Fri, Mar 9, 2018 at 10:35 AM, David Miller  wrote:
> From: Linus Torvalds 
> Date: Fri, 9 Mar 2018 10:17:42 -0800
>
>>  - use deny_write_access() to make sure that we don't have active
>> writers and cannot get them during the execve.
>
> I agree that this is necessary for image validation purposes.

Module loading (via kernel_read_file()) already uses
deny_write_access(), and so does do_open_execat(). As long as module
loading doesn't call allow_write_access() before the execve() has
started in the new implementation, I think we'd be covered here.

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-09 Thread David Miller
From: Linus Torvalds 
Date: Fri, 9 Mar 2018 10:17:42 -0800

>  - use deny_write_access() to make sure that we don't have active
> writers and cannot get them during the execve.

I agree that this is necessary for image validation purposes.


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-09 Thread Greg KH
On Fri, Mar 09, 2018 at 10:23:27AM -0800, Andy Lutomirski wrote:
> 
> 
> > On Mar 9, 2018, at 10:15 AM, Greg KH  wrote:
> > 
> 
> > 
> > Oh, and for the record, I like Andy's proposal as well as dumping this
> > into a kernel module "blob" with the exception that this now would take
> > up unswapable memory, which isn't the nicest and is one big reason we
> > removed the in-kernel-memory firmware blobs many years ago.
> > 
> 
> It might not be totally crazy to back it by tmpfs.

Ah, yeah, tricky, but yes, I'd be fine with that.

Micro-kernel here we come!  :)


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-09 Thread Andy Lutomirski


> On Mar 9, 2018, at 10:15 AM, Greg KH  wrote:
> 

> 
> Oh, and for the record, I like Andy's proposal as well as dumping this
> into a kernel module "blob" with the exception that this now would take
> up unswapable memory, which isn't the nicest and is one big reason we
> removed the in-kernel-memory firmware blobs many years ago.
> 

It might not be totally crazy to back it by tmpfs. 


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-09 Thread Linus Torvalds
On Thu, Mar 8, 2018 at 9:08 PM, Alexei Starovoitov  wrote:
>
> there is not abi breakage and file cannot disappear from running task.
> One cannot umount fs while file is still being used.

I think that "cannot umount" part _is_ the ABI breakage that Andy is
talking about.

> Not only "read twice", but "read many".
> If .text sections of elf that are not yet in memory can be modified
> by malicious user, later they will be brought in with different code.
> I think the easiest fix to tighten this "umh modules" to CAP_SYS_ADMIN.

I don't think it actually fixes anything.  It might just break things.
For all we know, people run modprobe with CAP_SYS_MODULE only, since
that is obviously the only capability it needs.

Hmm. I wish we had an "execute blob" model, but we really don't, and
it would be hard/impossible to do without pinning the pages in memory.

My gut feel is that the right direction to explore is:

 - consider the module loaded for the whole duration of the execve. So
the execution is a *blocking* operation (and we get the correct
exclusion semantics)

 - use deny_write_access() to make sure that we don't have active
writers and cannot get them during the execve.

The above mean that something that executes to load a new ebpf rule
will work very well.  But a "start and forget" will not work (although
you can obviously do so with a internal fork/exec).

Hmm?

   Linus


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-09 Thread Greg KH
On Fri, Mar 09, 2018 at 09:32:36AM -0800, Alexei Starovoitov wrote:
> On 3/9/18 8:24 AM, Andy Lutomirski wrote:
> > On Fri, Mar 9, 2018 at 3:39 PM, Alexei Starovoitov  wrote:
> > > On 3/9/18 7:16 AM, Andy Lutomirski wrote:
> > > > > > 
> > > > > > On Mar 8, 2018, at 9:08 PM, Alexei Starovoitov  wrote:
> > > > > > 
> > > > > > On 3/8/18 7:54 PM, Andy Lutomirski wrote:
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > > On Mar 8, 2018, at 7:06 PM, Linus Torvalds
> > > > > > >  wrote:
> > > > > > > 
> > > > > > > 
> > > > > > > Honestly, that "read twice" thing may be what scuttles this.
> > > > > > > Initially, I thought it was a non-issue, because anybody who 
> > > > > > > controls
> > > > > > > the module subdirectory enough to rewrite files would be in a 
> > > > > > > position
> > > > > > > to just execute the file itself directly instead.
> > > > > > 
> > > > > > 
> > > > > > On further consideration, I think there’s another showstopper. This
> > > > > > patch is a potentially severe ABI break. Right now, loading a module
> > > > > > *copies* it into memory and does not hold a reference to the 
> > > > > > underlying fs.
> > > > > > With the patch applied, all kinds of use cases can break in gnarly 
> > > > > > ways.
> > > > > > Initramfs is maybe okay, but initrd may be screwed. If you load an 
> > > > > > ET_EXEC
> > > > > > module from initrd, then umount it, then clear the ramdisk, 
> > > > > > something will
> > > > > > go horribly wrong. Exactly what goes wrong depends on whether 
> > > > > > userspace
> > > > > > notices that umount() failed. Similarly, if you load one of these 
> > > > > > modules
> > > > > > over a network and then lose your connection, you have a problem.
> > > > > 
> > > > > 
> > > > > there is not abi breakage and file cannot disappear from running task.
> > > > > One cannot umount fs while file is still being used.
> > > > 
> > > > 
> > > > Sure it is. Without your patch, init_module doesn’t keep using the
> > > > file, so it’s common practice to load a module and then delete or
> > > > unmount it. With your patch, the unmount case breaks. This is likely
> > > > to break existing userspace, so, in Linux speak it’s an ABI break.
> > > 
> > > 
> > > please read the patch again.
> > > file is only used in case of umh modules.
> > > There is zero difference in default case.
> > 
> > Say I'm running some distro or other working Linux setup.  I upgrade
> > my kernel to a kernel that uses umh modules.  The user tooling
> > generates some kind of boot entry that references the new kernel
> > image, and it also generates a list of modules to be loaded at various
> > times in the boot process.  This list might, and probably should,
> > include one or more umh modules.  (You are being very careful to make
> > sure that depmod keeps working, so umh modules are clearly intended to
> > work with existing tooling.)  So now I have a kernel image and some
> > modules to be loaded from various places.  And I have an init script
> > (initramfs's '/init' or similar) that will call init_module() on that
> > .ko file.  That script was certainly written under the assumption
> > that, once init_module() returns, the kernel is done with the .ko
> > file.  With your patch applied, that assumption is no longer true.
> 
> There is no intent to use umh modules during boot process.

For _your_ use case, yes.  For mine and Andy's and someone else's in the
future, it might be :)

You are creating a very generic, new, user/kernel api that a whole bunch
of people are going to want to use.  Let's not hamper the ability for us
all to use this right from the beginning please.

> This is not a replacement for drivers and kernel modules.
> From your earlier comments regarding usb driver as umh module
> I suspect you're assuming that everything will sooner or later
> will convert to umh model.

We have userspace drivers for USB today, being able to drag that
out-of-tree codebase into the kernel is a _HUGE_ bonus, and something
that I would love to do for a lot of reasons.  I also can see moving
some of our existing in-kernel drivers out of the kernel in a way that
provides "it just works" functionality by using this type of feature.

So again, please don't prevent us from using this, otherwise you should
be just calling this "bpf_usermode_helper" :)

Oh, and for the record, I like Andy's proposal as well as dumping this
into a kernel module "blob" with the exception that this now would take
up unswapable memory, which isn't the nicest and is one big reason we
removed the in-kernel-memory firmware blobs many years ago.

thanks,

greg k-h


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-09 Thread Alexei Starovoitov

On 3/9/18 8:24 AM, Andy Lutomirski wrote:

On Fri, Mar 9, 2018 at 3:39 PM, Alexei Starovoitov  wrote:

On 3/9/18 7:16 AM, Andy Lutomirski wrote:


On Mar 8, 2018, at 9:08 PM, Alexei Starovoitov  wrote:

On 3/8/18 7:54 PM, Andy Lutomirski wrote:




On Mar 8, 2018, at 7:06 PM, Linus Torvalds
 wrote:


Honestly, that "read twice" thing may be what scuttles this.
Initially, I thought it was a non-issue, because anybody who controls
the module subdirectory enough to rewrite files would be in a position
to just execute the file itself directly instead.



On further consideration, I think there’s another showstopper. This
patch is a potentially severe ABI break. Right now, loading a module
*copies* it into memory and does not hold a reference to the underlying fs.
With the patch applied, all kinds of use cases can break in gnarly ways.
Initramfs is maybe okay, but initrd may be screwed. If you load an ET_EXEC
module from initrd, then umount it, then clear the ramdisk, something will
go horribly wrong. Exactly what goes wrong depends on whether userspace
notices that umount() failed. Similarly, if you load one of these modules
over a network and then lose your connection, you have a problem.



there is not abi breakage and file cannot disappear from running task.
One cannot umount fs while file is still being used.



Sure it is. Without your patch, init_module doesn’t keep using the
file, so it’s common practice to load a module and then delete or
unmount it. With your patch, the unmount case breaks. This is likely
to break existing userspace, so, in Linux speak it’s an ABI break.



please read the patch again.
file is only used in case of umh modules.
There is zero difference in default case.


Say I'm running some distro or other working Linux setup.  I upgrade
my kernel to a kernel that uses umh modules.  The user tooling
generates some kind of boot entry that references the new kernel
image, and it also generates a list of modules to be loaded at various
times in the boot process.  This list might, and probably should,
include one or more umh modules.  (You are being very careful to make
sure that depmod keeps working, so umh modules are clearly intended to
work with existing tooling.)  So now I have a kernel image and some
modules to be loaded from various places.  And I have an init script
(initramfs's '/init' or similar) that will call init_module() on that
.ko file.  That script was certainly written under the assumption
that, once init_module() returns, the kernel is done with the .ko
file.  With your patch applied, that assumption is no longer true.


There is no intent to use umh modules during boot process.
This is not a replacement for drivers and kernel modules.
From your earlier comments regarding usb driver as umh module
I suspect you're assuming that everything will sooner or later
will convert to umh model.
There is no such intent. umh approach is targeting one specific
use case of converting one stable uapi into another stable uapi.
It's all control plane that can be a slow as it needs to be.
Critical kernel datapath is not going to be affected
(especially the one needed to boot)
because umh is a user mode app running async with the rest of kernel.

With patch applied there are still zero users of it.
bpfilter and nft2bpf are the only two that are going to use
this interface. Every other potential user will be code reviewed
just like everything else in the kernel land.
So your statement that with patch applied there is an ABI breakage
is just false.

At the same time I agree that keeping fs pinned while umh module
started from that fs is not great, so I intend to solve it somehow
in v2 while keeping the approach being elf based for
debuggability reasons explained earlier.


Heck, on my laptop, all my .ko files are labeled
system_u:object_r:modules_object_t:s0.  I wonder how many SELinux
setups (and AppArmor, etc) will actually disallow execve() on modules?


I don't think it's a good idea to move lsm into umh.


Can you please try to have a constructive discussion here?


I'd like to ask the same favor.
Claiming ABI breakage when there is none is not constructive.
Saying that "ohh there must be a security issue here, because it looks
complex" is not constructive either.



Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-09 Thread Andy Lutomirski
On Fri, Mar 9, 2018 at 3:39 PM, Alexei Starovoitov  wrote:
> On 3/9/18 7:16 AM, Andy Lutomirski wrote:

 On Mar 8, 2018, at 9:08 PM, Alexei Starovoitov  wrote:

 On 3/8/18 7:54 PM, Andy Lutomirski wrote:



> On Mar 8, 2018, at 7:06 PM, Linus Torvalds
>  wrote:
>
>
> Honestly, that "read twice" thing may be what scuttles this.
> Initially, I thought it was a non-issue, because anybody who controls
> the module subdirectory enough to rewrite files would be in a position
> to just execute the file itself directly instead.


 On further consideration, I think there’s another showstopper. This
 patch is a potentially severe ABI break. Right now, loading a module
 *copies* it into memory and does not hold a reference to the underlying fs.
 With the patch applied, all kinds of use cases can break in gnarly ways.
 Initramfs is maybe okay, but initrd may be screwed. If you load an ET_EXEC
 module from initrd, then umount it, then clear the ramdisk, something will
 go horribly wrong. Exactly what goes wrong depends on whether userspace
 notices that umount() failed. Similarly, if you load one of these modules
 over a network and then lose your connection, you have a problem.
>>>
>>>
>>> there is not abi breakage and file cannot disappear from running task.
>>> One cannot umount fs while file is still being used.
>>
>>
>> Sure it is. Without your patch, init_module doesn’t keep using the
>> file, so it’s common practice to load a module and then delete or
>> unmount it. With your patch, the unmount case breaks. This is likely
>> to break existing userspace, so, in Linux speak it’s an ABI break.
>
>
> please read the patch again.
> file is only used in case of umh modules.
> There is zero difference in default case.

Say I'm running some distro or other working Linux setup.  I upgrade
my kernel to a kernel that uses umh modules.  The user tooling
generates some kind of boot entry that references the new kernel
image, and it also generates a list of modules to be loaded at various
times in the boot process.  This list might, and probably should,
include one or more umh modules.  (You are being very careful to make
sure that depmod keeps working, so umh modules are clearly intended to
work with existing tooling.)  So now I have a kernel image and some
modules to be loaded from various places.  And I have an init script
(initramfs's '/init' or similar) that will call init_module() on that
.ko file.  That script was certainly written under the assumption
that, once init_module() returns, the kernel is done with the .ko
file.  With your patch applied, that assumption is no longer true.

RHEL 5 uses initrd and is still supported.  For all I know, some
embedded setups put their initial /lib on some block device that
literally disappears after they finish booting.  There are livecds
that let you boot in a mode that lets you remove the CD after you
finish booting.  Heck, someone must have built something that calls
init_module() on a FUSE filesystem.

Heck, on my laptop, all my .ko files are labeled
system_u:object_r:modules_object_t:s0.  I wonder how many SELinux
setups (and AppArmor, etc) will actually disallow execve() on modules?

>
>>>

 The “read twice” thing is also bad for another reason: containers.
 Suppose I have a setup where a container can load a signed module blob. 
 With
 the read twice code, the container can race and run an entirely different
 blob outside the container.
>>>
>>>
>>> Not only "read twice", but "read many".
>>> If .text sections of elf that are not yet in memory can be modified
>>> by malicious user, later they will be brought in with different code.
>>> I think the easiest fix to tighten this "umh modules" to CAP_SYS_ADMIN.
>>
>>
>> Given this issue, I think the patch would need Kees’s explicit ack. I
>> had initially thought your patch had minimal security impact, but I
>> was wrong Module security is very complicated and needs to satisfy a
>> bunch of requirements. There is a lot of code in the kernel that
>> assumes that it’s sufficient to verify a module once at load time,
>> your patch changes that, and this has all kinds of nasty interactions
>> with autoloading.
>
>
> not true. you misread the patch and making incorrect conclusions.

So please tell my exactly how I misread the patch and why Linus'
conclusion (which is what I'm echoing) is wrong.

>
> I think you need to stop overreacting on non-issue.
>

Can you please try to have a constructive discussion here?  It's not
so enjoyable to review patches when author declares review comments to
be non-issues without actually explaining *why* they're non-issues.

Alexei, I'm willing to be shown that I'm wrong, but you have to show
me how I'm wrong rather than just telling me over and over that I'm
wrong.


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-09 Thread Alexei Starovoitov

On 3/9/18 7:16 AM, Andy Lutomirski wrote:

On Mar 8, 2018, at 9:08 PM, Alexei Starovoitov  wrote:

On 3/8/18 7:54 PM, Andy Lutomirski wrote:




On Mar 8, 2018, at 7:06 PM, Linus Torvalds  
wrote:


Honestly, that "read twice" thing may be what scuttles this.
Initially, I thought it was a non-issue, because anybody who controls
the module subdirectory enough to rewrite files would be in a position
to just execute the file itself directly instead.


On further consideration, I think there’s another showstopper. This patch is a 
potentially severe ABI break. Right now, loading a module *copies* it into 
memory and does not hold a reference to the underlying fs. With the patch 
applied, all kinds of use cases can break in gnarly ways. Initramfs is maybe 
okay, but initrd may be screwed. If you load an ET_EXEC module from initrd, 
then umount it, then clear the ramdisk, something will go horribly wrong. 
Exactly what goes wrong depends on whether userspace notices that umount() 
failed. Similarly, if you load one of these modules over a network and then 
lose your connection, you have a problem.


there is not abi breakage and file cannot disappear from running task.
One cannot umount fs while file is still being used.


Sure it is. Without your patch, init_module doesn’t keep using the
file, so it’s common practice to load a module and then delete or
unmount it. With your patch, the unmount case breaks. This is likely
to break existing userspace, so, in Linux speak it’s an ABI break.


please read the patch again.
file is only used in case of umh modules.
There is zero difference in default case.





The “read twice” thing is also bad for another reason: containers. Suppose I 
have a setup where a container can load a signed module blob. With the read 
twice code, the container can race and run an entirely different blob outside 
the container.


Not only "read twice", but "read many".
If .text sections of elf that are not yet in memory can be modified
by malicious user, later they will be brought in with different code.
I think the easiest fix to tighten this "umh modules" to CAP_SYS_ADMIN.


Given this issue, I think the patch would need Kees’s explicit ack. I
had initially thought your patch had minimal security impact, but I
was wrong Module security is very complicated and needs to satisfy a
bunch of requirements. There is a lot of code in the kernel that
assumes that it’s sufficient to verify a module once at load time,
your patch changes that, and this has all kinds of nasty interactions
with autoloading.


not true. you misread the patch and making incorrect conclusions.


Kees is very reasonable, and he’ll change his mind
and ack a patch that he’s nacked when presented with a valid technical
argument.

But I think my ABI break observation is also a major problem, and
Linus is going to be pissed if this thing lands in his tree and breaks
systems due to an issue that was raised during review. So I think you
need to either rework the patch or do a serious survey of how all the


I think you need to stop overreacting on non-issue.


distros deal with modules (dracut, initramfs-tools, all the older
stuff, and probably more) and make sure they can all handle your
patch.  I'd also be concerned about anyone who puts /lib/modules on
less reliable storage than they use for the rest of their system.  (I
know it's quite common to have /boot be the only non-RAID partition on
a system, but modules don't generally live in /boot.)

Also, I think you really ought to explain how your approach will work
with MODULES=n or convince Linus that it’s okay to start adding core
networking features that don’t work with MODULES=n and can't be built
into the main kernel image.


that ship sailed long ago.
config BPF_JIT
bool "enable BPF Just In Time compiler"
depends on HAVE_CBPF_JIT || HAVE_EBPF_JIT
depends on MODULES




Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-09 Thread Andy Lutomirski
>> On Mar 8, 2018, at 9:08 PM, Alexei Starovoitov  wrote:
>>
>> On 3/8/18 7:54 PM, Andy Lutomirski wrote:
>>
>>
>>
>>> On Mar 8, 2018, at 7:06 PM, Linus Torvalds  
>>> wrote:
>>>
>>>
>>> Honestly, that "read twice" thing may be what scuttles this.
>>> Initially, I thought it was a non-issue, because anybody who controls
>>> the module subdirectory enough to rewrite files would be in a position
>>> to just execute the file itself directly instead.
>>
>> On further consideration, I think there’s another showstopper. This patch is 
>> a potentially severe ABI break. Right now, loading a module *copies* it into 
>> memory and does not hold a reference to the underlying fs. With the patch 
>> applied, all kinds of use cases can break in gnarly ways. Initramfs is maybe 
>> okay, but initrd may be screwed. If you load an ET_EXEC module from initrd, 
>> then umount it, then clear the ramdisk, something will go horribly wrong. 
>> Exactly what goes wrong depends on whether userspace notices that umount() 
>> failed. Similarly, if you load one of these modules over a network and then 
>> lose your connection, you have a problem.
>
> there is not abi breakage and file cannot disappear from running task.
> One cannot umount fs while file is still being used.

Sure it is. Without your patch, init_module doesn’t keep using the
file, so it’s common practice to load a module and then delete or
unmount it. With your patch, the unmount case breaks. This is likely
to break existing userspace, so, in Linux speak it’s an ABI break.

>
>>
>> The “read twice” thing is also bad for another reason: containers. Suppose I 
>> have a setup where a container can load a signed module blob. With the read 
>> twice code, the container can race and run an entirely different blob 
>> outside the container.
>
> Not only "read twice", but "read many".
> If .text sections of elf that are not yet in memory can be modified
> by malicious user, later they will be brought in with different code.
> I think the easiest fix to tighten this "umh modules" to CAP_SYS_ADMIN.

Given this issue, I think the patch would need Kees’s explicit ack. I
had initially thought your patch had minimal security impact, but I
was wrong Module security is very complicated and needs to satisfy a
bunch of requirements. There is a lot of code in the kernel that
assumes that it’s sufficient to verify a module once at load time,
your patch changes that, and this has all kinds of nasty interactions
with autoloading.  Kees is very reasonable, and he’ll change his mind
and ack a patch that he’s nacked when presented with a valid technical
argument.

But I think my ABI break observation is also a major problem, and
Linus is going to be pissed if this thing lands in his tree and breaks
systems due to an issue that was raised during review. So I think you
need to either rework the patch or do a serious survey of how all the
distros deal with modules (dracut, initramfs-tools, all the older
stuff, and probably more) and make sure they can all handle your
patch.  I'd also be concerned about anyone who puts /lib/modules on
less reliable storage than they use for the rest of their system.  (I
know it's quite common to have /boot be the only non-RAID partition on
a system, but modules don't generally live in /boot.)

Also, I think you really ought to explain how your approach will work
with MODULES=n or convince Linus that it’s okay to start adding core
networking features that don’t work with MODULES=n and can't be built
into the main kernel image.


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Alexei Starovoitov

On 3/8/18 7:54 PM, Andy Lutomirski wrote:





On Mar 8, 2018, at 7:06 PM, Linus Torvalds  
wrote:


Honestly, that "read twice" thing may be what scuttles this.
Initially, I thought it was a non-issue, because anybody who controls
the module subdirectory enough to rewrite files would be in a position
to just execute the file itself directly instead.



On further consideration, I think there’s another showstopper. This patch is a 
potentially severe ABI break. Right now, loading a module *copies* it into 
memory and does not hold a reference to the underlying fs. With the patch 
applied, all kinds of use cases can break in gnarly ways. Initramfs is maybe 
okay, but initrd may be screwed. If you load an ET_EXEC module from initrd, 
then umount it, then clear the ramdisk, something will go horribly wrong. 
Exactly what goes wrong depends on whether userspace notices that umount() 
failed. Similarly, if you load one of these modules over a network and then 
lose your connection, you have a problem.


there is not abi breakage and file cannot disappear from running task.
One cannot umount fs while file is still being used.



The “read twice” thing is also bad for another reason: containers. Suppose I 
have a setup where a container can load a signed module blob. With the read 
twice code, the container can race and run an entirely different blob outside 
the container.


Not only "read twice", but "read many".
If .text sections of elf that are not yet in memory can be modified
by malicious user, later they will be brought in with different code.
I think the easiest fix to tighten this "umh modules" to CAP_SYS_ADMIN.



Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Andy Lutomirski



> On Mar 8, 2018, at 7:06 PM, Linus Torvalds  
> wrote:
> 
> 
> Honestly, that "read twice" thing may be what scuttles this.
> Initially, I thought it was a non-issue, because anybody who controls
> the module subdirectory enough to rewrite files would be in a position
> to just execute the file itself directly instead.
> 

On further consideration, I think there’s another showstopper. This patch is a 
potentially severe ABI break. Right now, loading a module *copies* it into 
memory and does not hold a reference to the underlying fs. With the patch 
applied, all kinds of use cases can break in gnarly ways. Initramfs is maybe 
okay, but initrd may be screwed. If you load an ET_EXEC module from initrd, 
then umount it, then clear the ramdisk, something will go horribly wrong. 
Exactly what goes wrong depends on whether userspace notices that umount() 
failed. Similarly, if you load one of these modules over a network and then 
lose your connection, you have a problem. 


The “read twice” thing is also bad for another reason: containers. Suppose I 
have a setup where a container can load a signed module blob. With the read 
twice code, the container can race and run an entirely different blob outside 
the container. 

Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Alexei Starovoitov
On Fri, Mar 09, 2018 at 02:12:24AM +, Andy Lutomirski wrote:
> On Fri, Mar 9, 2018 at 1:20 AM, Alexei Starovoitov
>  wrote:
> > On Fri, Mar 09, 2018 at 12:59:36AM +, Andy Lutomirski wrote:
> >>
> >> Alexei, can you give an example use case?  I'm sure it's upthread
> >> somewhere, but I'm having trouble finding it.
> >
> > at the time of iptable's setsockopt() the kernel will do
> > err = request_module("bpfilter");
> > once.
> > The rough POC code:
> > https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/tree/net/ipv4/bpfilter/sockopt.c?h=ipt_bpf#n25
> 
> Here's what I gather from reading that code: you have a new kernel
> feature (consisting of actual kernel code) that wants to defer some of
> its implementation to user mode.  I like this idea a lot.  But I have
> a suggestion for a slightly different way of accomplishing the same
> thing.  Rather than extending init_module() to accept ELF input,
> except the call_umh code to be able to call blobs.  You'd use it it
> very roughly like this:
> 
> First, compile your user code and emit a staitc binary.  Use objdump
> fiddling or a trivial .S file to make that static binary into a
> variable.  Then write a tiny shim module like this:
> 
> extern unsigned char __begin_user_code[], __end_user_code[];
> 
> int __init init_shim_module(void)
> {
>   return call_umh_blob(__begin_user_code, __end_user_code - 
> __begin_user_code);
> }
> 
> By itself, this is clearly a worse solution than yours, but it has two
> benefits, one small and two big.  The small benefit is that it is
> completely invisible to userspace: the .ko file is a bona fide module.

Unfortunately it's not quite the case.
The normal .ko that does call_umh_blob is indeed seen in lsmod,
but the umh process is a separate task.
It could have been oomed or killed by admin and
this normal .ko wouldn't notice it, so health check of umh process
by the kernel is needed regardless.
Right now bpfilter has trivial fuse-like protocol. This part
is still to be designed cleanly.

No doubt that visibility and debuggability into this umh processes
is must have, but lsmod/rmmod interface doesn't quite fit.
As you said letting this priv tasks register themselves in lsmod
is certainly no-go.
I think if they will be in lsmod, kernel has to register them
and establish health check with umh at the same time.
I think worrying about restarting is not necessary.
This is still kernel code with the same high standards and
review process. If they crash it's really a kernel bug.
It only doesn't take the system down.

> I think we don't want to end up in a situation where we ship a program
> with a .ko extension that opens something in /dev, for example.

this part I don't get. What's wrong with open of /dev ?
I don't see a use case for it, but technically why not?

> call_umh_blob() would create an anon_inode or similar object backed by
> the blob and exec it.

Interesting. I haven't considered such approach.

For full context it all started from the idea of 'unprivileged kernel modules'
or 'hardened kernel modules'. Something that kernel can easily
interact with, but much safer than traditional kernel module.

I've tried a bunch of crappy ideas first:
1. have a piece of kernel .text vm_mmap-ed into user process that
doing iptables setsockopt and on return to user space force handle_signal
to execute that code. Sort of like forced ld_preload where parasite
code is provided by the kernel but runs in user space
2. have a special set of kernel page tables in read-only mode while
iptable->bpf conversion is happening
3. have load_module() fork a user task and load real kernel .ko into it

trying to hack #3 realized that I'm mainly copy-pasting a lot of
load_elf_binary() code without elf_interpreter bits,
so figured it's much easier and simpler to blend sys_finit_module
with load_elf_binary via tweaking do_execveat_common and keeping
that .ko as normal elf which is implemented in this patch.
Debugging of that fake .ko is so much better.
If it's done via call_umh_blob() the process that starts
will indeed be a user mode process, but you won't be able to
attach gdb to it. Whereas in this patch it's normal elf and
standard debugging techniques apply. A developer can
do gdb breakpoints, debug info, etc.
That is huge advantage of keeping it as normal elf.



Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Linus Torvalds
On Thu, Mar 8, 2018 at 7:06 PM, Linus Torvalds
 wrote:
>
> So I don't like Andy's "let's make it a kernel module and then that
> kernel module can execve() a blob". THAT seems like just stupid
> indirection.
>
> But I do like Andy's "execve a blob" part, because it is the *blob*
> that has had its signature verified, not the file!

To be honest., Andy's approach has the advantage that all the
synchronization we do for modules still works.

In particular, we have module loading logic where we synchronize
loading of modules with the same name, so that two things that do
request_module() concurrently will not lead to two modules being
loaded.

See that whole "module_mutex" thing, and the logic in (for example)
"add_unformed_module()".

Andrei's patch short-circuits the module loading before that, so if
you have two concurrent request_module() calls, you'll end up with no
serialization, and two executables.

So maybe Andy is right. He often is, after all.

 Linus


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Andy Lutomirski



> On Mar 8, 2018, at 6:31 PM, David Miller  wrote:
> 
> From: Andy Lutomirski 
> Date: Fri, 9 Mar 2018 02:12:24 +
> 
>> First, compile your user code and emit a staitc binary.  Use objdump
>> fiddling or a trivial .S file to make that static binary into a
>> variable.  Then write a tiny shim module like this:
>> 
>> extern unsigned char __begin_user_code[], __end_user_code[];
>> 
>> int __init init_shim_module(void)
>> {
>>  return call_umh_blob(__begin_user_code, __end_user_code - 
>> __begin_user_code);
>> }
>> 
>> By itself, this is clearly a worse solution than yours, but it has two
>> benefits, one small and two big.  The small benefit is that it is
>> completely invisible to userspace: the .ko file is a bona fide module.
> 
> Anything you try to do which makes these binaries "special" is a huge
> negative.

I don’t know what you mean.  Alexei’s approach introduces a whole new kind of 
special module.  Mine doesn’t. 

> 
>> The big benefits are:
> 
> I don't see those things as benefits at all, and Alexei's scheme can
> easily be made to work in your benefit #1 case too.
> 

How?  I think you’ll find that a non-modular implementation of a bundled ELF 
binary looks a *lot* like my call_umh_blob().

> It's a user binary.  It's shipped with the kernel and it's signed.
> 
> If we can't trust that, we can't trust much else.

I’m not making any arguments about security at all. I’m talking about 
functionality. 

If we apply Alexei’s patch as is, then I think we’ll have a situation where 
ET_EXEC modules are only useful if they can do their jobs without any 
filesystem access at all.  This is fine for networking, where netlink sockets 
are used, but I think it’s not so great for other use cases. If we ever try to 
stick a usb driver into userspace, we’re going to want to instantiate the user 
task once per device, passed as stdin or similar, and Alexei’s code will make 
that very awkward.



Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Linus Torvalds
On Thu, Mar 8, 2018 at 5:44 PM, Kees Cook  wrote:
>
> My concerns are mostly about crossing namespaces. If a container
> triggers an autoload, the result runs in the init_ns.

Heh. I saw that as an advantage. It's basically the same semantics as
a normal module load does - in that the "kernel namespace" is init_ns.

My own personal worry is actually different - we do check the
signature of the file we're loading, but we're then passing it off to
execve() not as the image we loaded, but as the file pointer.

So the execve() will end up not using the actual buffer we checked the
signature on, but instead just re-reading the file.

Now, that has two issues:

 (a) it means that 'init_module' doesn't work, only 'finit_module'.

Not a big deal, but I do think that we should fail the 'info->file
== NULL' case explicitly (instead of failing when it does an execve()
of /dev/null, which is what I think it does now - but that's just from
the reading the patch, not from testing it).

 (b) somebody could maybe try to time it and modify the file
after-the-fact of the signature check, and then we execute something
else.

Honestly, that "read twice" thing may be what scuttles this.
Initially, I thought it was a non-issue, because anybody who controls
the module subdirectory enough to rewrite files would be in a position
to just execute the file itself directly instead.

But it turns out that isn't needed. Some bad actor could just do
"finit_module()" with a file that they just *copied* from the module
directory.

Yes, yes, they still need CAP_SYS_MODULE to even get that far, but it
does worry me.

So this does seem to turn "CAP_SYS_MODULE" into meaning "can run any
executable as root in the init namespace". They don't have to have the
module signing key that I thought would protect us, because they can
do that "rewrite after signature check".

So that does seem like a bad problem with the approach.

So I don't like Andy's "let's make it a kernel module and then that
kernel module can execve() a blob". THAT seems like just stupid
indirection.

But I do like Andy's "execve a blob" part, because it is the *blob*
that has had its signature verified, not the file!

 Linus


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread David Miller
From: Andy Lutomirski 
Date: Fri, 9 Mar 2018 02:12:24 +

> First, compile your user code and emit a staitc binary.  Use objdump
> fiddling or a trivial .S file to make that static binary into a
> variable.  Then write a tiny shim module like this:
> 
> extern unsigned char __begin_user_code[], __end_user_code[];
> 
> int __init init_shim_module(void)
> {
>   return call_umh_blob(__begin_user_code, __end_user_code - 
> __begin_user_code);
> }
> 
> By itself, this is clearly a worse solution than yours, but it has two
> benefits, one small and two big.  The small benefit is that it is
> completely invisible to userspace: the .ko file is a bona fide module.

Anything you try to do which makes these binaries "special" is a huge
negative.

> The big benefits are:

I don't see those things as benefits at all, and Alexei's scheme can
easily be made to work in your benefit #1 case too.

It's a user binary.  It's shipped with the kernel and it's signed.

If we can't trust that, we can't trust much else.

And this whole container argument..  It's a mirage.

Kernel modules are 1000 times worse, since they can access any
container and any namespace they want.


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Andy Lutomirski
On Fri, Mar 9, 2018 at 1:20 AM, Alexei Starovoitov
 wrote:
> On Fri, Mar 09, 2018 at 12:59:36AM +, Andy Lutomirski wrote:
>>
>> Alexei, can you give an example use case?  I'm sure it's upthread
>> somewhere, but I'm having trouble finding it.
>
> at the time of iptable's setsockopt() the kernel will do
> err = request_module("bpfilter");
> once.
> The rough POC code:
> https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/tree/net/ipv4/bpfilter/sockopt.c?h=ipt_bpf#n25

Here's what I gather from reading that code: you have a new kernel
feature (consisting of actual kernel code) that wants to defer some of
its implementation to user mode.  I like this idea a lot.  But I have
a suggestion for a slightly different way of accomplishing the same
thing.  Rather than extending init_module() to accept ELF input,
except the call_umh code to be able to call blobs.  You'd use it it
very roughly like this:

First, compile your user code and emit a staitc binary.  Use objdump
fiddling or a trivial .S file to make that static binary into a
variable.  Then write a tiny shim module like this:

extern unsigned char __begin_user_code[], __end_user_code[];

int __init init_shim_module(void)
{
  return call_umh_blob(__begin_user_code, __end_user_code - __begin_user_code);
}

By itself, this is clearly a worse solution than yours, but it has two
benefits, one small and two big.  The small benefit is that it is
completely invisible to userspace: the .ko file is a bona fide module.
The big benefits are:

1. It works even in a non-modular kernel!  (Okay, it probably only
works if you can arrange for the built-in module to be initialized
late enough, but that's straightforward.)

2. It allows future extensions to change the way the glue works.  For
example, maybe you want the module to integrate properly with lsmod,
etc.  Rather than adding a mechanism for general privileged programs
to register themselves with lsmod (ick!), you could do it entirely in
the kernel where lsmod would know that a particular umh task is
special.  More usefully, you could extend call_umh_blob() to pass in
some pre-initialized struct files, which would give a clean way to
*synchronously* create a communication channel to user code for
whatever service the user code provides.  And it would be more
straightforward to make the umh blob do what it needs to do without
relying on any particular filesystems being mounted.

I think we don't want to end up in a situation where we ship a program
with a .ko extension that opens something in /dev, for example.

call_umh_blob() would create an anon_inode or similar object backed by
the blob and exec it.


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Luis R. Rodriguez
On Thu, Mar 08, 2018 at 03:07:01PM -0800, Alexei Starovoitov wrote:
> On 3/7/18 5:23 PM, Luis R. Rodriguez wrote:
> > 
> > request_module() has its own world though too. How often in your proof of
> > concept is request_module() called? How many times do you envision it being
> > called?
> 
> once.

What about other users later in the kernel?

> > > +static int run_umh(struct file *file)
> > > +{
> > > + struct subprocess_info *sub_info = call_usermodehelper_setup_file(file);
> > > +
> > > + if (!sub_info)
> > > + return -ENOMEM;
> > > + return call_usermodehelper_exec(sub_info, UMH_WAIT_EXEC);
> > > +}
> > 
> > run_umh() calls the program and waits. Note that while we are running a UMH 
> > we
> > can't suspend. What if they take forever, who is hosing them down with an
> > equivalent:
> > 
> > schedule();
> > try_to_freeze();
> > 
> > Say they are buggy and never return, will they stall suspend, have you 
> > tried?
> > 
> > call_usermodehelper_exec() uses helper_lock() which in turn is used for
> > umh's accounting for number of running umh's. One of the sad obscure uses
> > for this is to deal with suspend/resume. Refer to __usermodehelper* calls
> > on kernel/power/process.c
> > 
> > Note how you use UMH_WAIT_EXEC too, so this is all synchronous.
> 
> looks like you misread this code
>
> #define UMH_NO_WAIT 0   /* don't wait at all */
> #define UMH_WAIT_EXEC   1   /* wait for the exec, but not the process */
> #define UMH_WAIT_PROC   2   /* wait for the process to complete */
> #define UMH_KILLABLE4   /* wait for EXEC/PROC killable */

I certainly did get the incorrect impression this was a sync op, sorry
about that. In that case its odd to see a request_module() call, when
what is really meant is more of a request_module_nowait(), its also
UMH_NO_WAIT on the modprobe call itself as well.

In fact I think I'd much prefer at the very least to see a different wrapper
for this, even if its using the same bolts and nuts underneath for userspace
processes compiled on the kernel. The sanity checks in place for
request_module() may change later and this use case seems rather simple and
limited. Keeping tabs on this type of users seems desirable. At the *very
least*

diff --git a/include/linux/kmod.h b/include/linux/kmod.h
index 40c89ad4bea6..7530e00da03b 100644
--- a/include/linux/kmod.h
+++ b/include/linux/kmod.h
@@ -37,11 +37,13 @@ extern __printf(2, 3)
 int __request_module(bool wait, const char *name, ...);
 #define request_module(mod...) __request_module(true, mod)
 #define request_module_nowait(mod...) __request_module(false, mod)
+#define request_umh_proc(mod...) __request_module(false, mod)
 #define try_then_request_module(x, mod...) \
((x) ?: (__request_module(true, mod), (x)))
 #else
 static inline int request_module(const char *name, ...) { return -ENOSYS; }
 static inline int request_module_nowait(const char *name, ...) { return 
-ENOSYS; }
+static inline int request_umh_proc(const char *name, ...) { return -ENOSYS; }
 #define try_then_request_module(x, mod...) (x)
 #endif

Modulo, kernel/umh.c is already its own thing, so pick a name that helps
identify these things clearly.

> and the rest of your concerns with suspend/resume are not applicable any
> more.

Agreed, except it does still mean these userspace processes are limited to
execution only the kernel is up, and not going to suspend, but I think that's
a proper sanity check by the umh API.

> bpftiler.ko is started once and runs non-stop from there on.
> Unless it crashes, but it's a different discussion.

Sure.

  Luis


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Kees Cook
On Thu, Mar 8, 2018 at 5:38 PM, Linus Torvalds
 wrote:
> On Thu, Mar 8, 2018 at 4:59 PM, Andy Lutomirski  wrote:
>>
>> Also, I don't see how this is any more exploitable than any other
>> init_module().
>
> Absolutely. If Kees doesn't trust the files to be loaded, an
> executable - even if it's running with root privileges and in the
> initns - is still fundamentally weaker than a kernel module.
>
> So I don't understand the security argument AT ALL. It's nonsensical.
> The executable loading does all the same security checks that the
> module loading does, including the signing check.
>
> And the whole point is that we can now do things with building and
> loading a ebpf rule instead of having a full module.

My concerns are mostly about crossing namespaces. If a container
triggers an autoload, the result runs in the init_ns. So, really,
there's nothing new from my perspective, except that it's in userspace
instead of in the kernel.

Perhaps it's an orthogonal concern.

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Linus Torvalds
On Thu, Mar 8, 2018 at 4:59 PM, Andy Lutomirski  wrote:
>
> Also, I don't see how this is any more exploitable than any other
> init_module().

Absolutely. If Kees doesn't trust the files to be loaded, an
executable - even if it's running with root privileges and in the
initns - is still fundamentally weaker than a kernel module.

So I don't understand the security argument AT ALL. It's nonsensical.
The executable loading does all the same security checks that the
module loading does, including the signing check.

And the whole point is that we can now do things with building and
loading a ebpf rule instead of having a full module.

 Linus


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Alexei Starovoitov
On Fri, Mar 09, 2018 at 01:04:39AM +, Andy Lutomirski wrote:
> On Fri, Mar 9, 2018 at 12:57 AM, Alexei Starovoitov  wrote:
> > On 3/8/18 4:24 PM, Kees Cook wrote:
> >>
> >> As Andy asked earlier, why not DYN too to catch PIE executables? Seems
> >> like forcing the userspace helper to be non-PIE would defeat some of
> >> the userspace defenses in use in most distros.
> >
> >
> > because we don't add features without concrete users.
> 
> I disagree here.  If you're going to add a magic trick that triggers
> an execve(), then I think you should either support *both* standard,
> widely-used types of ELF programs or you should give a compelling use
> case that works for ET_EXEC but not for ET_DYN.  Keep in mind that
> many distros have a very strong preference for ET_DYN.

misunderstanding here is that this bpfiler.ko is part of _kernel build_.
Kernel decides how it's build.
Nothing to do with distros.
Current Makefile is very dumb and it's built with HOSTCC:
https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/tree/net/bpfilter/Makefile?h=ipt_bpf
but it will be standalone with CC before final to make sure crosscompiling 
works.



Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Kees Cook
On Thu, Mar 8, 2018 at 4:57 PM, Alexei Starovoitov  wrote:
> The above are three paragraphs of security paranoia without single
> concrete example of a security issue.

How is running an arbitrary ELF as full init_ns root from a container
not a concrete example?

I'm not saying this approach can never happen. And this isn't
paranoia. There are very real security boundary violations in this
model, IMO. Though, as Andy says, it's more about autoloading than umh
itself. I just don't want to extend that problem further.

>> As Andy asked earlier, why not DYN too to catch PIE executables? Seems
>> like forcing the userspace helper to be non-PIE would defeat some of
>> the userspace defenses in use in most distros.
>
>
> because we don't add features without concrete users.

It's just a two-line change, and then it would work on distros that
build PIE-by-default. That seems like a concrete use-case.

>> The exec.c changes should be split into a separate patch. Something
>> feels incorrectly refactored here, though. Passing all three of fd,
>> filename, and file to __do_execve_file() seems wrong; perhaps the fd
>> to file handling needs to happen externally in what you have here as
>> do_execveat_common()? The resulting __do_execve_file() has repeated
>> conditionals on filename... maybe what I object to is being able to
>> pass a NULL filename at all. The filename can be (painfully)
>> reconstructed from the struct file.
>
> reconstruct the path and instantly introduce the race between execing
> something by path vs prior check that it's actual elf of already
> opened file ?!
> excellent suggestion to improve security.

I'm not sure why you're being so hostile. We're both interested in
improving the kernel.

Path names aren't stable, but they provide good _debugging_ about
things when needed. For example, the LSM hooks, if they report on one
of these loads, can produce a hint to the user about what happened.
Passing "/dev/null" around is confusing at the very least. The ELF is
absolutely not /dev/null. Why make someone guess?

>>> [...]
>>> diff --git a/kernel/module.c b/kernel/module.c
>>> index ad2d420024f6..6cfa35795741 100644
>>> --- a/kernel/module.c
>>> +++ b/kernel/module.c
>>> [...]
>>> @@ -3870,14 +3896,21 @@ SYSCALL_DEFINE3(finit_module, int, fd, const char
>>> __user *, uargs, int, flags)
>>>   |MODULE_INIT_IGNORE_VERMAGIC))
>>> return -EINVAL;
>>>
>>> -   err = kernel_read_file_from_fd(fd, , , INT_MAX,
>>> -  READING_MODULE);
>>> +   f = fdget(fd);
>>> +   if (!f.file)
>>> +   return -EBADF;
>>> +
>>> +   err = kernel_read_file(f.file, , , INT_MAX,
>>> READING_MODULE);
>>
>>
>> For the LSM subsystem, I think this should also get it's own enum
>> kernel_read_file_id. This is really no longer a kernel module...
>
>
> at this point it's a _file_. It could have been text file just well.
> If lsm is thinking that at this point kernel is processing
> kernel module that lsm is badly broken.

Again, this is about making things more discoverable. We already know
if we're loading a kernel module or a umh ELF. Passing this
information to the LSM is valuable to the LSM to distinguish between
types of files. They have very different effects on the system, for
example. The issue is about validating intent with target. "Is this
kernel module allowed?" and "Is this umh ELF allowed?" are better
questions that "Is something loaded through finit_module() allowed?"
Note already that the LSM distinguishes between many other file types
in kernel_read_file*().

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Alexei Starovoitov
On Fri, Mar 09, 2018 at 12:59:36AM +, Andy Lutomirski wrote:
> 
> Alexei, can you give an example use case?  I'm sure it's upthread
> somewhere, but I'm having trouble finding it.

at the time of iptable's setsockopt() the kernel will do
err = request_module("bpfilter");
once.
The rough POC code:
https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/tree/net/ipv4/bpfilter/sockopt.c?h=ipt_bpf#n25

> Also, I just tested this concept a bit.  Depmod invoked explicitly on
> an ET_EXEC with a.ko extension gets mad, but depmod -a on a kernel
> that has a "module" like that seems to work fine.  Go figure.

right. that's with the current patch.
In v2 I require .modinfo section to make sure license is specified,
but depmod still not very happy:
$ depmod /lib/modules/`uname -r`/kernel/net/bpfilter/bpfilter.ko
depmod: ERROR: Bad version passed 
/lib/modules/4.16.0-rc4-00799-g1716f0aa3039-dirty/kernel/net/bpfilter/bpfilter.ko
I'm not sure it's worth to silence it, since
as you noticed 'depmod -a' works.



Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Andy Lutomirski
On Fri, Mar 9, 2018 at 12:57 AM, Alexei Starovoitov  wrote:
> On 3/8/18 4:24 PM, Kees Cook wrote:
>>
>> As Andy asked earlier, why not DYN too to catch PIE executables? Seems
>> like forcing the userspace helper to be non-PIE would defeat some of
>> the userspace defenses in use in most distros.
>
>
> because we don't add features without concrete users.

I disagree here.  If you're going to add a magic trick that triggers
an execve(), then I think you should either support *both* standard,
widely-used types of ELF programs or you should give a compelling use
case that works for ET_EXEC but not for ET_DYN.  Keep in mind that
many distros have a very strong preference for ET_DYN.

Or you could argue that ET_DYN requires tooling changes, but I think
it's awkward to ask the tooling to change in advance of the kernel
being willing to actually invoke the thing.  I'm not actually
convinced that any tooling changes would be needed, though.


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Andy Lutomirski
On Fri, Mar 9, 2018 at 12:24 AM, Kees Cook  wrote:
> How is this not marked [RFC]? :)
>
> On Mon, Mar 5, 2018 at 5:34 PM, Alexei Starovoitov  wrote:
>> As the first step in development of bpfilter project [1] the request_module()
>> code is extended to allow user mode helpers to be invoked. Idea is that
>> user mode helpers are built as part of the kernel build and installed as
>> traditional kernel modules with .ko file extension into distro specified
>> location, such that from a distribution point of view, they are no different
>> than regular kernel modules. Thus, allow request_module() logic to load such
>> user mode helper (umh) modules via:
>>
>>   request_module("foo") ->
>> call_umh("modprobe foo") ->
>>   sys_finit_module(FD of /lib/modules/.../foo.ko) ->
>> call_umh(struct file)
>
> Yikes. This means autoloaded artbitrary exec() now, instead of
> autoloading just loading arbitrary modules? That seems extremely bad.
> This just extends all the problems we've had with defining security
> boundaries with modules out to umh too. I would need some major
> convincing that this can be made safe.
>
> It's easy for container to trigger arbitrary module loading, so if it
> can find/use a similar one of the bugs we've seen in the past to
> redirect the module loading we don't just get new module-created
> attack surface added to the kernel, but we again get arbitrary ELF
> running in the root ns (not in the container). And in the insane case
> of a container with CAP_SYS_MODULE, this is a trivial escape without
> even needing to build a special kernel module: the root ns, running
> with all privileges runs an arbitrary ELF.
>
> As it stands, I have to NAK this. At the very least, you need to solve
> the execution environment problems here: the ELF should run with no
> greater privileges than what loaded the module, and very importantly,
> must not be allowed to bypass these checks through autoloading. What
> _triggered_ the autoload must be the environment, not the "modprobe",
> since that's running with full privileges. Basically, we need to fix
> module autoloading first, or at least a significant subset of the
> problems: https://lkml.org/lkml/2017/11/27/754

I disagree.  If we're going to somehow have ELF binaries that pretend
to be modules, then they should run with high privilege and, more
importantly, should run in a context independent of whoever triggered
autoloading.

Also, I don't see how this is any more exploitable than any other
init_module().  If someone has CAP_SYS_MODULE or autoload privileges
and they can trigger init_module() on a file, then they're granting
maximum privilege to the contents of that file.  So who cares if that
file is a kernel module or an ELF binary?

Alexei, can you give an example use case?  I'm sure it's upthread
somewhere, but I'm having trouble finding it.

Also, I just tested this concept a bit.  Depmod invoked explicitly on
an ET_EXEC with a.ko extension gets mad, but depmod -a on a kernel
that has a "module" like that seems to work fine.  Go figure.


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Alexei Starovoitov

On 3/8/18 4:24 PM, Kees Cook wrote:

How is this not marked [RFC]? :)

On Mon, Mar 5, 2018 at 5:34 PM, Alexei Starovoitov  wrote:

As the first step in development of bpfilter project [1] the request_module()
code is extended to allow user mode helpers to be invoked. Idea is that
user mode helpers are built as part of the kernel build and installed as
traditional kernel modules with .ko file extension into distro specified
location, such that from a distribution point of view, they are no different
than regular kernel modules. Thus, allow request_module() logic to load such
user mode helper (umh) modules via:

  request_module("foo") ->
call_umh("modprobe foo") ->
  sys_finit_module(FD of /lib/modules/.../foo.ko) ->
call_umh(struct file)


Yikes. This means autoloaded artbitrary exec() now, instead of
autoloading just loading arbitrary modules? That seems extremely bad.
This just extends all the problems we've had with defining security
boundaries with modules out to umh too. I would need some major
convincing that this can be made safe.

It's easy for container to trigger arbitrary module loading, so if it
can find/use a similar one of the bugs we've seen in the past to
redirect the module loading we don't just get new module-created
attack surface added to the kernel, but we again get arbitrary ELF
running in the root ns (not in the container). And in the insane case
of a container with CAP_SYS_MODULE, this is a trivial escape without
even needing to build a special kernel module: the root ns, running
with all privileges runs an arbitrary ELF.

As it stands, I have to NAK this. At the very least, you need to solve
the execution environment problems here: the ELF should run with no
greater privileges than what loaded the module, and very importantly,
must not be allowed to bypass these checks through autoloading. What
_triggered_ the autoload must be the environment, not the "modprobe",
since that's running with full privileges. Basically, we need to fix
module autoloading first, or at least a significant subset of the
problems: https://lkml.org/lkml/2017/11/27/754


The above are three paragraphs of security paranoia without single
concrete example of a security issue.


Such approach enables kernel to delegate functionality traditionally done
by kernel modules into user space processes (either root or !root) and
reduces security attack surface of such new code, meaning in case of
potential bugs only the umh would crash but not the kernel. Another
advantage coming with that would be that bpfilter.ko can be debugged and
tested out of user space as well (e.g. opening the possibility to run
all clang sanitizers, fuzzers or test suites for checking translation).
Also, such architecture makes the kernel/user boundary very precise:
control plane is done by the user space while data plane stays in the kernel.

It's easy to distinguish "umh module" from traditional kernel module:

$ readelf -h bld/net/bpfilter/bpfilter.ko|grep Type
  Type:  EXEC (Executable file)


As Andy asked earlier, why not DYN too to catch PIE executables? Seems
like forcing the userspace helper to be non-PIE would defeat some of
the userspace defenses in use in most distros.


because we don't add features without concrete users.


$ readelf -h bld/net/ipv4/netfilter/iptable_filter.ko |grep Type
  Type:  REL (Relocatable file)

Since umh can crash, can be oom-ed by the kernel, killed by admin,
the subsystem that uses them (like bpfilter) need to manage life
time of umh on its own, so module infra doesn't do any accounting
of them. They don't appear in "lsmod" and cannot be "rmmod".
Multiple request_module("umh") will load multiple umh.ko processes.

Similar to kernel modules the kernel will be tainted if "umh module"
has invalid signature.

[1] 
https://urldefense.proofpoint.com/v2/url?u=https-3A__lwn.net_Articles_747551_=DwIBaQ=5VD0RTtNlTh3ycd41b3MUw=i6WobKxbeG3slzHSIOxTVtYIJw7qjCE6S0spDTKL-J4=pMlEM-qKOmGljadUKAdwKBpuYHQRXzApvSGkZFIT4Gg=0-vP6LH-GxXnL7LuV-KfMl1NbqsVJUINSygoVa9R6nk=

Signed-off-by: Alexei Starovoitov 
---
 fs/exec.c   | 40 +++-
 include/linux/binfmts.h |  1 +
 include/linux/umh.h |  3 +++
 kernel/module.c | 43 ++-
 kernel/umh.c| 24 +---
 5 files changed, 94 insertions(+), 17 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 7eb8d21bcab9..0483c438de7d 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1691,14 +1691,13 @@ static int exec_binprm(struct linux_binprm *bprm)
 /*
  * sys_execve() executes a new program.
  */
-static int do_execveat_common(int fd, struct filename *filename,
- struct user_arg_ptr argv,
- struct user_arg_ptr envp,
- int flags)
+static int __do_execve_file(int fd, struct filename 

Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Kees Cook
How is this not marked [RFC]? :)

On Mon, Mar 5, 2018 at 5:34 PM, Alexei Starovoitov  wrote:
> As the first step in development of bpfilter project [1] the request_module()
> code is extended to allow user mode helpers to be invoked. Idea is that
> user mode helpers are built as part of the kernel build and installed as
> traditional kernel modules with .ko file extension into distro specified
> location, such that from a distribution point of view, they are no different
> than regular kernel modules. Thus, allow request_module() logic to load such
> user mode helper (umh) modules via:
>
>   request_module("foo") ->
> call_umh("modprobe foo") ->
>   sys_finit_module(FD of /lib/modules/.../foo.ko) ->
> call_umh(struct file)

Yikes. This means autoloaded artbitrary exec() now, instead of
autoloading just loading arbitrary modules? That seems extremely bad.
This just extends all the problems we've had with defining security
boundaries with modules out to umh too. I would need some major
convincing that this can be made safe.

It's easy for container to trigger arbitrary module loading, so if it
can find/use a similar one of the bugs we've seen in the past to
redirect the module loading we don't just get new module-created
attack surface added to the kernel, but we again get arbitrary ELF
running in the root ns (not in the container). And in the insane case
of a container with CAP_SYS_MODULE, this is a trivial escape without
even needing to build a special kernel module: the root ns, running
with all privileges runs an arbitrary ELF.

As it stands, I have to NAK this. At the very least, you need to solve
the execution environment problems here: the ELF should run with no
greater privileges than what loaded the module, and very importantly,
must not be allowed to bypass these checks through autoloading. What
_triggered_ the autoload must be the environment, not the "modprobe",
since that's running with full privileges. Basically, we need to fix
module autoloading first, or at least a significant subset of the
problems: https://lkml.org/lkml/2017/11/27/754

> Such approach enables kernel to delegate functionality traditionally done
> by kernel modules into user space processes (either root or !root) and
> reduces security attack surface of such new code, meaning in case of
> potential bugs only the umh would crash but not the kernel. Another
> advantage coming with that would be that bpfilter.ko can be debugged and
> tested out of user space as well (e.g. opening the possibility to run
> all clang sanitizers, fuzzers or test suites for checking translation).
> Also, such architecture makes the kernel/user boundary very precise:
> control plane is done by the user space while data plane stays in the kernel.
>
> It's easy to distinguish "umh module" from traditional kernel module:
>
> $ readelf -h bld/net/bpfilter/bpfilter.ko|grep Type
>   Type:  EXEC (Executable file)

As Andy asked earlier, why not DYN too to catch PIE executables? Seems
like forcing the userspace helper to be non-PIE would defeat some of
the userspace defenses in use in most distros.

> $ readelf -h bld/net/ipv4/netfilter/iptable_filter.ko |grep Type
>   Type:  REL (Relocatable file)
>
> Since umh can crash, can be oom-ed by the kernel, killed by admin,
> the subsystem that uses them (like bpfilter) need to manage life
> time of umh on its own, so module infra doesn't do any accounting
> of them. They don't appear in "lsmod" and cannot be "rmmod".
> Multiple request_module("umh") will load multiple umh.ko processes.
>
> Similar to kernel modules the kernel will be tainted if "umh module"
> has invalid signature.
>
> [1] https://lwn.net/Articles/747551/
>
> Signed-off-by: Alexei Starovoitov 
> ---
>  fs/exec.c   | 40 +++-
>  include/linux/binfmts.h |  1 +
>  include/linux/umh.h |  3 +++
>  kernel/module.c | 43 ++-
>  kernel/umh.c| 24 +---
>  5 files changed, 94 insertions(+), 17 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 7eb8d21bcab9..0483c438de7d 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1691,14 +1691,13 @@ static int exec_binprm(struct linux_binprm *bprm)
>  /*
>   * sys_execve() executes a new program.
>   */
> -static int do_execveat_common(int fd, struct filename *filename,
> - struct user_arg_ptr argv,
> - struct user_arg_ptr envp,
> - int flags)
> +static int __do_execve_file(int fd, struct filename *filename,
> +   struct user_arg_ptr argv,
> +   struct user_arg_ptr envp,
> +   int flags, struct file *file)
>  {
> char *pathbuf = NULL;
> struct linux_binprm *bprm;
> -   struct file *file;
> struct 

Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Alexei Starovoitov

On 3/7/18 5:23 PM, Luis R. Rodriguez wrote:


request_module() has its own world though too. How often in your proof of
concept is request_module() called? How many times do you envision it being
called?


once.


+static int run_umh(struct file *file)
+{
+   struct subprocess_info *sub_info = call_usermodehelper_setup_file(file);
+
+   if (!sub_info)
+   return -ENOMEM;
+   return call_usermodehelper_exec(sub_info, UMH_WAIT_EXEC);
+}


run_umh() calls the program and waits. Note that while we are running a UMH we
can't suspend. What if they take forever, who is hosing them down with an
equivalent:

schedule();
try_to_freeze();

Say they are buggy and never return, will they stall suspend, have you tried?

call_usermodehelper_exec() uses helper_lock() which in turn is used for
umh's accounting for number of running umh's. One of the sad obscure uses
for this is to deal with suspend/resume. Refer to __usermodehelper* calls
on kernel/power/process.c

Note how you use UMH_WAIT_EXEC too, so this is all synchronous.


looks like you misread this code and the rest of your concerns
with suspend/resume are not applicable any more.

#define UMH_NO_WAIT 0   /* don't wait at all */
#define UMH_WAIT_EXEC   1   /* wait for the exec, but not the process */
#define UMH_WAIT_PROC   2   /* wait for the process to complete */
#define UMH_KILLABLE4   /* wait for EXEC/PROC killable */

bpftiler.ko is started once and runs non-stop from there on.
Unless it crashes, but it's a different discussion.


+   if (info->hdr->e_type == ET_EXEC) {
+#ifdef CONFIG_MODULE_SIG
+   if (!info->sig_ok) {
+   pr_notice_once("umh %s verification failed: signature and/or 
required key missing - tainting kernel\n",
+  info->file->f_path.dentry->d_name.name);
+   add_taint(TAINT_UNSIGNED_MODULE, LOCKDEP_STILL_OK);
+   }
+#endif


So I guess this check is done *after* run_umh() then, what about the enforce 
mode,
don't we want to reject loading at all in any circumstance?


already answered this twice in the thread.



Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-07 Thread Luis R. Rodriguez
On Mon, Mar 05, 2018 at 05:34:57PM -0800, Alexei Starovoitov wrote:
> As the first step in development of bpfilter project [1]

So meta :) The URL refers an lwn article, which in turn refers to this effort's
first RFC.  As someone only getting *one* of these patches in emails, It would
be useful if the cover letter referenced instead an optional git tree and
branch so one could easily get the patches for more careful inspection. In the
meantime, can you make such tree available with a branch?

Also, since kernel/module.c is involved it would be wise to include Jessica,
which I've Cc'd. I'm going to guess Kees may like to review this too. Mimi
might want to help review as well.

Rafael may care about suspend/resume implications of these "umh modules" as
you put it.

> the request_module()
> code is extended to allow user mode helpers to be invoked. 

Upon inspection you never touch or use request_module() so this is
false and misleading. You don't use or extend request_module() at all, you rely
on extending finit_module() so that load_module() itself will now execute (via
modified do_execve_file()) the same file which was loaded as an module.

This is *very* different given request_module() has its own full magic and is
in and of itself a UMH of the kernel implemented in kernel/kmod.c.

Nevertheless, why?

> Idea is that
> user mode helpers are built as part of the kernel build and installed as
> traditional kernel modules with .ko file extension into distro specified
> location, such that from a distribution point of view, they are no different
> than regular kernel modules.

It sounds like finit_module() module loading is used as a convenience factor
simply to take advantage of being able to ship / maintain/ compile these umh
programs as part of the kernel. Is that right?

So the finit_module() interface was just a convenient mechanism. Is that right?

Ie, if folks had these binaries in place the regular UMH interface / API could 
be
used so that these could be looked for, but instead we want to carry these
in tandem with the kernel?

If so this still seems like an overly complex way to deal with this.

> Thus, allow request_module() logic to load such
> user mode helper (umh) modules via:
> 
>   request_module("foo") ->
> call_umh("modprobe foo") ->
>   sys_finit_module(FD of /lib/modules/.../foo.ko) ->
> call_umh(struct file)

OK so the use case envisioned here was for networking code to do
something like:

if (!loaded) {
err = request_module("bpfilter");
...
}

This is visible on your third patch (this is from your RFC, not this series):

https://www.mail-archive.com/netfilter-devel@vger.kernel.org/msg11129.html

So indeed all this patch does in the end is just putting tons of wrappers in
place so that kernel code can load certain trusted UMH programs we ship, and
maintain in the kernel.

request_module() has its own world though too. How often in your proof of
concept is request_module() called? How many times do you envision it being
called?

Please review lib/test_kmod.c and tools/testing/selftests/kmod/ for testing
your stuff too or consider extending appropriately.

Are aliases something which you expect we'll need to support for these
userspace... modules?

> Such approach enables kernel to delegate functionality traditionally done
> by kernel modules into user space processes (either root or !root) and
> reduces security attack surface of such new code, meaning in case of
> potential bugs only the umh would crash but not the kernel.

Now, this sounds great, however I think that the proof of concept chosen is
pretty complex to start off with. Even if its not designed to be a real
world life use case, a much simpler proof of concept to do something
more simple may be useful, if possible. One wouldn't need to to have it
replace a kernel functionality in real life. lib/ is full of CONFIG_TEST_*
examples, a simple new stupid kernel functionality which can in turn be replaced
with a respective userspace counterpart may be useful, and both kconfig
entries would be mutually exclusive.

> Another
> advantage coming with that would be that bpfilter.ko 

You mean foo.ko

> can be debugged and
> tested out of user space as well (e.g. opening the possibility to run
> all clang sanitizers, fuzzers or test suites for checking translation).

Great too.

> Also, such architecture makes the kernel/user boundary very precise:
> control plane is done by the user space while data plane stays in the kernel.

I don't see how this is defining any boundary, I see just a loader for a
userspace program, and re-using a kernel interface known, finit_module() which
makes it convenient for us to load pre-compiled kernel junk. I'm still
not convinced this is the right approach.

> It's easy to distinguish "umh module" from traditional kernel module:

Ah you said it, "umh module". I don't see what makes it a "umh module" so far,
all we are doing is executing a 

Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-07 Thread David Miller
From: Alexei Starovoitov 
Date: Mon, 5 Mar 2018 17:34:57 -0800

> As the first step in development of bpfilter project [1] the
> request_module() code is extended to allow user mode helpers to be
> invoked.

Looks like there is a little bit of feedback requiring changes, please
take care of that and resubmit.

Thanks!


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-06 Thread Greg KH
On Tue, Mar 06, 2018 at 05:07:45PM -0800, Alexei Starovoitov wrote:
> combining multiple answers...
> 
> On 3/6/18 3:05 AM, Greg KH wrote:
> > 
> > Any chance you can add a field to your "umh module" type such that a
> > normal 'modinfo' program will be able to notice it is different easily?
> 
> ok. handling of modinfo turned out to be straightforward.
> kmod tooling worked fine with simple addition of .modinfo section.
> 
> $ modinfo bpfilter
> filename:
> /lib/modules/4.16.0-rc4-00799-g1716f0aa3039-dirty/net/bpfilter/bpfilter.ko
> umh:Y

Nice.  But perhaps spell it out, "user_mode_helper"?  Anyway,
bikesheding now, sorry, whatever you want to call it is fine with me.

> license:GPL
> 
> I will require umh=Y and license to be present.
> umh has to be set to Y for this 'umh modules'
> and taint of kernel will happen if license is not gpl.

Interesting, I like it :)


> Other modinfo like vermagic are not applicable here, since
> umh modules interact with kernel via normal kernel/user abi.

Very true.

> > > Since umh can crash, can be oom-ed by the kernel, killed by admin,
> > > the subsystem that uses them (like bpfilter) need to manage life
> > > time of umh on its own, so module infra doesn't do any accounting
> > > of them. They don't appear in "lsmod" and cannot be "rmmod".
> > > Multiple request_module("umh") will load multiple umh.ko processes.
> > > 
> > > Similar to kernel modules the kernel will be tainted if "umh module"
> > > has invalid signature.
> > 
> > Shouldn't we fail to load the "module" if the signature is not valid if
> > CONFIG_MODULE_SIG_FORCE=y is enabled, like we do for modules?  I run my
> > systems like that, and just "warning" isn't probably a good idea for
> > systems that want to enforce that everything is signed properly?
> 
> CONFIG_MODULE_SIG_FORCE=y is already handled by this patch.
> It's checked first for either .ko or umh.ko (before any elf parsing)
> and returns -ENOKEY to user space without any dmesg message.
> I think it's best to keep it as-is.
> The taint and warning is for 'undef SIG_FORCE' and when module
> is signed, but incorrectly.

Ah, sorry, I missed that, thanks for clearing it up.

> > Other than that, one minor question:
> > 
> > > @@ -1745,7 +1745,9 @@ static int do_execveat_common(int fd, struct 
> > > filename *filename,
> > >   sched_exec();
> > > 
> > >   bprm->file = file;
> > > - if (fd == AT_FDCWD || filename->name[0] == '/') {
> > > + if (!filename) {
> > > + bprm->filename = "/dev/null";
> > 
> > Why the use of "/dev/null" for the filename here, and elsewhere in the
> > code?  While I'm "sure" that everyone really does have /dev/null/
> > mounted in the root namespace, what is the use of it here?
> 
> filename is assumed to be non-null in several places further
> down and instead of hacking it everywhere it's cleaner to assign
> some string to it.
> I'll change it to filename = "none"
> Same in umh part.

Thanks, that makes sense.

greg k-h


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-06 Thread Alexei Starovoitov

combining multiple answers...

On 3/6/18 3:05 AM, Greg KH wrote:


Any chance you can add a field to your "umh module" type such that a
normal 'modinfo' program will be able to notice it is different easily?


ok. handling of modinfo turned out to be straightforward.
kmod tooling worked fine with simple addition of .modinfo section.

$ modinfo bpfilter
filename: 
/lib/modules/4.16.0-rc4-00799-g1716f0aa3039-dirty/net/bpfilter/bpfilter.ko

umh:Y
license:GPL

I will require umh=Y and license to be present.
umh has to be set to Y for this 'umh modules'
and taint of kernel will happen if license is not gpl.
Other modinfo like vermagic are not applicable here, since
umh modules interact with kernel via normal kernel/user abi.


Since umh can crash, can be oom-ed by the kernel, killed by admin,
the subsystem that uses them (like bpfilter) need to manage life
time of umh on its own, so module infra doesn't do any accounting
of them. They don't appear in "lsmod" and cannot be "rmmod".
Multiple request_module("umh") will load multiple umh.ko processes.

Similar to kernel modules the kernel will be tainted if "umh module"
has invalid signature.


Shouldn't we fail to load the "module" if the signature is not valid if
CONFIG_MODULE_SIG_FORCE=y is enabled, like we do for modules?  I run my
systems like that, and just "warning" isn't probably a good idea for
systems that want to enforce that everything is signed properly?


CONFIG_MODULE_SIG_FORCE=y is already handled by this patch.
It's checked first for either .ko or umh.ko (before any elf parsing)
and returns -ENOKEY to user space without any dmesg message.
I think it's best to keep it as-is.
The taint and warning is for 'undef SIG_FORCE' and when module
is signed, but incorrectly.



Other than that, one minor question:


@@ -1745,7 +1745,9 @@ static int do_execveat_common(int fd, struct filename 
*filename,
sched_exec();

bprm->file = file;
-   if (fd == AT_FDCWD || filename->name[0] == '/') {
+   if (!filename) {
+   bprm->filename = "/dev/null";


Why the use of "/dev/null" for the filename here, and elsewhere in the
code?  While I'm "sure" that everyone really does have /dev/null/
mounted in the root namespace, what is the use of it here?


filename is assumed to be non-null in several places further
down and instead of hacking it everywhere it's cleaner to assign
some string to it.
I'll change it to filename = "none"
Same in umh part.


Also, what "namespace" does this usermode helper run in?  I'm guessing
the "root" one, which is fine with me, but note that people have
complained in the past about other UMH running in that namespace and not
in the specific namespace of the "container" that they wanted it to run
in.


right. this is something we can tweak later if really necessary.
Right now most of the bpf is root-only, so bpfilter.ko would have to run
as cap_sys_admin for now. Later we plan to tighten it to be
cap_net_admin.


On 3/6/18 11:12 AM, Linus Torvalds wrote:
>
> particularly for the early implementation when this is a new thing, I
> really want a message like
>
> executed user process xyz-abc as a pseudo-module
>
> or something in dmesg.
>
> I do *not* want this to be a magical way to hide things.

right. no intent of hiding anything.
The first thing bpfilter.ko does is print 'Starting bpfilter'
into /dev/console.

Long term the health check of 'umh module' and interaction with
the kernel should be standardized and though they're normal processes
seen with 'ps' would be good to see them in lsmod as well.
For now it's indeed the best to do pr_warn() message like above.
Ratelimiting is probably not necessary.


On 3/6/18 12:01 PM, Andy Lutomirski wrote:
>
> I imagine that usermode tooling needs to change regardless
> because the existing tools may get rather confused if a .ko "module"

the goal is to do zero changes to user tooling.
The kmod tools handle this special .ko just fine.
Tested with modprobe, depmod, modinfo, insmod.
scripts/sign-file also works.



Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-06 Thread Chris Mason

On 6 Mar 2018, at 11:12, Linus Torvalds wrote:

On Mon, Mar 5, 2018 at 5:34 PM, Alexei Starovoitov  
wrote:
As the first step in development of bpfilter project [1] the 
request_module()
code is extended to allow user mode helpers to be invoked. Idea is 
that
user mode helpers are built as part of the kernel build and installed 
as
traditional kernel modules with .ko file extension into distro 
specified
location, such that from a distribution point of view, they are no 
different
than regular kernel modules. Thus, allow request_module() logic to 
load such

user mode helper (umh) modules via:

[,,]

I like this, but I have one request: can we make sure that this action
is visible in the system messages?

When we load a regular module, at least it shows in lsmod afterwards,
although I have a few times wanted to really see module load as an
event in the logs too.

When we load a module that just executes a user program, and there is
no sign of it in the module list, I think we *really* need to make
that event show to the admin some way.

.. and yes, maybe we'll need to rate-limit the messages, and maybe it
turns out that I'm entirely wrong and people will hate the messages
after they get used to the concept of these pseudo-modules, but
particularly for the early implementation when this is a new thing, I
really want a message like

 executed user process xyz-abc as a pseudo-module

or something in dmesg.

I do *not* want this to be a magical way to hide things.


Especially early on, this makes a lot of sense.  But I wanted to plug 
bps and the hopefully growing set of bpf introspection tools:


https://github.com/iovisor/bcc/blob/master/introspection/bps_example.txt

Long term these are probably a good place to tell the admin what's going 
on.


-chris


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-06 Thread Linus Torvalds
On Tue, Mar 6, 2018 at 12:01 PM, Andy Lutomirski  wrote:
>
> I assume I'm missing some context here, but why does this need to be
> handled by the kernel rather than, say, a change to how modprobe
> works?

Honestly, the less we have to mess with user-mode tooling, the better.

We've been *so* much better off moving most of the module loading
logic to the kernel, we should not go back in the old broken
direction.

I do *not* want the kmod project that is then taken over by systemd,
and breaks it the same way they broke firmware loading.

Keep modprobe doing one thing, and one thing only: track dependencies
and mindlessly just load the modules. Do *not* ask for it to do
anything else.

Right now kmod is a nice simple project. Lots of testsuite stuff, and
a very clear goal. Let's keep kmod doing one thing, and not even have
to care about internal kernel decisions like "oh, this module might
not be a module, but an executable".

If anything, I think we want to keep our options open, in the case we
need or want to ever consider short-circuiting things and allowing
direct loading of the simple cases and bypassing modprobe entirely.

   Linus


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-06 Thread Andy Lutomirski
On Tue, Mar 6, 2018 at 1:34 AM, Alexei Starovoitov  wrote:
> As the first step in development of bpfilter project [1] the request_module()
> code is extended to allow user mode helpers to be invoked. Idea is that
> user mode helpers are built as part of the kernel build and installed as
> traditional kernel modules with .ko file extension into distro specified
> location, such that from a distribution point of view, they are no different
> than regular kernel modules. Thus, allow request_module() logic to load such
> user mode helper (umh) modules via:
>
>   request_module("foo") ->
> call_umh("modprobe foo") ->
>   sys_finit_module(FD of /lib/modules/.../foo.ko) ->
> call_umh(struct file)
>

I assume I'm missing some context here, but why does this need to be
handled by the kernel rather than, say, a change to how modprobe
works?  I imagine that usermode tooling needs to change regardless
because the existing tools may get rather confused if a .ko "module"
is really a dynamically liked program.  I notice that you're using
ET_EXEC in your example, and that will probably avoid problems, but I
imagine that some distros would much rather use ET_DYN.


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-06 Thread Linus Torvalds
On Mon, Mar 5, 2018 at 5:34 PM, Alexei Starovoitov  wrote:
> As the first step in development of bpfilter project [1] the request_module()
> code is extended to allow user mode helpers to be invoked. Idea is that
> user mode helpers are built as part of the kernel build and installed as
> traditional kernel modules with .ko file extension into distro specified
> location, such that from a distribution point of view, they are no different
> than regular kernel modules. Thus, allow request_module() logic to load such
> user mode helper (umh) modules via:
[,,]

I like this, but I have one request: can we make sure that this action
is visible in the system messages?

When we load a regular module, at least it shows in lsmod afterwards,
although I have a few times wanted to really see module load as an
event in the logs too.

When we load a module that just executes a user program, and there is
no sign of it in the module list, I think we *really* need to make
that event show to the admin some way.

.. and yes, maybe we'll need to rate-limit the messages, and maybe it
turns out that I'm entirely wrong and people will hate the messages
after they get used to the concept of these pseudo-modules, but
particularly for the early implementation when this is a new thing, I
really want a message like

 executed user process xyz-abc as a pseudo-module

or something in dmesg.

I do *not* want this to be a magical way to hide things.

  Linus


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-06 Thread Greg KH
On Mon, Mar 05, 2018 at 05:34:57PM -0800, Alexei Starovoitov wrote:
> As the first step in development of bpfilter project [1] the request_module()
> code is extended to allow user mode helpers to be invoked. Idea is that
> user mode helpers are built as part of the kernel build and installed as
> traditional kernel modules with .ko file extension into distro specified
> location, such that from a distribution point of view, they are no different
> than regular kernel modules. Thus, allow request_module() logic to load such
> user mode helper (umh) modules via:
> 
>   request_module("foo") ->
> call_umh("modprobe foo") ->
>   sys_finit_module(FD of /lib/modules/.../foo.ko) ->
> call_umh(struct file)
> 
> Such approach enables kernel to delegate functionality traditionally done
> by kernel modules into user space processes (either root or !root) and
> reduces security attack surface of such new code, meaning in case of
> potential bugs only the umh would crash but not the kernel. Another
> advantage coming with that would be that bpfilter.ko can be debugged and
> tested out of user space as well (e.g. opening the possibility to run
> all clang sanitizers, fuzzers or test suites for checking translation).
> Also, such architecture makes the kernel/user boundary very precise:
> control plane is done by the user space while data plane stays in the kernel.
> 
> It's easy to distinguish "umh module" from traditional kernel module:
> 
> $ readelf -h bld/net/bpfilter/bpfilter.ko|grep Type
>   Type:  EXEC (Executable file)
> $ readelf -h bld/net/ipv4/netfilter/iptable_filter.ko |grep Type
>   Type:  REL (Relocatable file)

Any chance you can add a field to your "umh module" type such that a
normal 'modinfo' program will be able to notice it is different easily?

> Since umh can crash, can be oom-ed by the kernel, killed by admin,
> the subsystem that uses them (like bpfilter) need to manage life
> time of umh on its own, so module infra doesn't do any accounting
> of them. They don't appear in "lsmod" and cannot be "rmmod".
> Multiple request_module("umh") will load multiple umh.ko processes.
> 
> Similar to kernel modules the kernel will be tainted if "umh module"
> has invalid signature.

Shouldn't we fail to load the "module" if the signature is not valid if
CONFIG_MODULE_SIG_FORCE=y is enabled, like we do for modules?  I run my
systems like that, and just "warning" isn't probably a good idea for
systems that want to enforce that everything is signed properly?

Other than that, one minor question:

> @@ -1745,7 +1745,9 @@ static int do_execveat_common(int fd, struct filename 
> *filename,
>   sched_exec();
>  
>   bprm->file = file;
> - if (fd == AT_FDCWD || filename->name[0] == '/') {
> + if (!filename) {
> + bprm->filename = "/dev/null";

Why the use of "/dev/null" for the filename here, and elsewhere in the
code?  While I'm "sure" that everyone really does have /dev/null/
mounted in the root namespace, what is the use of it here?

Also, what "namespace" does this usermode helper run in?  I'm guessing
the "root" one, which is fine with me, but note that people have
complained in the past about other UMH running in that namespace and not
in the specific namespace of the "container" that they wanted it to run
in.

Anyway, this is crazy stuff, but I like the idea and have no objection
to it overall :)

thanks,

greg k-h


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-05 Thread Alexei Starovoitov

On 3/5/18 6:13 PM, Randy Dunlap wrote:

Hi,

On 03/05/2018 05:34 PM, Alexei Starovoitov wrote:


diff --git a/kernel/module.c b/kernel/module.c
index ad2d420024f6..6cfa35795741 100644
--- a/kernel/module.c
+++ b/kernel/module.c



@@ -3669,6 +3683,17 @@ static int load_module(struct load_info *info, const 
char __user *uargs,
if (err)
goto free_copy;

+   if (info->hdr->e_type == ET_EXEC) {
+#ifdef CONFIG_MODULE_SIG
+   if (!info->sig_ok) {
+   pr_notice_once("umh %s verification failed: signature and/or 
required key missing - tainting kernel\n",


That's not a very friendly message to tell a user.  "umh" eh?


umh is an abbreviation known to kernel newbies:
https://kernelnewbies.org/KernelProjects/usermode-helper-enhancements

The rest of the message is copy paste of existing one.


+  info->file->f_path.dentry->d_name.name);
+   add_taint(TAINT_UNSIGNED_MODULE, LOCKDEP_STILL_OK);
+   }


And since the signature failed, why is it being loaded at all?


because this is how regular kernel modules deal with it.
sig_enforce is handled earlier.


Is this in the "--force" load path?


--force forces modver and modmagic. These things don't apply here.



Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-05 Thread Randy Dunlap
Hi,

On 03/05/2018 05:34 PM, Alexei Starovoitov wrote:

> diff --git a/kernel/module.c b/kernel/module.c
> index ad2d420024f6..6cfa35795741 100644
> --- a/kernel/module.c
> +++ b/kernel/module.c

> @@ -3669,6 +3683,17 @@ static int load_module(struct load_info *info, const 
> char __user *uargs,
>   if (err)
>   goto free_copy;
>  
> + if (info->hdr->e_type == ET_EXEC) {
> +#ifdef CONFIG_MODULE_SIG
> + if (!info->sig_ok) {
> + pr_notice_once("umh %s verification failed: signature 
> and/or required key missing - tainting kernel\n",

That's not a very friendly message to tell a user.  "umh" eh?

> +info->file->f_path.dentry->d_name.name);
> + add_taint(TAINT_UNSIGNED_MODULE, LOCKDEP_STILL_OK);
> + }

And since the signature failed, why is it being loaded at all?
Is this in the "--force" load path?

> +#endif
> + return 0;
> + }
> +
>   /* Figure out module layout, and allocate all the memory. */
>   mod = layout_and_allocate(info, flags);
>   if (IS_ERR(mod)) {

thanks,
-- 
~Randy