Re: [PATCH] [RFC][WIP] namespace.c: Allow some unprivileged proc mounts when not fully visible

2018-04-13 Thread Djalal Harouni
On Wed, Apr 4, 2018 at 4:45 PM, Eric W. Biederman  wrote:
[...]
>
> The only option I have seen proposed that might qualify as something
> general purpose and simple is a new filesystem that is just the process
> directories of proc.  As there would in essence be no files that would
> need restrictions it would be safe to allow anyone to mount without
> restriction.
>
Eric, there is a series for this:
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1533642.html

patch on top for pids:
https://github.com/legionus/linux/commit/993a2a5b9af95b0ac901ff41d32124b72ed676e3

It was reviewed, and suggestions from Andy's and Al Viro's feedback
were integrated, thanks. It works on Debian, Ubuntu and others, but not
on Fedora due to a bug with dracut+systemd.

I do not have time to work on it now, so anyone can just pick them up.

Thanks!


-- 
tixxdz


Re: module: add debugging alias parsing support

2017-12-04 Thread Djalal Harouni
On Mon, Dec 4, 2017 at 2:58 PM, Jessica Yu <j...@kernel.org> wrote:
> +++ Djalal Harouni [04/12/17 10:01 +0100]:
>>
>> On Thu, Nov 30, 2017 at 7:39 PM, Luis R. Rodriguez <mcg...@kernel.org>
>> wrote:
>>>
>>> On Thu, Nov 30, 2017 at 02:17:11PM +0100, Jessica Yu wrote:
>>>>
>>>> Just some quick questions - are there any plans to use these in-kernel
>>>> module aliases anywhere else? Or are you using them just for debugging?
>>>
>>>
>>> As-is for now just debugging, but this could also more easily enable folks to
>>> prototype further evaluation of its uses. IMHO just having this at least posted
>>> online should suffice the later aspect of enabling folks to prototype.
>>
>>
>> I confirm that, after the module auto-load discussion where it is
>> clear that we need to improve the infrastructure, this debug
>> information may save some time, maybe someone can automate a script go
>> through modules and then on filesystem,
>
>
>> however these patches may show
>> which module lead to load another one, right ? on userspace if there
>> are multiple dependencies it can be difficult I think.
>
>
> Hm? I'm confused by what you mean here. The patchset just saves and
> prints a module's aliases on module load if the debug option is
> enabled. There's no dependency tracking here; that's modprobe's job.
> And if you need to see which additional modules are being loaded as a
> result of a module load there's already modprobe --verbose and
> modules.dep..

Yes, I was referring to the order of the printed kernel log lines: if
two modules depend on the same one, we know which one triggered the
load first. That makes things a bit easier in the auto-loading context,
for example with crypto modules that can be triggered from anywhere.

Thanks!

-- 
tixxdz


Re: module: add debugging alias parsing support

2017-12-04 Thread Djalal Harouni
On Thu, Nov 30, 2017 at 7:39 PM, Luis R. Rodriguez  wrote:
> On Thu, Nov 30, 2017 at 02:17:11PM +0100, Jessica Yu wrote:
>> Just some quick questions - are there any plans to use these in-kernel
>> module aliases anywhere else? Or are you using them just for debugging?
>
> As-is for now just debugging, but this could also more easily enable folks to
> prototype further evaluation of its uses. IMHO just having this at least posted
> online should suffice the later aspect of enabling folks to prototype.

I confirm that. After the module auto-load discussion, where it became
clear that we need to improve the infrastructure, this debug
information may save some time. Maybe someone could automate this with
a script that goes through the modules on the filesystem; however,
these patches may show which module led to loading another one, right?
In userspace, if there are multiple dependencies, that can be
difficult, I think.


>
> You're right that one can find aliases in userspace. One of the benefits
> of having this dump things on the kernel log is just that you can easily
> get the aliases printed out for all modules actually loaded for your system
> without much effort. I did find this useful when debugging and found it much
> more convenient than scraping modules one by one by hand in userspace.
>
> I had this implemented since 2016, and I had some ideas to use them in a
> functional way, however I first had to knock out a series of of fixes for
> kernel/kmod.c and setting up a baseline test infrastructure for kmod
> (tools/testing/selftests/kmod/ and lib/test_kmod.c) as such I hadn't had time
> to yet come around and finish benchmarking the alias enhancement ideas I had
> started evaluating.
>
> As such having aliases in-kernel currently are only useful for debugging and
> prototyping.

I would say so, though I have no strong argument on whether it should
be mainlined. Luis, in your commit log you say:

"Obviously userspace can be buggy though, and it can lie to us. We
currently have no easy way to determine this."

Could you please share some details here? How can userspace be buggy?

Thank you!

>   Luis



-- 
tixxdz


Re: [kernel-hardening] Re: [PATCH v5 next 5/5] net: modules: use request_module_cap() to load 'netdev-%s' modules

2017-11-30 Thread Djalal Harouni
On Thu, Nov 30, 2017 at 3:16 PM, Theodore Ts'o <ty...@mit.edu> wrote:
> On Thu, Nov 30, 2017 at 09:50:27AM +0100, Djalal Harouni wrote:
>> In embedded systems we can't maintain a SELinux policy, distro man
>> power hardly manage. We have abstracted seccomp etc, but the kernel
>> inherited the difficult multiplex things, plus all other paths that
>> trigger this.
>
>> Yes, but it is hard to maintain a whitelist policy, the code is hardly
>> maintained...
>
> So this is the part that scares me to death about IOT, and why I tell
> everyone to ***never*** trust an IOT device on their home network, and
> ***never*** trust it with anything you don't mind splattered all over
> the front page of NY Times and RT / Sputnick news.

Yes.

For your pleasure:
https://techcrunch.com/2017/04/25/brickerbot-is-a-vigilante-worm-that-destroys-insecure-iot-devices/
bricked millions of devices through a stupid busybox remote port.
https://en.wikipedia.org/wiki/Mirai_(malware) another million bots
used to disrupt Netflix, Twitter and others; I don't know the details.
...

> You're saying that you want to use modules (as opposed to compile
> everything tightly down to just what you need for the embedded
> system); that the code is "hardly maintained".  And yet we're supposed
> to consider it trustworthy?

I didn't say that.

> If that's the case, turning off implicit module loading sounds and
> thinking that this will somehow be a magic wand sounds crazy.

Product costs decide. Web developers, JavaScript developers, big data
analysts and electronics engineers all want to use Linux to prototype
an IoT product and sell it within months; they will take any
kernel+userspace, add their value on top and sell it. It makes no sense
to expect that a web developer who wants to sell a node.js app as an
IoT product has to compile a kernel and do all the other stuff; they
all re-use the same layer and the same config for everything. Requiring
everyone to compile their own kernel does not make much sense. Safe
default behaviour is what we should provide.

Thanks!

>  - Ted



-- 
tixxdz


Re: [PATCH v5 next 3/5] modules:capabilities: automatic module loading restriction

2017-11-30 Thread Djalal Harouni
On Thu, Nov 30, 2017 at 2:23 AM, Luis R. Rodriguez <mcg...@kernel.org> wrote:
> On Mon, Nov 27, 2017 at 06:18:36PM +0100, Djalal Harouni wrote:
>> diff --git a/include/linux/module.h b/include/linux/module.h
>> index 5cbb239..c36aed8 100644
>> --- a/include/linux/module.h
>> +++ b/include/linux/module.h
>> @@ -261,7 +261,16 @@ struct notifier_block;
>>
>>  #ifdef CONFIG_MODULES
>>
>> -extern int modules_disabled; /* for sysctl */
>> +enum {
>> + MODULES_AUTOLOAD_ALLOWED= 0,
>> + MODULES_AUTOLOAD_PRIVILEGED = 1,
>> + MODULES_AUTOLOAD_DISABLED   = 2,
>> +};
>> +
>
> Can you kdocify these and add a respective rst doc file?  Maybe stuff your
> extensive docs which you are currently adding to
> Documentation/sysctl/kernel.txt to this new file and in kernel.txt just refer
> to it. This way this can be also nicely visibly documented on the web with the
> new sphinx.
>
> This way you can take advantage of the kdocs you are already adding and refer
> to them.

Alright, I'll do it in the next series next week. We'll change the
semantics as requested by Linus and Kees here:
http://www.openwall.com/lists/kernel-hardening/2017/11/29/38

in order to block the privilege escalation through the usermode helper.
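
As a rough illustration of what a kernel-doc comment on the quoted enum
might look like, here is a minimal standalone sketch. The enum tag
modules_autoload_mode, the per-value descriptions (inferred from the
constant names only) and the helper autoload_permitted() are
assumptions for illustration; none of them are taken from the patch,
and the semantics are due to change in the next series anyway.

```c
#include <assert.h>
#include <stdbool.h>

/**
 * enum modules_autoload_mode - policy for automatic module loading
 * @MODULES_AUTOLOAD_ALLOWED:    any task may trigger an autoload (assumed)
 * @MODULES_AUTOLOAD_PRIVILEGED: only privileged tasks may trigger one (assumed)
 * @MODULES_AUTOLOAD_DISABLED:   nobody may trigger an autoload (assumed)
 *
 * The enum in the quoted hunk is anonymous; the tag is added here only
 * so that the kernel-doc comment has something to attach to.
 */
enum modules_autoload_mode {
	MODULES_AUTOLOAD_ALLOWED    = 0,
	MODULES_AUTOLOAD_PRIVILEGED = 1,
	MODULES_AUTOLOAD_DISABLED   = 2,
};

/*
 * Hypothetical userspace model of the decision, not kernel code:
 * "privileged" stands in for whatever capability check applies.
 */
static bool autoload_permitted(enum modules_autoload_mode mode, bool privileged)
{
	assert(mode >= MODULES_AUTOLOAD_ALLOWED &&
	       mode <= MODULES_AUTOLOAD_DISABLED);

	switch (mode) {
	case MODULES_AUTOLOAD_ALLOWED:
		return true;
	case MODULES_AUTOLOAD_PRIVILEGED:
		return privileged;
	case MODULES_AUTOLOAD_DISABLED:
	default:
		return false;
	}
}
```

With this model, autoload_permitted(MODULES_AUTOLOAD_PRIVILEGED, false)
is false, while the same call with MODULES_AUTOLOAD_ALLOWED is true.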


>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>> index 2fb4e27..0b6f0c8 100644
>> --- a/kernel/sysctl.c
>> +++ b/kernel/sysctl.c
>> @@ -683,6 +688,15 @@ static struct ctl_table kern_table[] = {
>>   .extra1 = &one,
>>   .extra2 = &one,
>>   },
>> + {
>> + .procname   = "modules_autoload_mode",
>> + .data   = &modules_autoload_mode,
>> + .maxlen = sizeof(int),
>> + .mode   = 0644,
>> + .proc_handler   = modules_autoload_dointvec_minmax,
>
> It would seem this is a unint ... so why not reflect that?
>
>> @@ -2499,6 +2513,20 @@ static int proc_dointvec_minmax_sysadmin(struct ctl_table *table, int write,
>>  }
>>  #endif
>>
>> +#ifdef CONFIG_MODULES
>> +static int modules_autoload_dointvec_minmax(struct ctl_table *table, int write,
>> + void __user *buffer, size_t *lenp, loff_t *ppos)
>> +{
>> + /*
>> +  * Only CAP_SYS_MODULE in init user namespace are allowed to change this
>> +  */
>> + if (write && !capable(CAP_SYS_MODULE))
>> + return -EPERM;
>> +
>> + return proc_dointvec_minmax(table, write, buffer, lenp, ppos);
>> +}
>> +#endif
>
> We now have proc_douintvec_minmax().
>

Yes, however in that same response by Linus it was suggested to drop
the sysctl completely, so next iterations will not have this code.

Thank you for the review!

-- 
tixxdz


Re: [kernel-hardening] Re: [PATCH v5 next 5/5] net: modules: use request_module_cap() to load 'netdev-%s' modules

2017-11-30 Thread Djalal Harouni
On Thu, Nov 30, 2017 at 7:51 AM, Daniel Micay  wrote:
[...]
> Lots of potential module attack surface also gets eliminated by default
> via their SELinux whitelists for /dev, /sys, /proc, debugfs, ioctl
> commands, etc. The global seccomp whitelist might be relevant in some
> cases too.

In embedded systems we can't maintain an SELinux policy; even distros
with their manpower hardly manage. We have abstracted seccomp etc., but
the kernel inherited the difficult multiplexing interfaces, plus all
the other paths that trigger this.


> Android devices like to build everything into the kernel too, so even if
> they weren't using a module this feature wouldn't usually help them. It
> would need to work like this existing sysctl:
>
> net.ipv4.tcp_available_congestion_control = cubic reno lp
>
> i.e. whitelists for functionality offered by the modules, not just
> whether they can be loaded.

Yes, but it is hard to maintain a whitelist policy, the code is hardly
maintained... If you include everything you should have an LSM policy
or something like that, and compiling kernels is an expert thing.
Otherwise, IMHO, the kernel should provide secure default behaviour for
how to load or add new functionality to the running kernel. From a user
perspective, what should be there is a "yes/no" switch that a
privileged entity will *understand* and take responsibility for. The
switch or flag as discussed here is local to processes; the sysctl will
be removed. IMO it should be designed from the userspace point of view.
As an example, the sysctl:

net.ipv4.tcp_available_congestion_control = cubic reno lp

is a kernel thing and too technical: userspace developers, admins or
privileged entities will not understand what cubic or reno mean. Doing
the same per functionality like this seems too much of a burden
compared to the use case. The kernel can maybe do this to advance the
state of the art of the networking stack and for advanced cases, but
IMHO a sane default behaviour plus an abstracted process/sandbox
"yes/no" flag is what should be provided for most others, userspace
developers and humans, and we need the kernel to help here.
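
To make the comparison concrete: a per-functionality whitelist like the
sysctl above ends up being consumed as a list of tokens. Below is a
minimal sketch of such a membership check, assuming the list is
whitespace-separated as in the example line. list_contains() is a
hypothetical helper, not a kernel or libc function.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true if `name` appears as a whole whitespace-separated token
 * in `list` (for example "cubic reno lp"). Partial matches such as
 * "ren" against "reno" do not count.
 */
static bool list_contains(const char *list, const char *name)
{
	size_t len;

	assert(list != NULL && name != NULL);

	len = strlen(name);
	if (len == 0)
		return false;

	while (*list) {
		size_t tok;

		/* Skip separators, then measure the next token. */
		list += strspn(list, " \t\n");
		tok = strcspn(list, " \t\n");

		if (tok == len && strncmp(list, name, len) == 0)
			return true;

		list += tok;
	}
	return false;
}
```

For the example line, list_contains("cubic reno lp", "reno") is true
and list_contains("cubic reno lp", "bbr") is false. A per-process
"yes/no" flag would need no such per-feature knowledge from the admin.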

It seems that Linus and Kees agreed on this direction, which allows me
to follow up.

Thanks!


-- 
tixxdz


Re: [kernel-hardening] Re: [PATCH v5 next 5/5] net: modules: use request_module_cap() to load 'netdev-%s' modules

2017-11-28 Thread Djalal Harouni
On Wed, Nov 29, 2017 at 12:23 AM, Theodore Ts'o  wrote:
> On Tue, Nov 28, 2017 at 01:33:40PM -0800, Kees Cook wrote:
>> As I've said before, this isn't a theoretical attack surface. This
>> year alone there have been three known-exploitable flaws exposed by
>> autoloading:
>>
>> The exploit for CVE-2017-2636 uses int n_hdlc = N_HDLC; ioctl(fd,
>> TIOCSETD, &n_hdlc) [1]. This is using the existing "tty-ldisc-"
>> prefix, and is intentionally unprivileged.
>>
>> The exploit for CVE-2017-6074 uses socket(PF_INET6, SOCK_DCCP,
>> IPPROTO_IP) [2]. This is using the existing proto prefix, and is
>> intentionally unprivileged.
>
> So in these two cases, if the kernel was built w/o modules, and HDLC
> and DCCP was built-in, you'd be screwed, then?
>
> Is the goal here to protect people using distro kernels which build
> the world as modules, including dodgy pieces of kernel code that are
> bug-ridden?
>
> If so, then presumably 90% of the problem you've cited can be done by
> creating a script which takes a look of the modules that are normally
> in use once the machine is in production, and then deleting everything
> else?  Correct?
>
> And yes, this will potentially break some users, but the security
> folks who are advocating for the more aggressive version of this
> change seem to be OK with breaking users, so they can do this without
> making kernel changes.  Good luck getting Red Hat and SuSE to accept
> such a change, though

The patches do not change the default, they make it easy for users, and
we have requests for this. Not all the world is Red Hat / SuSE. I build
embedded Linux for clients when I manage to have some, and I clearly
would have set this for my clients, since most of them won't be able to
afford all the signing and complexity. How should I allow modules to be
loaded, unloaded and replaced with newer versions, but restrict some of
their apps from triggering autoloading? "modules_disabled=1" is not
practical. I can't build a perfect version for every use case, and they
have started to ship apps for IoT, even containers for IoT. Yes, they
do it, and they use the same OS for various use cases! So it is not
about those; even embedded vendors have one single shared layer that
they use for all their products.

For distros, the target is also containers and sandboxes, and they are
already interested in it.

P.S. Please note that the cover letter already mentions that this is for embedded and IoT.

> - Ted



-- 
tixxdz


Re: [PATCH v5 next 1/5] modules:capabilities: add request_module_cap()

2017-11-28 Thread Djalal Harouni
On Tue, Nov 28, 2017 at 11:18 PM, Luis R. Rodriguez <mcg...@kernel.org> wrote:
> On Tue, Nov 28, 2017 at 10:33:27PM +0100, Djalal Harouni wrote:
>> On Tue, Nov 28, 2017 at 10:16 PM, Luis R. Rodriguez <mcg...@kernel.org> wrote:
>> > On Tue, Nov 28, 2017 at 12:11:34PM -0800, Kees Cook wrote:
>> >> On Tue, Nov 28, 2017 at 11:14 AM, Luis R. Rodriguez <mcg...@kernel.org> wrote:
>> >> > kmod is just a helper to poke userspace to load a module, that's it.
>> >> >
>> >> > The old init_module() and newer finit_module() do the real handiwork of
>> >> > module loading, and both currently only use may_init_module():
>> >> >
>> >> > static int may_init_module(void)
>> >> > {
>> >> >         if (!capable(CAP_SYS_MODULE) || modules_disabled)
>> >> >                 return -EPERM;
>> >> >
>> >> >         return 0;
>> >> > }
>> >> >
>> >> > This begs the question:
>> >> >
>> >> >   o If userspace just tries to just use raw finit_module() do we want
>> >> >     similar checks?
>> >> >
>> >> > Otherwise, correct me if I'm wrong this all seems pointless.
>> >>
>> >> Hm? That's direct-loading, not auto-loading. This series is only about
>> >> auto-loading.
>> >
>> > And *all* auto-loading uses aliases? What's the difference between
>> > auto-loading and direct-loading?
>>
>> Not all auto-loading uses aliases, auto-loading is when kernel code
>> calls request_module() to loads the feature that was not present,
>
> It seems the actual interest here is system call implicated request_module()
> calls? Because there are uses of request_module() which may be module hacks,
> and not implicated via system calls.

Indeed.


>> and direct-loading in this thread is the direct syscalls like
>> finit_module().
>
> OK.
>
>> >> We already have a global sysctl for blocking direct-loading
>> >> (modules_disabled).
>> >
>> > My point was that even if you have a CAP_NET_ADMIN check on request_module(),
>> > finit_module() will not check for it, so a crafty userspace could still try
>> > to just finit_module() directly, and completely then bypass the CAP_NET_ADMIN
>> > check.
>>
>> The finit_module() uses CAP_SYS_MODULE which should allow all modules
>> and in this context it should be more privileged than CAP_NET_ADMIN
>> which is only for "netdev-%s" (to not load arbitrary modules with it).
>>
>> finit_module() coming from request_module() always has the
>> CAP_NET_ADMIN, hence the check is done before.
>
> But since CAP_SYS_MODULE is more restrictive, what's the point in checking
> for CAP_NET_ADMIN?

For backward compatibility with 'netdev' modules since it is for those.
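
A simplified model of that split, as a sketch only: with CAP_NET_ADMIN
the request is confined to the "netdev-" alias namespace, while
CAP_SYS_MODULE may ask for the bare name. The function
netdev_autoload_alias() and its boolean capability parameters are
hypothetical stand-ins for the kernel's capable() checks, and the
fallback logic is simplified.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the module name that a caller with the given (modelled)
 * capabilities would be allowed to request for network device `name`.
 * Returns 0 on success or a negative errno value.
 */
static int netdev_autoload_alias(char *buf, size_t len, const char *name,
				 bool cap_net_admin, bool cap_sys_module)
{
	int n;

	assert(buf != NULL && name != NULL);

	if (len == 0 || strlen(name) == 0)
		return -EINVAL;

	if (cap_net_admin)
		/* Restricted to the "netdev-" namespace, not arbitrary modules. */
		n = snprintf(buf, len, "netdev-%s", name);
	else if (cap_sys_module)
		/* The more privileged capability may request any module name. */
		n = snprintf(buf, len, "%s", name);
	else
		return -EPERM;

	if (n < 0 || (size_t)n >= len)
		return -ENAMETOOLONG;

	return 0;
}
```

The point of the model is that CAP_NET_ADMIN alone never yields a name
outside the "netdev-" prefix.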


>> > So unless I'm missing something, I see no point in adding extra checks for
>> > request_module() but nothing for the respective load_module().
>>
>> I see, request_module() is called from kernel context which runs in
>> init namespace with full capabilities, the spawned userspace modprobe
>> will get CAP_SYS_MODULE and all other caps, then after comes modprobe
>> and load_module().
>
> Right, so defining the gains of adding this extra check is not very clear
> yet. It would seem a benefit exists, what is it?

It will be able to decide whether request_module() should continue loading
the module or deny it, which prevents spawning the *privileged*
usermode helper. This is all based on whether we are allowed to load new
features or not; IOW, I don't want to allow new features or module
autoloading from now on. As stated in the cover letter, this has various
benefits, including security and reducing the amount of kernel code running,
but also not exposing new features such as tunneling to everyone.
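
The ordering described above (policy check first, privileged helper only
afterwards) can be sketched in plain userspace C. This is only an
illustration, not the code from the series: autoload_allowed,
fake_spawn_modprobe() and request_module_sim() are hypothetical names, and
the "helper" is just a counter.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the "may this task autoload?" setting. */
static bool autoload_allowed = true;

/* Counts how many times the simulated privileged helper was started. */
static int helper_spawns;

/* Stand-in for the step that would run modprobe with full capabilities. */
static int fake_spawn_modprobe(const char *name)
{
	(void)name;
	helper_spawns++;
	return 0;
}

/* The policy check runs first, so a denied request never reaches the helper. */
static int request_module_sim(const char *name)
{
	assert(name != NULL);

	if (!autoload_allowed)
		return -EPERM;

	return fake_spawn_modprobe(name);
}
```

With autoload_allowed cleared, request_module_sim() returns -EPERM and
helper_spawns stays unchanged, which is the property being argued for here.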


>> Btw as suggested by Linus I will update with request_module_cap() and
>> I can offer my help maintaining these bits too.
>
> Can you start by extending lib/test_module.c and
> tools/testing/selftests/kmod/kmod.sh with a proof of concept of the gains
> here, as well as ensuring things work as expected ?

Alright Luis, thanks for the hint, yes I will make sure to cover these.

For gains, Kees already answered in the other email, and please check
the DCCP exploit and others linked in the cover letter.


Thank you!

>   Luis



-- 
tixxdz


Re: [PATCH v5 next 1/5] modules:capabilities: add request_module_cap()

2017-11-28 Thread Djalal Harouni
On Tue, Nov 28, 2017 at 10:16 PM, Luis R. Rodriguez  wrote:
> On Tue, Nov 28, 2017 at 12:11:34PM -0800, Kees Cook wrote:
>> On Tue, Nov 28, 2017 at 11:14 AM, Luis R. Rodriguez  
>> wrote:
>> > kmod is just a helper to poke userspace to load a module, that's it.
>> >
>> > The old init_module() and newer finit_module() do the real handiwork of
>> > module loading, and both currently only use may_init_module():
>> >
>> > static int may_init_module(void)
>> > {
>> > 	if (!capable(CAP_SYS_MODULE) || modules_disabled)
>> > 		return -EPERM;
>> >
>> > 	return 0;
>> > }
>> >
>> > This begs the question:
>> >
>> >   o If userspace just tries to just use raw finit_module() do we want
>> >     similar checks?
>> >
>> > Otherwise, correct me if I'm wrong this all seems pointless.
>>
>> Hm? That's direct-loading, not auto-loading. This series is only about
>> auto-loading.
>
> And *all* auto-loading uses aliases? What's the difference between
> auto-loading and direct-loading?

Not all auto-loading uses aliases, auto-loading is when kernel code
calls request_module() to load a feature that is not present, and
direct-loading in this thread is the direct syscalls like
finit_module().
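
For contrast with auto-loading, here is a minimal sketch of what
direct-loading looks like from userspace, assuming a Linux system with
glibc. try_direct_load() is a hypothetical helper name; the syscall itself
is the real finit_module(2), invoked through syscall(2).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Direct-loading: userspace hands a module image to the kernel itself.
 * Returns 0 on success or a negative errno value.
 */
static int try_direct_load(const char *path)
{
	int fd;
	int ret = 0;

	assert(path != NULL);

	fd = open(path, O_RDONLY | O_CLOEXEC);
	if (fd < 0)
		return -errno;

	/* The kernel side is where CAP_SYS_MODULE and modules_disabled apply. */
	if (syscall(SYS_finit_module, fd, "", 0) != 0)
		ret = -errno;

	close(fd);
	return ret;
}
```

A caller without CAP_SYS_MODULE should expect the syscall to fail, while a
missing file fails earlier in open(). No request_module() call is involved
on this path.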

>> We already have a global sysctl for blocking direct-loading 
>> (modules_disabled).
>
> My point was that even if you have a CAP_NET_ADMIN check on request_module(),
> finit_module() will not check for it, so a crafty userspace could still try
> to just finit_module() directly, and completely then bypass the CAP_NET_ADMIN
> check.

The finit_module() uses CAP_SYS_MODULE which should allow all modules
and in this context it should be more privileged than CAP_NET_ADMIN
which is only for "netdev-%s" (to not load arbitrary modules with it).

finit_module() coming from request_module() always has the
CAP_NET_ADMIN, hence the check is done before.
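
The relation between the two capabilities described above can be modelled
with a small userspace sketch. It is a simplified model, not kernel code:
the SIM_CAP_* masks and autoload_cap_ok() are hypothetical names, and only
the bit positions mirror the real CAP_NET_ADMIN and CAP_SYS_MODULE numbers.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Illustrative bits; positions mirror CAP_NET_ADMIN (12) and CAP_SYS_MODULE (16). */
#define SIM_CAP_NET_ADMIN  (1u << 12)
#define SIM_CAP_SYS_MODULE (1u << 16)

/*
 * CAP_SYS_MODULE is the broad capability and permits any module name.
 * CAP_NET_ADMIN is only honoured for names carrying the "netdev-" prefix,
 * so it cannot be used to pull in arbitrary modules.
 */
static bool autoload_cap_ok(unsigned int caps, const char *name)
{
	assert(name != NULL);

	if (caps & SIM_CAP_SYS_MODULE)
		return true;

	if ((caps & SIM_CAP_NET_ADMIN) && strncmp(name, "netdev-", 7) == 0)
		return true;

	return false;
}
```

Under this model a task holding only CAP_NET_ADMIN may request
"netdev-tun0" but not "xfs".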

> So unless I'm missing something, I see no point in adding extra checks for
> request_module() but nothing for the respective load_module().

I see, request_module() is called from kernel context which runs in
init namespace with full capabilities, the spawned userspace modprobe
will get CAP_SYS_MODULE and all other caps, then after comes modprobe
and load_module().

Btw as suggested by Linus I will update with request_module_cap() and
I can offer my help maintaining these bits too.


>
>   Luis



-- 
tixxdz


Re: [kernel-hardening] Re: [PATCH v5 next 5/5] net: modules: use request_module_cap() to load 'netdev-%s' modules

2017-11-28 Thread Djalal Harouni
On Tue, Nov 28, 2017 at 9:33 PM, Linus Torvalds
 wrote:
> On Tue, Nov 28, 2017 at 12:20 PM, Kees Cook  wrote:
>>
>> So what's the right path forward for allowing a way to block
>> autoloading? Separate existing request_module() calls into "must be
>> privileged" and "can be unpriv" first, then rework the series to deal
>> with the "unpriv okay" subset?
>
> So once we've taken care of the networking ones that check their own
> different capability bit, maybe we can then make the regular
> request_module() do a rate-limited warning for non-CAP_SYS_MODULE uses
> that prints which module it's loading.

Alright, I can start by those.
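
A rough userspace sketch of the rate-limited warning idea follows. It does
not use the kernel's ratelimit helpers; struct ratelimit_sim,
ratelimit_ok() and warn_unpriv_autoload() are hypothetical names, and time
is passed in explicitly so the behaviour is easy to follow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

struct ratelimit_sim {
	long interval;	/* window length, in caller-defined ticks */
	int burst;	/* messages allowed per window */
	long begin;	/* start of the current window */
	int printed;	/* messages emitted in the current window */
};

/* Returns true if one more message may be emitted at time "now". */
static bool ratelimit_ok(struct ratelimit_sim *rs, long now)
{
	if (now - rs->begin >= rs->interval) {
		rs->begin = now;
		rs->printed = 0;
	}
	if (rs->printed >= rs->burst)
		return false;
	rs->printed++;
	return true;
}

/* Warn only for callers without CAP_SYS_MODULE; returns true if a line was printed. */
static bool warn_unpriv_autoload(struct ratelimit_sim *rs, long now,
				 bool has_cap_sys_module, const char *name)
{
	assert(rs != NULL && name != NULL);

	if (has_cap_sys_module || !ratelimit_ok(rs, now))
		return false;

	fprintf(stderr, "request_module: unprivileged autoload of '%s'\n", name);
	return true;
}
```

Privileged callers are checked first, so they do not consume any of the
warning budget.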


> And then just see what people report.
>
> Because maybe it's just a very small handful that matters, and we can
> say "those are ok".

The ones listed in the cover letter may not have the appropriate
context. request_module_dev() can certainly be done, since if you can
open the device node you already have the context.


> And maybe that is too optimistic, and we have a lot of device driver
> ones because people still have a static /dev and don't have udev
> populating modules and device nodes, and then maybe we need to
> introduce a "request_module_dev()" where the rule is that you had to
> at least have privileges to open the device node.
>
> Because I really am *not* interested in these security flags that are
> off by default and then get turned on by special cases. I think it's
> completely unacceptable to say "we're insecure by default but then you
> can do X and be secure". It doesn't work. It doesn't fix anything.

This still leaves all the cases where we don't have the appropriate
context, and other implicit loads that are triggered by another
implicit load, etc.

Also, the simple local flag is easy to grasp and has real users, and we
can build a "load new features" abstraction on top of it. Of course this
does not mean that we should say "we're insecure by default". I want it
to be more of an "allow new features or not" choice for apps and users.
Requiring caps may give users the idea to pass CAP_SYS_MODULE or other
caps for something that used to work; they may start granting them if we
break a lot of use cases, and the caps are much broader and do much more
harm.

OK, so besides updating with request_module_cap(), I will investigate
request_module_dev() and we can see.
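
As a very rough illustration of the proposed request_module_dev() rule
("you had to at least have privileges to open the device node"), the
permission part can be approximated from userspace with access(2). This is
an assumption-laden sketch: may_request_module_dev() is a hypothetical
name, and a real in-kernel check would look at the opener's credentials
rather than call access().

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Returns 0 if the caller has read permission on the node, otherwise a
 * negative errno value. Only the permission bits are consulted, so no
 * driver is touched by the check itself.
 */
static int may_request_module_dev(const char *devnode)
{
	assert(devnode != NULL);

	if (access(devnode, R_OK) != 0)
		return -errno;

	return 0;
}
```

The autoload would then only be attempted when this returns 0.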

Thanks!

>  Linus



-- 
tixxdz


Re: [PATCH v5 next 1/5] modules:capabilities: add request_module_cap()

2017-11-28 Thread Djalal Harouni
Hi Luis,

On Tue, Nov 28, 2017 at 8:14 PM, Luis R. Rodriguez <mcg...@kernel.org> wrote:
> On Mon, Nov 27, 2017 at 06:18:34PM +0100, Djalal Harouni wrote:
> ...
>
>> After a discussion with Rusty Russell [1], the suggestion was to pass
>> the capability from request_module() to security_kernel_module_request()
>> for 'netdev-%s' modules that need CAP_NET_ADMIN, and after review from
>> Kees Cook [2] and experimenting with the code, the patch now does the
>> following:
>>
>> * Adds the request_module_cap() function.
>> * Updates the __request_module() to take the "required_cap" argument
>> with the "prefix".
>
> ...
>
>> Signed-off-by: Djalal Harouni <tix...@gmail.com>
>> ---
>> diff --git a/kernel/kmod.c b/kernel/kmod.c
>> index bc6addd..679d401 100644
>> --- a/kernel/kmod.c
>> +++ b/kernel/kmod.c
>> @@ -139,13 +147,22 @@ int __request_module(bool wait, const char *fmt, ...)
>>  	if (!modprobe_path[0])
>>  		return 0;
>>
>> +	/*
>> +	 * Lets attach the prefix to the module name
>> +	 */
>> +	if (prefix != NULL && *prefix != '\0') {
>> +		len += snprintf(module_name, MODULE_NAME_LEN, "%s-", prefix);
>> +		if (len >= MODULE_NAME_LEN)
>> +			return -ENAMETOOLONG;
>> +	}
>> +
>>  	va_start(args, fmt);
>> -	ret = vsnprintf(module_name, MODULE_NAME_LEN, fmt, args);
>> +	ret = vsnprintf(module_name + len, MODULE_NAME_LEN - len, fmt, args);
>>  	va_end(args);
>> -	if (ret >= MODULE_NAME_LEN)
>> +	if (ret >= MODULE_NAME_LEN - len)
>>  		return -ENAMETOOLONG;
>>
>> -	ret = security_kernel_module_request(module_name);
>> +	ret = security_kernel_module_request(module_name, required_cap, prefix);
>>  	if (ret)
>>  		return ret;
>>
>
> kmod is just a helper to poke userspace to load a module, that's it.
>
> The old init_module() and newer finit_module() do the real handiwork of
> module loading, and both currently only use may_init_module():
>
> static int may_init_module(void)
> {
> 	if (!capable(CAP_SYS_MODULE) || modules_disabled)
> 		return -EPERM;
>
> 	return 0;
> }
>
> This begs the question:
>
>   o If userspace just tries to just use raw finit_module() do we want similar
>     checks?
>
> Otherwise, correct me if I'm wrong this all seems pointless.
>
> If we want something similar I think we might need to be processing aliases
> and check for the aliases for their desired restrictions on finit_module(),
> otherwise userspace can skip through the checks if the module name does not
> match the alias prefix.
>
> To be clear, aliases are completely ignored today on load_module(), so loading
> 'xfs' with finit_module() will just have the kernel know about 'xfs' not
> 'fs-xfs'.
>
> So we currently do not process aliases in kernel.
>
> I have debugging patches to enable us to process them, but they are just for
> debugging and I've been meaning to send them in for review. I designed them
> only for debugging given last time someone suggested for aliases processing to
> be added, the only use case we found was a pre-optimization we decided to
> avoid pursuing. Debugging is a good reason to have alias processing in-kernel
> though.
>
> The pre-optimization we decided to stay away from was to check if the
> requested module via request_module() was already loaded *and* also check if
> the name passed matches any of the existing module aliases for currently
> loaded modules. Today request_module() does not even check if a requested
> module is already loaded, it's a stupid loader, it just goes to userspace, and
> lets userspace figure it out. Userspace in turn could check for aliases, but
> it could lie, or not be up to date to do that.
>
> The pre-optimization is a theoretical gain only then, and if userspace had
> proper alias checking it is arguable that it may perform just as well.
> To help evaluate these sorts of things we now have:
>
> tools/testing/selftests/kmod/kmod.sh
>
> So further patches can use and test impact with it.
>
> Anyway -- so aliasing is currently only a debugging consideration, but without
> processing aliases, all this work seems pointless to me as the real loader is
> in finit_module().

This patchset is about module auto-loading, which is triggered from
multiple paths in the kernel. The cover letter notes all the
differences between the two operations and why the explicit one and
"modules_disabled=1" are already a pain.

finit_module() is covered directly by CAP_SYS_MODULE. As for
aliasing, I am not sure how it is related or how userspace would
maintain it; we do not have a use case for it, we want a simple flag.
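
The prefix handling in the quoted kmod.c hunk can be tried out in userspace
with the sketch below. It follows the same snprintf()/vsnprintf() sequence,
but build_module_name() and SIM_MODULE_NAME_LEN are hypothetical names and
the buffer size is only illustrative.

```c
#include <assert.h>
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Illustrative size; the kernel derives MODULE_NAME_LEN differently. */
#define SIM_MODULE_NAME_LEN 56

/*
 * Writes "<prefix>-<formatted name>" (or just the formatted name when no
 * prefix is given) into buf, which must hold SIM_MODULE_NAME_LEN bytes.
 * Returns 0 on success or -ENAMETOOLONG if the result would not fit.
 */
static int build_module_name(char *buf, const char *prefix, const char *fmt, ...)
{
	va_list args;
	int len = 0;
	int ret;

	assert(buf != NULL && fmt != NULL);

	if (prefix != NULL && *prefix != '\0') {
		len = snprintf(buf, SIM_MODULE_NAME_LEN, "%s-", prefix);
		if (len >= SIM_MODULE_NAME_LEN)
			return -ENAMETOOLONG;
	}

	/* Format the rest right after the prefix, within the remaining space. */
	va_start(args, fmt);
	ret = vsnprintf(buf + len, SIM_MODULE_NAME_LEN - len, fmt, args);
	va_end(args);
	if (ret >= SIM_MODULE_NAME_LEN - len)
		return -ENAMETOOLONG;

	return 0;
}
```

For example, a "netdev" prefix with the format "%s" and the argument "tun0"
yields "netdev-tun0", the string the security hook would then see.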

Thank you!


-- 
tixxdz


Re: [PATCH v5 next 5/5] net: modules: use request_module_cap() to load 'netdev-%s' modules

2017-11-27 Thread Djalal Harouni
Hi Linus,

On Mon, Nov 27, 2017 at 7:44 PM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
> On Mon, Nov 27, 2017 at 9:18 AM, Djalal Harouni <tix...@gmail.com> wrote:
>> This uses the new request_module_cap() facility to directly propagate
>> CAP_NET_ADMIN capability and the 'netdev' module prefix to the
>> capability subsystem as it was suggested.
>
> This is the kind of complexity that I wonder if it's worth it at all.
>
> Nobody sane actually uses those stupid capability bits. Have you ever
> actually seen it used in real life?

Yes, they are complicated even for developers, and normal users do not
understand them. However, every sandbox and container exposes them to
end users directly, and they are documented! So yes, CAP_SYS_MODULE
is exposed, but it does not cover autoloading.

However, we are trying hard to abstract them into semantics that are
easy to grasp: we are turning capabilities and seccomp into abstracted
"yes/no" options for our end users.

Now, if you are referring to kernel code, the networking subsystem is
using them and I don't want to break any assumption here. There is
still request_module(); request_module_cap() was suggested so that
networking code won't have to do the checks on its own later, and
maybe it can be consistent in the long term. The phonet sockets even
need CAP_SYS_ADMIN...


>
> They were a mistake, and we should never have done them - another case
> of security people who think that complexity == security, when in
> reality nobody actually wants the complexity or is willing to set it
> up and manage it.

Alright, but I guess we are stuck. Is there a better way to do this
or describe it?


Please note in these patches, the mode is specifically described as:

* allowed: for backward compatibility  (I would have done without it)
* privileged: which includes capabilities (backward compatibility too)
  or we can add whatever in the future
* disabled: even for privileged.
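
The three modes listed above can be written down as a small decision
function. This is a userspace sketch under assumed names (enum
autoload_mode, autoload_check()); the "privileged" input is a plain
boolean so the example does not depend on how privilege is defined.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical names for the three modes described in the list above. */
enum autoload_mode {
	AUTOLOAD_ALLOWED,	/* anyone may trigger autoloading */
	AUTOLOAD_PRIVILEGED,	/* only privileged callers may */
	AUTOLOAD_DISABLED	/* nobody may, not even privileged callers */
};

/* Returns 0 if the autoload request may proceed, -EPERM otherwise. */
static int autoload_check(enum autoload_mode mode, bool privileged)
{
	assert(mode >= AUTOLOAD_ALLOWED && mode <= AUTOLOAD_DISABLED);

	switch (mode) {
	case AUTOLOAD_ALLOWED:
		return 0;
	case AUTOLOAD_PRIVILEGED:
		return privileged ? 0 : -EPERM;
	case AUTOLOAD_DISABLED:
		return -EPERM;
	}
	return -EINVAL;
}
```

Keeping "privileged" as a boolean leaves room to define it by something
other than capability bits later.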

So I would have preferred something like "yes/no", but...
However, in userspace we will try hard to hide this complexity and the
capability bits.

Now I can see that the code comments and doc refer to privileged with
capabilities a lot. Maybe we can update the doc and code to state
less often that privileged means capabilities? Suggestions?

Thanks!

>Linus


-- 
tixxdz


Re: [PATCH v5 next 1/5] modules:capabilities: add request_module_cap()

2017-11-27 Thread Djalal Harouni
Hi Randy,

On Mon, Nov 27, 2017 at 7:48 PM, Randy Dunlap <rdun...@infradead.org> wrote:
> Hi,
>
> Mostly typos/spellos...
>
>
> On 11/27/2017 09:18 AM, Djalal Harouni wrote:
>> Cc: Serge Hallyn <se...@hallyn.com>
>> Cc: Andy Lutomirski <l...@kernel.org>
>> Suggested-by: Rusty Russell <ru...@rustcorp.com.au>
>> Suggested-by: Kees Cook <keesc...@chromium.org>
>> Signed-off-by: Djalal Harouni <tix...@gmail.com>
>> ---
>>  include/linux/kmod.h  | 65 ++-
>>  include/linux/lsm_hooks.h |  6 -
>>  include/linux/security.h  |  7 +++--
>>  kernel/kmod.c | 29 -
>>  security/security.c   |  6 +++--
>>  security/selinux/hooks.c  |  3 ++-
>>  6 files changed, 97 insertions(+), 19 deletions(-)
>>
>> diff --git a/include/linux/kmod.h b/include/linux/kmod.h
>> index 40c89ad..ccd6a1c 100644
>> --- a/include/linux/kmod.h
>> +++ b/include/linux/kmod.h
>> @@ -33,16 +33,67 @@
>
>> +/**
>> + * request_module  Try to load a kernel module
>> + *
>> + * Automatically loads the request module.
>> + *
>> + * @mod...: The module name
>> + */
>
> what are the "..." for?  what do they do here?

Ok, will fix it.

>
>> +#define request_module(mod...) __request_module(true, -1, NULL, mod)
>> +
>> +#define request_module_nowait(mod...) __request_module(false, -1, NULL, mod)
>> +
>> +/**
>> + * request_module_cap  Load kernel module only if the required capability is set
>> + *
[...]
>
>
> --
> ~Randy

Thank you very much for the review, will fix all.
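
For reference, the "mod..." spelling in the quoted macros is the GNU C
named variadic parameter form, where "mod" stands for every argument
passed. The userspace sketch below compares it with the ISO C99
__VA_ARGS__ form; fake_request_module(), req_mod_gnu(), req_mod_c99() and
last_name are hypothetical names, and the first macro assumes a compiler
that accepts this GNU extension (GCC or Clang in their default modes).

```c
#include <assert.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Records the last formatted module name so the expansion can be inspected. */
static char last_name[64];

/* Stand-in with the argument order used by the quoted macros. */
static int fake_request_module(bool wait, int required_cap, const char *prefix,
			       const char *fmt, ...)
{
	va_list args;

	(void)wait;
	(void)required_cap;
	(void)prefix;
	assert(fmt != NULL);

	va_start(args, fmt);
	vsnprintf(last_name, sizeof(last_name), fmt, args);
	va_end(args);
	return 0;
}

/* GNU C named variadic parameter: "mod" expands to all arguments passed. */
#define req_mod_gnu(mod...) fake_request_module(true, -1, NULL, mod)

/* ISO C99 spelling of the same macro. */
#define req_mod_c99(...) fake_request_module(true, -1, NULL, __VA_ARGS__)
```

Both macros forward the format string and its arguments unchanged, so
req_mod_gnu("netdev-%s", "tun0") and req_mod_c99("netdev-%s", "tun0")
expand to the same call.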


-- 
tixxdz


Re: [PATCH v5 next 0/5] Improve Module autoloading infrastructure

2017-11-27 Thread Djalal Harouni
Hi Linus,

On Mon, Nov 27, 2017 at 8:12 PM, Linus Torvalds wrote:
> On Mon, Nov 27, 2017 at 11:02 AM, Linus Torvalds wrote:
>>
>> Now, the above will not necessarily work with a legacy /dev/ directory
>> where all the nodes have been pre-populated, and opening the device
>> node is supposed to load the module. So _historically_ we did indeed
>> load modules as normal users. But does that really happen any more?
>
> Sadly, it looks like bluetoothd actually does expect to load the
> bt-proto-XYZ modules with no capabilities at all.
>
> So apparently we really do depend on not needing capabilities for
> module loading.
>
> Oh well.

Yes, DCCP is unprivileged, as are tun and all tunneling, some md drivers,
some crypto, and device drivers... fs modules can be loaded inside user
namespaces, and maybe also when some request requires external symbols...

However, tunneling helps to solve real use cases, so that's why there is
backward compatibility and opt-in.

I do perfectly understand that opt-in is not the best choice; however,
this patchset includes a per-process-tree flag, and given that a lot of
code is running in containers and sandboxes, it is better than nothing. I
will follow up later with patches to the major ones, especially when we
force the flag by default. Ubuntu was said to be owned in a past security
contest due to this kind of thing, and now since they have Ubuntu snaps or
apps they can set the flag, and others will follow.

Thanks!

>  Linus



-- 
tixxdz


[PATCH v5 next 3/5] modules:capabilities: automatic module loading restriction

2017-11-27 Thread Djalal Harouni
The code path can be triggered by unprivileged users, using the trigger.c
program for the DCCP use-after-free [2], which was fixed by
commit 5edabca9d4cff7f "dccp: fix freeing skb too early for IPV6_RECVPKTINFO".

Before:
$ lsmod | grep dccp
$ strace ./dccp_trigger
...
socket(AF_INET6, SOCK_DCCP, IPPROTO_IP) = 3
...
$ lsmod | grep dccp
dccp_ipv6  24576  5
dccp_ipv4  24576  5 dccp_ipv6
dccp  102400  2 dccp_ipv6,dccp_ipv4

After:
Only privileged:
$ lsmod | grep dccp
$ strace ./dccp_trigger
...
socket(AF_INET6, SOCK_DCCP, IPPROTO_IP) = -1 ESOCKTNOSUPPORT (Socket type not supported)
...
$ lsmod | grep dccp
$ dmesg
...
[  175.945063] module: automatic module loading of net-pf-10-proto-0-type-6 by "dccp_trigger"[1390] was denied
[  175.947952] module: automatic module loading of net-pf-10-proto-0 by "dccp_trigger"[1390] was denied
[  175.956061] module: automatic module loading of net-pf-10-proto-0-type-6 by "dccp_trigger"[1390] was denied
[  175.959733] module: automatic module loading of net-pf-10-proto-0 by "dccp_trigger"[1390] was denied

$ sudo strace ./dccp_trigger
...
socket(AF_INET6, SOCK_DCCP, IPPROTO_IP) = 3
...
$ lsmod | grep dccp
dccp_ipv6  24576  6
dccp_ipv4  24576  5 dccp_ipv6
dccp  102400  2 dccp_ipv6,dccp_ipv4

Disable automatic module loading:
$ lsmod | grep dccp
$ su - root
...
socket(AF_INET6, SOCK_DCCP, IPPROTO_IP) = -1 ESOCKTNOSUPPORT (Socket type not supported)
...
$ lsmod | grep dccp
$ dmesg
...
[  126.596545] module: automatic module loading of net-pf-10-proto-0-type-6 by "dccp_trigger"[1291] was denied
[  126.598800] module: automatic module loading of net-pf-10-proto-0 by "dccp_trigger"[1291] was denied
[  126.601264] module: automatic module loading of net-pf-10-proto-0-type-6 by "dccp_trigger"[1291] was denied
[  126.602839] module: automatic module loading of net-pf-10-proto-0 by "dccp_trigger"[1291] was denied

As an example, this blocks abuse: DCCP can still be explicitly loaded by
an administrator using modprobe, while at the same time automatic module
loading is disabled forever.
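
The alias names in the dmesg lines above follow from the socket() arguments:
on Linux AF_INET6 is 10, IPPROTO_IP is 0 and SOCK_DCCP is 6. A minimal
userspace sketch of that mapping follows; format_net_alias() is a
hypothetical helper written for illustration and is not part of this patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (illustration only, not from the patch): build the
 * alias strings seen in the dmesg output. A negative type produces the
 * shorter "net-pf-<family>-proto-<protocol>" form.
 * Returns what snprintf() returns.
 */
static int format_net_alias(char *buf, size_t len, int family, int protocol,
                            int type)
{
    if (type >= 0)
        return snprintf(buf, len, "net-pf-%d-proto-%d-type-%d",
                        family, protocol, type);
    return snprintf(buf, len, "net-pf-%d-proto-%d", family, protocol);
}
```

Calling it with (10, 0, 6) gives the first alias reported as denied, and
with (10, 0, -1) the second one.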

[1] http://www.openwall.com/lists/oss-security/2017/02/22/3
[2] https://github.com/xairy/kernel-exploits/tree/master/CVE-2017-6074
[3] http://www.openwall.com/lists/oss-security/2017/03/29/2
[4] http://www.openwall.com/lists/oss-security/2017/03/07/6
[5] https://a13xp0p0v.github.io/2017/03/24/CVE-2017-2636.html

Cc: Rusty Russell 
Cc: James Morris 
Cc: Serge Hallyn 
Cc: Ben Hutchings 
Cc: Solar Designer 
Cc: Andy Lutomirski 
Suggested-by: Kees Cook 
Signed-off-by: Djalal Harouni 
---
 Documentation/sysctl/kernel.txt | 54 +++
 include/linux/module.h  | 11 +-
 kernel/module.c | 81 -
 kernel/sysctl.c | 28 ++
 4 files changed, 172 insertions(+), 2 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 694968c..dc44075 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -43,6 +43,7 @@ show up in /proc/sys/kernel:
 - l2cr[ PPC only ]
 - modprobe==> Documentation/debugging-modules.txt
 - modules_disabled
+- modules_autoload_mode
 - msg_next_id[ sysv ipc ]
 - msgmax
 - msgmnb
@@ -413,6 +414,59 @@ to false.  Generally used with the "kexec_load_disabled" toggle.
 
 ==
 
+modules_autoload_mode:
+
+A sysctl to control whether the module auto-load feature is allowed.
+This sysctl complements "modules_disabled": that flag covers all module
+operations, whereas this one applies only to automatic module loading.
+Automatic module loading happens when programs request a kernel
+feature that is implemented by an unloaded module; the kernel then
+automatically runs the program pointed to by the "modprobe" sysctl in
+order to load the corresponding module.
+
+Historically, the kernel was always able to automatically load modules
+if they were not blacklisted. This is one of the most important and
+transparent operations of Linux: it provides numerous other features as
+they are needed, which is crucial for a better user experience.
+However, as Linux is now popular and used for different appliances, some
+of these may need to control such operations. For such systems, recent
+needs showed that in some cases being able to control automatic module
+loading is as important as the operation itself. Restricting unprivileged
+programs or attackers that abuse this feature to load unused modules or
+modules that contain bugs is a significant security measure.
+
+The three modes that "modules_autoload_mode" supports make it possible
+to restrict automatic module loading without breaking the user
+experience.
+
+When modu

[PATCH v5 next 4/5] modules:capabilities: add a per-task modules auto-load mode

2017-11-27 Thread Djalal Harouni
"[1873] was denied
[ 5154.222731] module: automatic module loading of net-pf-10-proto-0 by "dccp_trigger"[1873] was denied

As shown, this blocks automatic module loading per task. This makes it
possible to provide a usable system where only some sandboxed apps or
containers are restricted from triggering automatic module loading; other
parts of the system can continue to use the feature as it is, which is the
case for desktop and user-friendly machines.
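
A supervisor could check the per-task mode through the ModulesAutoloadMode
field that this patch adds to /proc/PID/status. The sketch below is a
userspace illustration only: parse_modules_autoload_mode() is a hypothetical
helper, and the field exists only on a kernel carrying this series.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (illustration only): given the text of
 * /proc/PID/status, return the numeric value of the
 * "ModulesAutoloadMode:" line, or -1 if the field is absent
 * (for example on a kernel without this patch).
 */
static long parse_modules_autoload_mode(const char *status)
{
    static const char key[] = "ModulesAutoloadMode:";
    const char *p = status;

    while (p && *p) {
        if (strncmp(p, key, sizeof(key) - 1) == 0)
            return strtol(p + sizeof(key) - 1, NULL, 10);
        p = strchr(p, '\n');   /* move to the start of the next line */
        if (p)
            p++;
    }
    return -1;
}
```

The caller is expected to read the status file into a NUL-terminated buffer
first; the meaning of each numeric mode is defined by the patch's own
documentation, not by this helper.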

[1] http://www.openwall.com/lists/oss-security/2017/02/22/3
[2] https://github.com/xairy/kernel-exploits/tree/master/CVE-2017-6074
[3] http://www.openwall.com/lists/oss-security/2017/03/29/2
[4] http://www.openwall.com/lists/oss-security/2017/03/07/6
[5] https://a13xp0p0v.github.io/2017/03/24/CVE-2017-2636.html

Cc: Ben Hutchings 
Cc: Rusty Russell 
Cc: James Morris 
Cc: Serge Hallyn 
Cc: Solar Designer 
Cc: Andy Lutomirski 
Cc: Kees Cook 
Signed-off-by: Djalal Harouni 
---
 Documentation/filesystems/proc.txt |   3 +
 Documentation/userspace-api/index.rst  |   1 +
 .../userspace-api/modules_autoload_mode.rst| 116 +
 fs/proc/array.c|   6 ++
 include/linux/init_task.h  |   8 ++
 include/linux/module.h |  20 
 include/linux/sched.h  |   5 +
 include/uapi/linux/prctl.h |   8 ++
 kernel/module.c|  83 ---
 security/commoncap.c   |  36 +++
 10 files changed, 270 insertions(+), 16 deletions(-)
 create mode 100644 Documentation/userspace-api/modules_autoload_mode.rst

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 2a84bb3..1974cb6 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -195,6 +195,7 @@ read the file /proc/PID/status:
   CapBnd: 
   NoNewPrivs: 0
   Seccomp:0
+  ModulesAutoloadMode:0
   voluntary_ctxt_switches:0
   nonvoluntary_ctxt_switches: 1
 
@@ -269,6 +270,8 @@ Table 1-2: Contents of the status files (as of 4.8)
  CapBnd  bitmap of capabilities bounding set
  NoNewPrivs  no_new_privs, like prctl(PR_GET_NO_NEW_PRIV, ...)
  Seccomp seccomp mode, like prctl(PR_GET_SECCOMP, ...)
+ ModulesAutoloadMode modules auto-load mode, like
+ prctl(PR_GET_MODULES_AUTOLOAD_MODE, ...)
  Cpus_allowedmask of CPUs on which this process may run
  Cpus_allowed_list   Same as previous, but in "list format"
  Mems_allowedmask of memory nodes allowed to this process
diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index 7b2eb1b..bfd51b7 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -17,6 +17,7 @@ place where this information is gathered.
:maxdepth: 2
 
no_new_privs
+   modules_autoload_mode
seccomp_filter
unshare
 
diff --git a/Documentation/userspace-api/modules_autoload_mode.rst b/Documentation/userspace-api/modules_autoload_mode.rst
new file mode 100644
index 000..1153c35
--- /dev/null
+++ b/Documentation/userspace-api/modules_autoload_mode.rst
@@ -0,0 +1,116 @@
+======================================
+Per-task module auto-load restrictions
+======================================
+
+
+Introduction
+============
+
+Usually a request for a kernel feature that is implemented by a module
+that is not loaded may trigger the automatic module loading feature,
+which transparently satisfies userspace and provides numerous other
+features as they are needed. In this case an implicit kernel module load
+operation happens.
+
+In most cases, loading or unloading a kernel module is an explicit
+operation, and programs are required to have the ``CAP_SYS_MODULE``
+capability to do so. However, with implicit module loading, no
+capabilities are required, or only ``CAP_NET_ADMIN`` in the rare cases
+where the module has the 'netdev-%s' alias. Historically this was always
+the case, as automatic module loading is one of the most important and
+transparent operations of Linux: users expect that their programs just
+work. Yet recent cases showed that this can be abused by unprivileged
+users or attackers to load modules that were not updated, or modules
+that contain bugs and vulnerabilities.
+
+Currently most Linux code is in the form of modules; hence, being able
+to control automatic module loading is in some cases as important as the
+operation itself, especially in contexts where Linux is used in
+different appliances.
+
+Restricting automatic module loading gives administrators the
+appropriate time to update or deny module autoloading in advance. In a
+container or sandbox world where apps can be moved from one context to

[PATCH v5 next 1/5] modules:capabilities: add request_module_cap()

2017-11-27 Thread Djalal Harouni
This is a preparation patch to improve the module auto-load
infrastructure.

We need this patch to have more control over module auto-load operations.
By default the operation is allowed unless the end user or the calling code
requests further permission checks.

With this change, subsystems will be able to decide whether the module
auto-load feature must first do a capability check, loading the module if
the permission check succeeds and denying the operation otherwise.

As an example, "netdev-%s" modules are allowed to be loaded if
CAP_NET_ADMIN is set. Therefore, in order not to break this assumption,
and to allow userspace to load "netdev-%s" modules with CAP_NET_ADMIN,
we have added:

request_module_cap(required_cap, prefix, fmt...)

This new function will take:
'@required_cap': Required capability to load the module
'@prefix': The module prefix if any, otherwise NULL
'@fmt': printf style format string for the name of the module with its
arguments if any

ex:
request_module_cap(CAP_NET_ADMIN, "netdev", "%s", mod);
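
To make the call above concrete, here is a userspace model of how the three
arguments could combine into a module alias. It is a sketch, not the kernel
code: the model_* names are hypothetical, and joining prefix and name with
"-" is an assumption taken from the 'netdev-%s' alias, since the
kernel/kmod.c hunk is not shown here.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

static char last_alias[64];    /* alias the model would pass to modprobe */
static int last_required_cap;  /* capability recorded for the security hook */

/* Userspace stand-in for the patched __request_module(); it only builds
 * the alias and remembers the capability, it loads nothing. */
static int model_request_module(bool wait, int required_cap,
                                const char *prefix, const char *fmt, ...)
{
    va_list args;
    size_t off = 0;
    int len;

    (void)wait;
    last_required_cap = required_cap;
    if (prefix) {
        /* Assumption: the prefix is joined to the name with '-'. */
        len = snprintf(last_alias, sizeof(last_alias), "%s-", prefix);
        if (len < 0 || (size_t)len >= sizeof(last_alias))
            return -1;
        off = (size_t)len;
    }
    va_start(args, fmt);
    len = vsnprintf(last_alias + off, sizeof(last_alias) - off, fmt, args);
    va_end(args);
    if (len < 0 || (size_t)len >= sizeof(last_alias) - off)
        return -1;             /* name does not fit in the buffer */
    return 0;
}

/* Model of request_module(): no capability (-1) and no prefix. */
#define model_request_module_plain(...) \
    model_request_module(true, -1, NULL, __VA_ARGS__)
/* Model of request_module_cap(required_cap, prefix, fmt...). */
#define model_request_module_cap(cap, prefix, ...) \
    model_request_module(true, (cap), (prefix), __VA_ARGS__)
```

With this model, model_request_module_cap(12, "netdev", "%s", "tun0") leaves
"netdev-tun0" in last_alias and 12 (the value of CAP_NET_ADMIN) in
last_required_cap.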

After a discussion with Rusty Russell [1], the suggestion was to pass
the capability from request_module() to security_kernel_module_request()
for 'netdev-%s' modules that need CAP_NET_ADMIN. After review from
Kees Cook [2] and experimenting with the code, the patch now does the
following:

* Adds the request_module_cap() function.
* Updates __request_module() to take the "required_cap" argument
  together with the "prefix".

This patch also updates SELinux, which is currently the only user of
security_kernel_module_request(); the security hook now accepts
'required_cap' and 'prefix' as arguments.

Based on patch by Rusty Russell and discussion with Kees Cook:
[1] https://lkml.org/lkml/2017/4/26/735
[2] https://lkml.org/lkml/2017/5/23/775

Cc: Serge Hallyn 
Cc: Andy Lutomirski 
Suggested-by: Rusty Russell 
Suggested-by: Kees Cook 
Signed-off-by: Djalal Harouni 
---
 include/linux/kmod.h  | 65 ++-
 include/linux/lsm_hooks.h |  6 -
 include/linux/security.h  |  7 +++--
 kernel/kmod.c | 29 -
 security/security.c   |  6 +++--
 security/selinux/hooks.c  |  3 ++-
 6 files changed, 97 insertions(+), 19 deletions(-)

diff --git a/include/linux/kmod.h b/include/linux/kmod.h
index 40c89ad..ccd6a1c 100644
--- a/include/linux/kmod.h
+++ b/include/linux/kmod.h
@@ -33,16 +33,67 @@
 extern char modprobe_path[]; /* for sysctl */
 /* modprobe exit status on success, -ve on error.  Return value
  * usually useless though. */
-extern __printf(2, 3)
-int __request_module(bool wait, const char *name, ...);
-#define request_module(mod...) __request_module(true, mod)
-#define request_module_nowait(mod...) __request_module(false, mod)
+extern __printf(4, 5)
+int __request_module(bool wait, int required_cap,
+const char *prefix, const char *name, ...);
 #define try_then_request_module(x, mod...) \
-   ((x) ?: (__request_module(true, mod), (x)))
+   ((x) ?: (__request_module(true, -1, NULL, mod), (x)))
 #else
-static inline int request_module(const char *name, ...) { return -ENOSYS; }
-static inline int request_module_nowait(const char *name, ...) { return -ENOSYS; }
+static inline __printf(4, 5)
+int __request_module(bool wait, int required_cap,
+const char *prefix, const char *name, ...)
+{ return -ENOSYS; }
 #define try_then_request_module(x, mod...) (x)
 #endif
 
+/**
+ * request_module  Try to load a kernel module
+ *
+ * Automatically loads the request module.
+ *
+ * @mod...: The module name
+ */
+#define request_module(mod...) __request_module(true, -1, NULL, mod)
+
+#define request_module_nowait(mod...) __request_module(false, -1, NULL, mod)
+
+/**
+ * request_module_cap  Load kernel module only if the required capability is set
+ *
+ * Automatically load a module if the required capability is set and it
+ * corresponds to the appropriate subsystem that is indicated by prefix.
+ *
+ * This allows loading aliased modules like 'netdev-%s' with CAP_NET_ADMIN.
+ *
+ * ex:
+ * request_module_cap(CAP_NET_ADMIN, "netdev", "%s", mod);
+ *
+ * @required_cap: Required capability to load the module
+ * @prefix: The module prefix if any, otherwise NULL
+ * @fmt: printf style format string for the name of the module with its
+ *   arguments if any
+ *
+ * If '@required_cap' is positive, the security subsystem will check if
+ * '@prefix' is set and if caller has the required capability then the
+ * operation is allowed.
+ * The security subsystem cannot make assumptions about the boundaries
+ * of other subsystems; it is their responsibility to make a call with
+ * the right capability and module alias.
+ *
+ * If '@required_cap' is positive and '@prefix' is NULL then we assume
+ * that the '@required_cap' is CAP_SYS_MODULE.
+ *
+ * If '@required_cap' is negative t

[PATCH v5 next 5/5] net: modules: use request_module_cap() to load 'netdev-%s' modules

2017-11-27 Thread Djalal Harouni
This uses the new request_module_cap() facility to propagate the
CAP_NET_ADMIN capability and the 'netdev' module prefix directly to the
capability subsystem, as was suggested.

We do not remove the explicit capable(CAP_NET_ADMIN) check here, but we
may remove it in future versions since the capability subsystem performs
it as well. This gives a better interface: other subsystems just use
this call and let the capability subsystem handle the permission checks
and decide whether the module should be loaded.

This is also an infrastructure fix. Historically Linux always allowed
modules to be auto-loaded without privileges. Later the net code started
to check capabilities and prefixes, pairing the CAP_NET_ADMIN check with
the 'netdev' prefix so that the capability could not be abused to load
non-netdev modules. Looking at the bigger picture, however, we want to
keep supporting automatic module loading by unprivileged users while
also making simple policies easy to implement, like:

User=djalal
DenyNewFeatures=no

This translates to allowing the interactive user djalal to load extra
Linux features. Others, such as volatile accounts or guests, can easily
be blocked from doing so. Previous patches introduced the necessary
infrastructure; with this change we start to use the new
request_module_cap() function to explicitly tell the capability
subsystem that we want to auto-load modules with CAP_NET_ADMIN if they
are prefixed.

This is also based on suggestions from Rusty Russell and Kees Cook [1]

[1] https://lkml.org/lkml/2017/4/26/735

Cc: Ben Hutchings <ben.hutchi...@codethink.co.uk>
Cc: James Morris <james.l.mor...@oracle.com>
Cc: Serge Hallyn <se...@hallyn.com>
Cc: Solar Designer <so...@openwall.com>
Cc: Andy Lutomirski <l...@kernel.org>
Suggested-by: Rusty Russell <ru...@rustcorp.com.au>
Suggested-by: Kees Cook <keesc...@chromium.org>
Signed-off-by: Djalal Harouni <tix...@gmail.com>
---
 net/core/dev_ioctl.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/core/dev_ioctl.c b/net/core/dev_ioctl.c
index 7e690d0..fdd8560 100644
--- a/net/core/dev_ioctl.c
+++ b/net/core/dev_ioctl.c
@@ -382,8 +382,10 @@ void dev_load(struct net *net, const char *name)
 	rcu_read_unlock();
 
 	no_module = !dev;
+	/* "netdev-%s" modules are allowed if CAP_NET_ADMIN is set */
 	if (no_module && capable(CAP_NET_ADMIN))
-		no_module = request_module("netdev-%s", name);
+		no_module = request_module_cap(CAP_NET_ADMIN, "netdev",
+					       "%s", name);
 	if (no_module && capable(CAP_SYS_MODULE))
 		request_module("%s", name);
 }
-- 
2.7.4
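
Note on the hunk above: a minimal userspace sketch of which alias dev_load() would request first after this patch. It assumes that request_module_cap() joins the prefix and the name with a '-', which is what the replaced "netdev-%s" format implies. build_alias() and dev_load_first_request() are hypothetical names used only for illustration, not kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Capability numbers as defined in include/uapi/linux/capability.h. */
#define SKETCH_CAP_NET_ADMIN  12
#define SKETCH_CAP_SYS_MODULE 16

/*
 * Build the module alias that ends up being requested.
 * Assumption: a non-NULL prefix is joined to the name with '-',
 * matching the "netdev-%s" format that the patch replaces.
 * Returns 0 on success, -1 if the alias does not fit in buf.
 */
static int build_alias(char *buf, size_t len, const char *prefix,
		       const char *name)
{
	int n;

	assert(buf != NULL && name != NULL);
	if (prefix)
		n = snprintf(buf, len, "%s-%s", prefix, name);
	else
		n = snprintf(buf, len, "%s", name);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Model of the order used in dev_load(): the prefixed alias is tried
 * when the caller has CAP_NET_ADMIN, otherwise the bare name is tried
 * when the caller has CAP_SYS_MODULE.
 * Writes the first alias that would be requested into buf and returns
 * the capability that permitted it, or -1 if nothing would be requested.
 */
int dev_load_first_request(char *buf, size_t len, const char *name,
			   bool has_net_admin, bool has_sys_module)
{
	if (has_net_admin && build_alias(buf, len, "netdev", name) == 0)
		return SKETCH_CAP_NET_ADMIN;
	if (has_sys_module && build_alias(buf, len, NULL, name) == 0)
		return SKETCH_CAP_SYS_MODULE;
	return -1;
}
```

The sketch only covers the first request. In the real function a failed "netdev-" request still falls through to the CAP_SYS_MODULE branch.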



[PATCH v5 next 2/5] modules:capabilities: add cap_kernel_module_request() permission check

2017-11-27 Thread Djalal Harouni
This is a preparation patch to improve the module auto-load
infrastructure.

With this change, subsystems that want to autoload modules and that
implement on-site capability checks can defer those checks to the
capability subsystem by passing the required capability together with
the appropriate module alias. The capability subsystem trusts callers
about the passed values and performs a capability check to either allow
or deny module auto-loading.

Changes in this patch:
* Adds the cap_kernel_module_request() capability hook.
* Adds an empty may_autoload_module() that will be updated in the next
  patch.

Cc: James Morris <james.l.mor...@oracle.com>
Cc: Serge Hallyn <se...@hallyn.com>
Cc: Andy Lutomirski <l...@kernel.org>
Cc: Ben Hutchings <ben.hutchi...@codethink.co.uk>
Suggested-by: Rusty Russell <ru...@rustcorp.com.au>
Suggested-by: Kees Cook <keesc...@chromium.org>
Signed-off-by: Djalal Harouni <tix...@gmail.com>
---
 include/linux/module.h   | 10 ++
 include/linux/security.h |  4 +++-
 kernel/module.c  | 23 +++
 security/commoncap.c | 26 ++
 4 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/include/linux/module.h b/include/linux/module.h
index c69b49a..5cbb239 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -497,6 +497,10 @@ bool __is_module_percpu_address(unsigned long addr, unsigned long *can_addr);
 bool is_module_percpu_address(unsigned long addr);
 bool is_module_text_address(unsigned long addr);
 
+/* Determine whether a module auto-load operation is permitted. */
+int may_autoload_module(char *kmod_name, int required_cap,
+   const char *kmod_prefix);
+
 static inline bool within_module_core(unsigned long addr,
  const struct module *mod)
 {
@@ -643,6 +647,12 @@ bool is_module_sig_enforced(void);
 
 #else /* !CONFIG_MODULES... */
 
+static inline int may_autoload_module(char *kmod_name, int required_cap,
+ const char *kmod_prefix)
+{
+   return -ENOSYS;
+}
+
 static inline struct module *__module_address(unsigned long addr)
 {
return NULL;
diff --git a/include/linux/security.h b/include/linux/security.h
index 41e700a..9bb53b5 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -102,6 +102,8 @@ extern int cap_task_setscheduler(struct task_struct *p);
 extern int cap_task_setioprio(struct task_struct *p, int ioprio);
 extern int cap_task_setnice(struct task_struct *p, int nice);
 extern int cap_vm_enough_memory(struct mm_struct *mm, long pages);
+extern int cap_kernel_module_request(char *kmod_name, int required_cap,
+const char *kmod_prefix);
 
 struct msghdr;
 struct sk_buff;
@@ -924,7 +926,7 @@ static inline int security_kernel_module_request(char *kmod_name,
 int required_cap,
 const char *prefix)
 {
-   return 0;
+   return cap_kernel_module_request(kmod_name, required_cap, prefix);
 }
 
 static inline int security_kernel_read_file(struct file *file,
diff --git a/kernel/module.c b/kernel/module.c
index f0411a2..3380d39 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -4340,6 +4340,29 @@ struct module *__module_text_address(unsigned long addr)
 }
 EXPORT_SYMBOL_GPL(__module_text_address);
 
+/**
+ * may_autoload_module - Determine whether a module auto-load operation
+ * is permitted
+ * @kmod_name: The module name
+ * @required_cap: if positive, may allow to auto-load the module if this
+ * capability is set
+ * @kmod_prefix: The module prefix if any, otherwise NULL
+ *
+ * Determine whether a module auto-load operation is allowed or not.
+ *
+ * This gives more control over automatic module loading and aligns it
+ * with explicit load/unload module operations. The kernel contains several
+ * modules, some of which are not updated often and may contain bugs and
+ * vulnerabilities.
+ *
+ * Returns 0 if the module request is allowed or -EPERM if not.
+ */
+int may_autoload_module(char *kmod_name, int required_cap,
+   const char *kmod_prefix)
+{
+   return 0;
+}
+
 /* Don't grab lock, we're oopsing. */
 void print_modules(void)
 {
diff --git a/security/commoncap.c b/security/commoncap.c
index 4f8e093..236e573 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -1340,6 +1340,31 @@ int cap_mmap_file(struct file *file, unsigned long reqprot,
return 0;
 }
 
+/**
+ * cap_kernel_module_request - Determine whether a module auto-load is permitted
+ * @kmod_name: The module name
+ * @required_cap: if positive, may allow to auto-load the module if this
+ * capability is set
+ * @kmod_prefix: the module prefix if any, otherwise NULL
+ *
+ * Determine whether a module should be automatically loaded.
+ * Returns 0 if the module request should be allowed, -EPERM if not.
+ */
+int cap_kernel_module_request(char *kmod_name, int required_cap,
+ const char *kmod_prefix)
+{
+   int ret;
+   char comm[sizeof(current

[PATCH v5 next 0/5] Improve Module autoloading infrastructure

2017-11-27 Thread Djalal Harouni
| grep dccp
$ ./pr_set_no_new_privs
$ grep NoNewPrivs /proc/self/status
NoNewPrivs: 1
$ ./pr_modules_autoload_mode_test 1
$ grep Modules /proc/self/status
ModulesAutoloadMode:1
$ strace ./dccp_trigger
...
socket(AF_INET6, SOCK_DCCP, IPPROTO_IP) = -1 ESOCKTNOSUPPORT (Socket type not supported)
...
$ lsmod | grep dccp
$ dmesg
...
[ 4662.171994] module: automatic module loading of net-pf-10-proto-0-type-6 by "dccp_trigger"[1759] was denied
[ 4662.177284] module: automatic module loading of net-pf-10-proto-0 by "dccp_trigger"[1759] was denied
[ 4662.180181] module: automatic module loading of net-pf-10-proto-0-type-6 by "dccp_trigger"[1759] was denied
[ 4662.181709] module: automatic module loading of net-pf-10-proto-0 by "dccp_trigger"[1759] was denied


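Note: the dccp_trigger test program itself is not included in this message. A minimal sketch of what such a trigger could look like, built only from the socket() call visible in the strace output, is given here. try_dccp_socket() is a hypothetical name.

```c
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SOCK_DCCP
#define SOCK_DCCP 6	/* value of SOCK_DCCP in the Linux socket ABI */
#endif

/*
 * Issue the socket() call seen in the strace output. If DCCP support is
 * not already loaded, the kernel tries to auto-load it, which is where
 * the "net-pf-10-proto-0-type-6" alias in dmesg comes from (AF_INET6 is
 * 10 and SOCK_DCCP is 6 on Linux).
 * Returns 0 if the socket could be created, or the errno value otherwise.
 */
int try_dccp_socket(void)
{
	int fd = socket(AF_INET6, SOCK_DCCP, IPPROTO_IP);

	if (fd < 0)
		return errno;
	close(fd);
	return 0;
}
```

When auto-loading is denied and DCCP is not available, the call is expected to fail with ESOCKTNOSUPPORT, as in the log above.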
Now set the task "modules_autoload_mode" to 2, disabled mode.
$ lsmod | grep dccp
$ grep Modules /proc/self/status
ModulesAutoloadMode:0
$ su - root
 # ./pr_modules_autoload_mode_test 2
 # grep Modules /proc/self/status
ModulesAutoloadMode:2
 # strace ./dccp_trigger

...
socket(AF_INET6, SOCK_DCCP, IPPROTO_IP) = -1 ESOCKTNOSUPPORT (Socket type not supported)
...
...
[ 5154.218740] module: automatic module loading of net-pf-10-proto-0-type-6 by "dccp_trigger"[1873] was denied
[ 5154.219828] module: automatic module loading of net-pf-10-proto-0 by "dccp_trigger"[1873] was denied
[ 5154.221814] module: automatic module loading of net-pf-10-proto-0-type-6 by "dccp_trigger"[1873] was denied
[ 5154.222731] module: automatic module loading of net-pf-10-proto-0 by "dccp_trigger"[1873] was denied


As shown, this blocks automatic module loading per task. The system
stays usable: only selected sandboxed apps or containers are restricted
from triggering automatic module loading, while the rest of the system
keeps using the feature as it is, which is what desktop and
user-friendly machines need.
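
Note: the two test runs can be summarised as a small decision function. This is a userspace sketch, not the code from the series. The enum names are hypothetical, and the rule that mode 1 lets a suitably capable task through is an assumption; the logs only show an unprivileged task being denied in mode 1 and root being denied in mode 2.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical names for the three values of "modules_autoload_mode". */
enum autoload_mode {
	AUTOLOAD_ALLOWED    = 0,	/* default: auto-loading works as before */
	AUTOLOAD_PRIVILEGED = 1,	/* assumed: only capable tasks may trigger it */
	AUTOLOAD_DISABLED   = 2,	/* nobody may trigger it, not even root */
};

/*
 * Decide whether a task may trigger an automatic module load.
 * has_cap says whether the task holds the capability required for the
 * requested alias. Returns 0 if allowed, -EPERM if denied.
 */
int autoload_permitted(enum autoload_mode mode, bool has_cap)
{
	switch (mode) {
	case AUTOLOAD_ALLOWED:
		return 0;
	case AUTOLOAD_PRIVILEGED:
		return has_cap ? 0 : -EPERM;
	case AUTOLOAD_DISABLED:
	default:
		return -EPERM;
	}
}
```

The first run above corresponds to autoload_permitted(AUTOLOAD_PRIVILEGED, false) and the second to autoload_permitted(AUTOLOAD_DISABLED, true); both are denied.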

[1] http://www.openwall.com/lists/oss-security/2017/02/22/3
[2] https://github.com/xairy/kernel-exploits/tree/master/CVE-2017-6074
[3] http://www.openwall.com/lists/oss-security/2017/03/29/2
[4] http://www.openwall.com/lists/oss-security/2017/03/07/6 
[5] https://a13xp0p0v.github.io/2017/03/24/CVE-2017-2636.html


Finally, we already have use cases for the prctl() interface: to enforce
this on some systemd services, in Docker and other containers, and in
some sandboxes.


# Changes since v4:
*) Removed the property that once the "modules_autoload_mode" sysctl is
   set to "2" (disabled mode), the value is pinned and cannot be
   reverted. The value can now be changed back by anyone with the
   appropriate privileges, as was suggested.

   Suggested-by: Solar Designer <so...@openwall.com>
   Suggested-by: Andy Lutomirski <l...@kernel.org>
   https://lkml.org/lkml/2017/5/22/330

*) Added request_module_cap() to take '@required_cap' and '@prefix'
   arguments that will be used to check if module autoloading is allowed
   or not.

   Suggested-by: Kees Cook <keesc...@chromium.org>

*) More cleanups and documentation.


# Changes since v3:
*) Renamed the sysctl from "modules_autoload" to "modules_autoload_mode"
   and the prctl() operation flag to "PR_{SET|GET}_MODULES_AUTOLOAD_MODE"
   as it was requested.

   Suggested-by: Ben Hutchings <ben.hutchi...@codethink.co.uk>


*) Updated __request_module() to take the capability that may allow
   auto-loading a module with the appropriate alias. This way we never
   parse aliases, as was requested by Rusty Russell. Security and
   SELinux hooks were updated too.

   Suggested-by: Rusty Russell <ru...@rustcorp.com.au>
   https://lkml.org/lkml/2017/4/24/7


*) Updated the code so that, to set
   prctl(PR_SET_MODULES_AUTOLOAD_MODE, 1, 0, 0, 0), the task must first
   call prctl(PR_SET_NO_NEW_PRIVS, 1) or run with CAP_SYS_ADMIN
   privileges in its namespace. If neither holds, -EACCES is returned.

   Suggested-by: Andy Lutomirski <l...@amacapital.net>
   https://lkml.org/lkml/2017/4/22/22


*) Remove task initialization logic and other cleanups
   Suggested-by: Kees Cook <keesc...@chromium.org>


*) Other code and documentation cleanups.
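
Note: the no_new_privs precondition from the v3 changes can be exercised from userspace with the existing prctl() interface. The sketch below only sets and reads back no_new_privs. It does not call PR_SET_MODULES_AUTOLOAD_MODE, because that operation exists only with this series applied and its numeric value is not shown in this message. enable_no_new_privs() is a hypothetical helper name.

```c
#include <errno.h>
#include <sys/prctl.h>

/*
 * Set no_new_privs for the calling thread and read it back.
 * Returns 1 if the flag is set afterwards, or a negative errno value
 * if either prctl() call fails. The flag cannot be cleared again once
 * it has been set.
 */
int enable_no_new_privs(void)
{
	int set;

	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
		return -errno;

	set = prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
	return set < 0 ? -errno : set;
}
```

After a successful call, "grep NoNewPrivs /proc/self/status" in the same process reports 1, as in the test log earlier in this message.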
   

# Changes since v2:
*) Implemented as a core kernel feature inside capabilities subsystem
*) Renamed sysctl to "modules_autoload" to align with "modules_disabled"

   Suggested-by: Kees Cook <keesc...@chromium.org>

*) Improved documentation.
*) Removed unused code.


# Changes since v1:
*) Renamed module to ModAutoRestrict
*) Improved documentation to explicitly refer to module autoloading.
*) Switched to use the new task_security_alloc() hook.
*) Switched from rhash tables to use task->security since it is in
   linux-security/next branch now.
*) Check all parameters passed to prctl() syscall.
*) Many other bug fixes and documentation improvements.



Patches (5) Djalal Harouni:
 (1/5) modules:capabilities: add request_module_cap()
 (2/5) modules:capabilities: add cap_kernel_module_request() permission check
 (3/5) modules:capabilities: automatic module loading res

Re: [PATCH RFC v3 2/7] proc: move /proc/{self|thread-self} dentries to proc_fs_info

2017-11-10 Thread Djalal Harouni
On Fri, Nov 10, 2017 at 11:31 AM, Alexey Dobriyan <adobri...@gmail.com> wrote:
> On 11/9/17, Djalal Harouni <tix...@gmail.com> wrote:
>
>>  struct proc_fs_info {
>>   struct pid_namespace *pid_ns;
>> + struct dentry *proc_self; /* For /proc/self/ */
>> + struct dentry *proc_thread_self; /* For /proc/thread-self/ */
>
> These are redundant comments.

I can remove them, but actually the area is so difficult and
uncommented that I won't mind extra comments...

Thanks!

-- 
tixxdz


Re: [PATCH RFC v3 3/7] proc: add helpers to set and get proc hidepid and gid mount options

2017-11-10 Thread Djalal Harouni
On Fri, Nov 10, 2017 at 11:36 AM, Alexey Dobriyan <adobri...@gmail.com> wrote:
> On 11/9/17, Djalal Harouni <tix...@gmail.com> wrote:
>> --- a/fs/proc/base.c
>> +++ b/fs/proc/base.c
>
>> -static bool has_pid_permissions(struct pid_namespace *pid,
>> +static bool has_pid_permissions(struct proc_fs_info *fs_info,
>
> More "const".
>
>> diff --git a/fs/proc/inode.c b/fs/proc/inode.c
>> index 9abc370..bdd808d 100644
>> --- a/fs/proc/inode.c
>> +++ b/fs/proc/inode.c
>> @@ -476,11 +476,12 @@ struct inode *proc_get_inode(struct super_block *sb,
>> struct proc_dir_entry *de)
>>  int proc_fill_super(struct super_block *s, void *data, int silent)
>>  {
>>   struct proc_fs_info *fs_info = proc_sb(s);
>> - struct pid_namespace *ns = get_pid_ns(fs_info->pid_ns);
>>   struct inode *root_inode;
>>   int ret;
>>
>> - if (!proc_parse_options(data, ns))
>> + get_pid_ns(fs_info->pid_ns);
>> +
>> + if (!proc_parse_options(data, fs_info))
>>   return -EINVAL;
>>
>>   /* User space would break if executables or devices appear on proc */
>> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
>> index 4a67188..10bc7be 100644
>> --- a/fs/proc/internal.h
>> +++ b/fs/proc/internal.h
>> @@ -240,7 +240,7 @@ static inline void proc_tty_init(void) {}
>>   * root.c
>>   */
>>  extern struct proc_dir_entry proc_root;
>> -extern int proc_parse_options(char *options, struct pid_namespace *pid);
>> +extern int proc_parse_options(char *options, struct proc_fs_info
>> *fs_info);
>
> "extern" can be dropped if you're touching prototype anyway.
>
>
>
>> +static inline int proc_fs_hide_pid(struct proc_fs_info *fs_info)
>> +{
>> + return fs_info->pid_ns->hide_pid;
>> +}
>> +
>> +static inline kgid_t proc_fs_pid_gid(struct proc_fs_info *fs_info)
>> +{
>> + return fs_info->pid_ns->pid_gid;
>> +}
>
> More "const".
>
>> @@ -59,6 +81,24 @@ static inline void proc_flush_task(struct task_struct
>> *task)
>>  {
>>  }
>>
>> +static inline void proc_fs_set_hide_pid(struct proc_fs_info *fs_info, int
>> hide_pid)
>> +{
>> +}
>> +
>> +static inline void proc_fs_set_pid_gid(struct proc_info_fs *fs_info, kgid_t
>> gid)
>> +{
>> +}
>> +
>> +static inline int proc_fs_hide_pid(struct proc_fs_info *fs_info)
>> +{
>> + return 0;
>> +}
>> +
>> +extern kgid_t proc_fs_pid_gid(struct proc_fs_info *fs_info)
>
> ehh?

Ouch, copy/paste. I will compile it without proc support.

Will fix "const" and other comments too, thank you!


-- 
tixxdz
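
Note: a userspace sketch of what the getters could look like with the requested "const" applied. The struct definitions are cut-down stand-ins for the kernel types and contain only the fields used here; the assert() calls exist only to make the sketch safe to run on its own and are not something the kernel helpers would carry.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel types; only the fields used here. */
typedef struct { unsigned int val; } kgid_t;

struct pid_namespace {
	int hide_pid;
	kgid_t pid_gid;
};

struct proc_fs_info {
	struct pid_namespace *pid_ns;
};

/* The getters only read through fs_info, so the pointer can be const. */
static inline int proc_fs_hide_pid(const struct proc_fs_info *fs_info)
{
	assert(fs_info != NULL && fs_info->pid_ns != NULL);
	return fs_info->pid_ns->hide_pid;
}

static inline kgid_t proc_fs_pid_gid(const struct proc_fs_info *fs_info)
{
	assert(fs_info != NULL && fs_info->pid_ns != NULL);
	return fs_info->pid_ns->pid_gid;
}
```

The stub that drew the "ehh?" would likewise need "static inline" and a body like its neighbours; what it should return without proc support is not stated in this thread.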


Re: [PATCH RFC v3 6/7] proc: support new 'pids=all|ptraceable' mount option

2017-11-10 Thread Djalal Harouni
On Fri, Nov 10, 2017 at 3:38 AM, Andy Lutomirski <l...@kernel.org> wrote:
> On Thu, Nov 9, 2017 at 8:14 AM, Djalal Harouni <tix...@gmail.com> wrote:
>> This patch introduces the new 'pids' mount option, as it was discussed
>> and suggested by Andy Lutomirski [1].
>>
>> * If 'pids=' is passed without 'newinstance' then it has no effect.
>
> Would it be safer this were an error instead?

Hm, I tend to agree, but I also have in mind your comment that
"newinstance" should later become the default, so that users won't have
to pass it explicitly. What do you think?

-- 
tixxdz
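
Note: a userspace sketch of the stricter behaviour raised in the review, where 'pids=' without 'newinstance' is rejected instead of silently ignored. This is not the parser from the patch. check_proc_options() and token_is() are hypothetical names, and the accepted values are taken from the subject line ('all' and 'ptraceable').

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* True if the len-byte token starting at tok equals word. */
static bool token_is(const char *tok, size_t len, const char *word)
{
	return strlen(word) == len && strncmp(tok, word, len) == 0;
}

/*
 * Check a comma-separated procfs option string.
 * Returns 0 if the combination is acceptable, -EINVAL if 'pids=' has an
 * unknown value or is given without 'newinstance'.
 */
int check_proc_options(const char *options)
{
	bool newinstance = false, pids = false;
	const char *p = options;

	while (*p) {
		size_t len = strcspn(p, ",");	/* length of this option */

		if (token_is(p, len, "newinstance")) {
			newinstance = true;
		} else if (len >= 5 && strncmp(p, "pids=", 5) == 0) {
			if (!token_is(p + 5, len - 5, "all") &&
			    !token_is(p + 5, len - 5, "ptraceable"))
				return -EINVAL;
			pids = true;
		}
		/* Other options (hidepid=, gid=, ...) are ignored here. */

		p += len;
		if (*p == ',')
			p++;
	}

	return (pids && !newinstance) ? -EINVAL : 0;
}
```

If "newinstance" later becomes the default, as mentioned in the reply, the final check would simply go away.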


Re: [PATCH RFC v3 4/7] proc: support mounting private procfs instances inside same pid namespace

2017-11-10 Thread Djalal Harouni
On Fri, Nov 10, 2017 at 3:53 AM, James Morris <james.l.mor...@oracle.com> wrote:
> On Thu, 9 Nov 2017, Djalal Harouni wrote:
>
>> This should allow later after real testing to have a smooth transition
>> to a procfs with default private instances.
>>
>> [1] https://lists.linuxfoundation.org/pipermail/ksummit-discuss/2017-January/004215.html
>> [2] http://www.openwall.com/lists/kernel-hardening/2017/10/05/5
>> [3] https://lwn.net/Articles/689539/
>> [4] http://lxr.free-electrons.com/source/Documentation/filesystems/devpts.txt?v=3.14
>> [5] https://lkml.org/lkml/2017/5/2/407
>> [6] https://lkml.org/lkml/2017/5/3/357
>>
>> Cc: Kees Cook <keesc...@chromium.org>
>> Cc: Greg Kroah-Hartman <gre...@linuxfoundation.org>
>> Suggested-by: Andy Lutomirski <l...@kernel.org>
>> Signed-off-by: Alexey Gladkov <gladkov.ale...@gmail.com>
>> Signed-off-by: Djalal Harouni <tix...@gmail.com>
>
>
> Reviewed-by: James Morris <james.l.mor...@oracle.com>

Thank you James!


-- 
tixxdz


Re: [PATCH RFC v3 1/7] proc: add proc_fs_info struct to store proc information

2017-11-10 Thread Djalal Harouni
On Fri, Nov 10, 2017 at 11:26 AM, Alexey Dobriyan <adobri...@gmail.com> wrote:
> On 11/9/17, Djalal Harouni <tix...@gmail.com> wrote:
>
>> +struct proc_fs_info {
>> + struct pid_namespace *pid_ns;
>> +};
>
>> +static inline struct proc_fs_info *proc_sb(struct super_block *sb)
>> +{
>> + return sb->s_fs_info;
>> +}
>
> Can you rename this to "struct proc_super_block *" then?
> That "info" suffix all over filesystems doesn't add much info itself
> just more typing.
> Ditto for "fs_info" identifiers.

Ok, will do.

>> +extern inline struct proc_fs_info *proc_sb(struct super_block *sb)
>> { return NULL;}
>
> extern inline?

Oops, sorry, I will fix it and try compiling without proc.

Thank you!

-- 
tixxdz


[PATCH RFC v3 1/7] proc: add proc_fs_info struct to store proc information

2017-11-09 Thread Djalal Harouni
This is a preparation patch that adds proc_fs_info to handle procfs
internal information. Right now procfs internal information is stored
inside the pid namespace, which makes it hard to change or modernize
procfs without affecting pid namespaces. Furthermore, this is blocking
all kinds of changes that are needed to solve today's or future Linux
challenges, as noted by various maintainers and userspace needs:

"Here's another one: split up and modernize /proc." by Andy Lutomirski [1]

Discussion about kernel pointer leaks:
"And yes, as Kees and Daniel mentioned, it's definitely not just dmesg.
In fact, the primary things tend to be /proc and /sys, not dmesg
itself." By Linus Torvalds [2]

procfs is an important Linux API that offers features using filesystem
syscalls, hence let's handle it as a real filesystem with its own
private information and avoid mixing it with PID namespaces, since it is
more than a PID namespace after all. This will later allow supporting
separate instances, each one with its own superblock, which will solve
a lot of problems.

Other Linux interfaces like devpts were also updated to support
containers, sandboxes and multiple private instances [3]. Time to update
procfs also.

Patch changes:
* Adds the proc_fs_info struct to store procfs mount information.
* Updates proc_mount() to handle mounts there directly.
* Updates all callers that now need to access the proc_fs_info struct.
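
The indirection can be sketched in userspace as follows. The struct definitions are reduced stand-ins for the kernel ones, and proc_pid_ns_of() is a hypothetical helper added only to show the two-step lookup that callers perform after this patch:

```c
#include <stddef.h>

/* Reduced stand-ins for kernel structures, for illustration only. */
struct pid_namespace {
	unsigned int level;
};

struct proc_fs_info {
	struct pid_namespace *pid_ns;
};

struct super_block {
	void *s_fs_info;
};

/* The superblock private data is now a proc_fs_info, not a pid_namespace. */
static inline struct proc_fs_info *proc_sb(struct super_block *sb)
{
	return sb->s_fs_info;
}

/* Hypothetical helper: what callers do to reach the pid namespace. */
static inline struct pid_namespace *proc_pid_ns_of(struct super_block *sb)
{
	return proc_sb(sb)->pid_ns;
}
```

Code that previously cast sb->s_fs_info straight to a pid namespace pointer now goes through proc_sb() first, which leaves room for per-mount fields next to pid_ns.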

[1] https://lists.linuxfoundation.org/pipermail/ksummit-discuss/2017-January/004215.html
[2] http://www.openwall.com/lists/kernel-hardening/2017/10/05/5
[3] http://lxr.free-electrons.com/source/Documentation/filesystems/devpts.txt?v=3.14

Cc: Kees Cook 
Cc: Greg Kroah-Hartman 
Suggested-by: Andy Lutomirski 
Signed-off-by: Alexey Gladkov 
Signed-off-by: Djalal Harouni 
---
 fs/locks.c  |  6 +++--
 fs/proc/base.c  | 30 +++--
 fs/proc/inode.c |  8 +++---
 fs/proc/root.c  | 69 ++---
 fs/proc/self.c  |  8 +++---
 fs/proc/thread_self.c   |  6 +++--
 fs/proc_namespace.c | 14 +-
 include/linux/proc_fs.h | 10 +++
 8 files changed, 117 insertions(+), 34 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 21b4dfa..6d5c473 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -2624,7 +2624,8 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,
 {
struct inode *inode = NULL;
unsigned int fl_pid;
-   struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
+   struct proc_fs_info *fs_info = proc_sb(file_inode(f->file)->i_sb);
+   struct pid_namespace *proc_pidns = fs_info->pid_ns;
 
fl_pid = locks_translate_pid(fl, proc_pidns);
/*
@@ -2704,7 +2705,8 @@ static int locks_show(struct seq_file *f, void *v)
 {
struct locks_iterator *iter = f->private;
struct file_lock *fl, *bfl;
-   struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
+   struct proc_fs_info *fs_info = proc_sb(file_inode(f->file)->i_sb);
+   struct pid_namespace *proc_pidns = fs_info->pid_ns;
 
fl = hlist_entry(v, struct file_lock, fl_link);
 
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 31934cb..5fc2006 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -696,7 +696,8 @@ static bool has_pid_permissions(struct pid_namespace *pid,
 
 static int proc_pid_permission(struct inode *inode, int mask)
 {
-   struct pid_namespace *pid = inode->i_sb->s_fs_info;
+   struct proc_fs_info *fs_info = proc_sb(inode->i_sb);
+   struct pid_namespace *pid = fs_info->pid_ns;
struct task_struct *task;
bool has_perms;
 
@@ -731,12 +732,12 @@ static const struct inode_operations proc_def_inode_operations = {
 static int proc_single_show(struct seq_file *m, void *v)
 {
struct inode *inode = m->private;
-   struct pid_namespace *ns;
struct pid *pid;
struct task_struct *task;
int ret;
 
-   ns = inode->i_sb->s_fs_info;
+   struct proc_fs_info *fs_info = proc_sb(inode->i_sb);
+   struct pid_namespace *ns = fs_info->pid_ns;
pid = proc_pid(inode);
task = get_pid_task(pid, PIDTYPE_PID);
if (!task)
@@ -1774,9 +1775,10 @@ struct inode *proc_pid_make_inode(struct super_block * sb,
 int pid_getattr(const struct path *path, struct kstat *stat,
u32 request_mask, unsigned int query_flags)
 {
-   struct inode *inode = d_inode(path->dentry);
struct task_struct *task;
-   struct pid_namespace *pid = path->dentry->d_sb->s_fs_info;
+   struct inode *inode = d_inode(path->dentry);
+   struct proc_fs_info *fs_info = proc_sb(inode->i_sb);
+   struct pid_namespace *pid = fs_info->pid_ns;
 
generic_fillattr(inode, stat);
 
@@ -2291,6 +2293,7 @@ static const struct seq_operations proc_timers_seq_ops = 

[PATCH RFC v3 2/7] proc: move /proc/{self|thread-self} dentries to proc_fs_info

2017-11-09 Thread Djalal Harouni
This is a preparation patch that moves /proc/{self|thread-self} dentries
to be stored inside procfs proc_fs_info struct instead of making them
per PID namespace. Since we want to support multiple procfs instances we
need to make sure that these dentries are also per-superblock instead of
per-pidns, and we want to make sure that unmounting a private procfs
won't clash with other procfs mounts.
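
A toy userspace model of why the dentries need to be per-superblock: two procfs instances share one pid namespace, and tearing one down must not touch the other's /proc/self and /proc/thread-self entries. The structs are reduced stand-ins, dput() here is a trivial refcount decrement rather than the kernel function, and proc_instance_teardown() is a hypothetical name loosely modelled on proc_kill_sb():

```c
#include <stddef.h>

/* Reduced stand-ins for kernel structures, for illustration only. */
struct dentry {
	int refcount;
};

struct pid_namespace {
	int refcount;
};

struct proc_fs_info {
	struct pid_namespace *pid_ns;
	struct dentry *proc_self;        /* for /proc/self */
	struct dentry *proc_thread_self; /* for /proc/thread-self */
};

/* Toy version of dput(): drop one reference if the dentry exists. */
static void dput(struct dentry *d)
{
	if (d)
		d->refcount--;
}

/* Releases only what belongs to this instance, plus its pidns reference. */
static void proc_instance_teardown(struct proc_fs_info *fs_info)
{
	dput(fs_info->proc_self);
	dput(fs_info->proc_thread_self);
	fs_info->proc_self = NULL;
	fs_info->proc_thread_self = NULL;
	fs_info->pid_ns->refcount--;
}
```

With the dentries stored in the pid namespace instead, a second mount would overwrite the pointers of the first one, and unmounting either would drop references that the other still relies on.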

Cc: Kees Cook 
Cc: Greg Kroah-Hartman 
Suggested-by: Andy Lutomirski 
Signed-off-by: Alexey Gladkov 
Signed-off-by: Djalal Harouni 
---
 fs/proc/base.c| 4 ++--
 fs/proc/root.c| 8 
 fs/proc/self.c| 3 +--
 fs/proc/thread_self.c | 5 ++---
 include/linux/pid_namespace.h | 4 +---
 include/linux/proc_fs.h   | 2 ++
 6 files changed, 12 insertions(+), 14 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 5fc2006..0d9b4214 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3214,13 +3214,13 @@ int proc_pid_readdir(struct file *file, struct dir_context *ctx)
return 0;
 
if (pos == TGID_OFFSET - 2) {
-   struct inode *inode = d_inode(ns->proc_self);
+   struct inode *inode = d_inode(fs_info->proc_self);
if (!dir_emit(ctx, "self", 4, inode->i_ino, DT_LNK))
return 0;
ctx->pos = pos = pos + 1;
}
if (pos == TGID_OFFSET - 1) {
-   struct inode *inode = d_inode(ns->proc_thread_self);
+   struct inode *inode = d_inode(fs_info->proc_thread_self);
if (!dir_emit(ctx, "thread-self", 11, inode->i_ino, DT_LNK))
return 0;
ctx->pos = pos = pos + 1;
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 43e2639..b225ae5 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -166,10 +166,10 @@ static void proc_kill_sb(struct super_block *sb)
struct proc_fs_info *fs_info = proc_sb(sb);
struct pid_namespace *ns = fs_info->pid_ns;
 
-   if (ns->proc_self)
-   dput(ns->proc_self);
-   if (ns->proc_thread_self)
-   dput(ns->proc_thread_self);
+   if (fs_info->proc_self)
+   dput(fs_info->proc_self);
+   if (fs_info->proc_thread_self)
+   dput(fs_info->proc_thread_self);
kill_anon_super(sb);
put_pid_ns(ns);
kfree(fs_info);
diff --git a/fs/proc/self.c b/fs/proc/self.c
index f773301..8a67cf0 100644
--- a/fs/proc/self.c
+++ b/fs/proc/self.c
@@ -37,7 +37,6 @@ int proc_setup_self(struct super_block *s)
 {
struct inode *root_inode = d_inode(s->s_root);
struct proc_fs_info *fs_info = proc_sb(s);
-   struct pid_namespace *ns = fs_info->pid_ns;
struct dentry *self;
 
inode_lock(root_inode);
@@ -64,7 +63,7 @@ int proc_setup_self(struct super_block *s)
pr_err("proc_fill_super: can't allocate /proc/self\n");
return PTR_ERR(self);
}
-   ns->proc_self = self;
+   fs_info->proc_self = self;
return 0;
 }
 
diff --git a/fs/proc/thread_self.c b/fs/proc/thread_self.c
index 578887b..6e3225f 100644
--- a/fs/proc/thread_self.c
+++ b/fs/proc/thread_self.c
@@ -37,7 +37,6 @@ static unsigned thread_self_inum;
 int proc_setup_thread_self(struct super_block *s)
 {
struct proc_fs_info *fs_info = proc_sb(s);
-   struct pid_namespace *ns = fs_info->pid_ns;
struct inode *root_inode = d_inode(s->s_root);
struct dentry *thread_self;
 
@@ -62,10 +61,10 @@ int proc_setup_thread_self(struct super_block *s)
}
inode_unlock(root_inode);
if (IS_ERR(thread_self)) {
-   pr_err("proc_fill_super: can't allocate /proc/thread_self\n");
+   pr_err("proc_fill_super: can't allocate /proc/thread-self\n");
return PTR_ERR(thread_self);
}
-   ns->proc_thread_self = thread_self;
+   fs_info->proc_thread_self = thread_self;
return 0;
 }
 
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 49538b1..f91a8bf 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -31,9 +31,7 @@ struct pid_namespace {
unsigned int level;
struct pid_namespace *parent;
 #ifdef CONFIG_PROC_FS
-   struct vfsmount *proc_mnt;
-   struct dentry *proc_self;
-   struct dentry *proc_thread_self;
+   struct vfsmount *proc_mnt; /* Internal proc mounted during each new pidns */
 #endif
 #ifdef CONFIG_BSD_PROCESS_ACCT
struct fs_pin *bacct;
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 9a3f6e9..8f89069 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -10,6 +10,8 @@
 
 struct proc_fs_info {
struct pid_namespace *pid_ns;
+   struct dentry *proc_self; /* For /proc/self/ */
+   struct dentry *proc_thread_self; /* For /proc/thread-self/ */
 };
 
 struct proc_dir_entry;
-- 
2.7.4



[PATCH RFC v3 3/7] proc: add helpers to set and get proc hidepid and gid mount options

2017-11-09 Thread Djalal Harouni
This is a cleanup patch that adds helpers to set and get proc mount
options instead of accessing them directly. This makes it easier later
to track how these fields are accessed and to update them if that
becomes necessary in the future.

Later we will move these mount options to the proc_fs_info struct.
First, let's fix the access.
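
A userspace sketch of the accessor idea. At this point in the series the options still live in the pid namespace, so the helper bodies below reach through fs_info->pid_ns; that body is an assumption made for illustration, since callers only see the helper signatures. kgid_t and the structs are reduced stand-ins for the kernel definitions:

```c
/* Reduced stand-ins for kernel types, for illustration only. */
typedef struct { unsigned int val; } kgid_t;

struct pid_namespace {
	int hide_pid;
	kgid_t pid_gid;
};

struct proc_fs_info {
	struct pid_namespace *pid_ns;
};

/* Setters: callers no longer write pid_ns fields directly. */
static inline void proc_fs_set_hide_pid(struct proc_fs_info *fs_info, int hide_pid)
{
	fs_info->pid_ns->hide_pid = hide_pid; /* assumed storage location */
}

static inline void proc_fs_set_pid_gid(struct proc_fs_info *fs_info, kgid_t gid)
{
	fs_info->pid_ns->pid_gid = gid; /* assumed storage location */
}

/* Getters: the only places that know where the options are stored. */
static inline int proc_fs_hide_pid(struct proc_fs_info *fs_info)
{
	return fs_info->pid_ns->hide_pid;
}

static inline kgid_t proc_fs_pid_gid(struct proc_fs_info *fs_info)
{
	return fs_info->pid_ns->pid_gid;
}
```

Once the fields move into proc_fs_info, only these four bodies have to change and every caller stays as it is.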

Cc: Kees Cook 
Cc: Greg Kroah-Hartman 
Suggested-by: Andy Lutomirski 
Signed-off-by: Alexey Gladkov 
Signed-off-by: Djalal Harouni 
---
 fs/proc/base.c  | 16 +---
 fs/proc/inode.c |  5 +++--
 fs/proc/internal.h  |  2 +-
 fs/proc/root.c  | 15 ++-
 include/linux/proc_fs.h | 44 ++--
 5 files changed, 65 insertions(+), 17 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 0d9b4214..f324c49 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -682,13 +682,16 @@ int proc_setattr(struct dentry *dentry, struct iattr *attr)
  * May current process learn task's sched/cmdline info (for hide_pid_min=1)
  * or euid/egid (for hide_pid_min=2)?
  */
-static bool has_pid_permissions(struct pid_namespace *pid,
+static bool has_pid_permissions(struct proc_fs_info *fs_info,
 struct task_struct *task,
 int hide_pid_min)
 {
-   if (pid->hide_pid < hide_pid_min)
+   int hide_pid = proc_fs_hide_pid(fs_info);
+   kgid_t gid = proc_fs_pid_gid(fs_info);
+
+   if (hide_pid < hide_pid_min)
return true;
-   if (in_group_p(pid->pid_gid))
+   if (in_group_p(gid))
return true;
return ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
 }
@@ -704,7 +707,7 @@ static int proc_pid_permission(struct inode *inode, int mask)
task = get_proc_task(inode);
if (!task)
return -ESRCH;
-   has_perms = has_pid_permissions(pid, task, HIDEPID_NO_ACCESS);
+   has_perms = has_pid_permissions(fs_info, task, HIDEPID_NO_ACCESS);
put_task_struct(task);
 
if (!has_perms) {
@@ -1778,7 +1781,6 @@ int pid_getattr(const struct path *path, struct kstat *stat,
struct task_struct *task;
struct inode *inode = d_inode(path->dentry);
struct proc_fs_info *fs_info = proc_sb(inode->i_sb);
-   struct pid_namespace *pid = fs_info->pid_ns;
 
generic_fillattr(inode, stat);
 
@@ -1787,7 +1789,7 @@ int pid_getattr(const struct path *path, struct kstat *stat,
stat->gid = GLOBAL_ROOT_GID;
task = pid_task(proc_pid(inode), PIDTYPE_PID);
if (task) {
-   if (!has_pid_permissions(pid, task, HIDEPID_INVISIBLE)) {
+   if (!has_pid_permissions(fs_info, task, HIDEPID_INVISIBLE)) {
rcu_read_unlock();
/*
 * This doesn't prevent learning whether PID exists,
@@ -3234,7 +3236,7 @@ int proc_pid_readdir(struct file *file, struct dir_context *ctx)
int len;
 
cond_resched();
-   if (!has_pid_permissions(ns, iter.task, HIDEPID_INVISIBLE))
+   if (!has_pid_permissions(fs_info, iter.task, HIDEPID_INVISIBLE))
continue;
 
len = snprintf(name, sizeof(name), "%d", iter.tgid);
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 9abc370..bdd808d 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -476,11 +476,12 @@ struct inode *proc_get_inode(struct super_block *sb, struct proc_dir_entry *de)
 int proc_fill_super(struct super_block *s, void *data, int silent)
 {
struct proc_fs_info *fs_info = proc_sb(s);
-   struct pid_namespace *ns = get_pid_ns(fs_info->pid_ns);
struct inode *root_inode;
int ret;
 
-   if (!proc_parse_options(data, ns))
+   get_pid_ns(fs_info->pid_ns);
+
+   if (!proc_parse_options(data, fs_info))
return -EINVAL;
 
/* User space would break if executables or devices appear on proc */
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 4a67188..10bc7be 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -240,7 +240,7 @@ static inline void proc_tty_init(void) {}
  * root.c
  */
 extern struct proc_dir_entry proc_root;
-extern int proc_parse_options(char *options, struct pid_namespace *pid);
+extern int proc_parse_options(char *options, struct proc_fs_info *fs_info);
 
 extern void proc_self_init(void);
 extern int proc_remount(struct super_block *, int *, char *);
diff --git a/fs/proc/root.c b/fs/proc/root.c
index b225ae5..48cc481 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -37,11 +37,12 @@ static const match_table_t tokens = {
{Opt_err, NULL},
 };
 
-int proc_parse_options(char *options, struct pid_namespace *pid)
+int proc_parse_options(char *options, struct proc_fs_info *fs_info)
 {
char *p;
substring_t args[MAX_OPT_ARGS];
int option;
+   kgid_t gid;
 
if (!opt

[PATCH RFC v3 5/7] proc: move hidepid definitions to proc files

2017-11-09 Thread Djalal Harouni
This moves the 'hidepid' definitions to the proc files. 'hidepid' is a
proc mount option, not really a per-PID-namespace value. It lived there
because it was used inside PID namespaces; now that we have improved the
proc logic and reduced the complexity and the ties with PID namespaces,
let's move this last bit to where it really belongs.
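
For readers unfamiliar with the three levels, here is a userspace sketch of how they feed the visibility check. The enum values are the ones moved by this patch; struct access_ctx is a hypothetical stand-in that replaces in_group_p() and ptrace_may_access() with plain booleans so the logic can be followed outside the kernel:

```c
#include <stdbool.h>

enum { /* definitions for 'hidepid' mount option */
	HIDEPID_OFF       = 0,
	HIDEPID_NO_ACCESS = 1,
	HIDEPID_INVISIBLE = 2,
};

/* Hypothetical stand-in for the kernel state consulted by the real check. */
struct access_ctx {
	int hide_pid;    /* value of the hidepid= mount option */
	bool in_group;   /* caller is in the gid= group */
	bool may_ptrace; /* caller could ptrace-read the target task */
};

/* Mirrors the shape of has_pid_permissions() in fs/proc/base.c. */
static bool has_pid_permissions(const struct access_ctx *ctx, int hide_pid_min)
{
	if (ctx->hide_pid < hide_pid_min)
		return true;
	if (ctx->in_group)
		return true;
	return ctx->may_ptrace;
}
```

With hidepid=1, a caller without group membership or ptrace rights fails the HIDEPID_NO_ACCESS check but still passes the HIDEPID_INVISIBLE one, so the PID directories remain listed while their contents are protected.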

Cc: Kees Cook <keesc...@chromium.org>
Cc: Greg Kroah-Hartman <gre...@linuxfoundation.org>
Cc: Andy Lutomirski <l...@kernel.org>
Signed-off-by: Alexey Gladkov <gladkov.ale...@gmail.com>
Signed-off-by: Djalal Harouni <tix...@gmail.com>
---
 include/linux/pid_namespace.h | 6 --
 include/linux/proc_fs.h   | 6 ++
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 786ea04..66f47f1 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -15,12 +15,6 @@
 
 struct fs_pin;
 
-enum { /* definitions for pid_namespace's hide_pid field */
-   HIDEPID_OFF   = 0,
-   HIDEPID_NO_ACCESS = 1,
-   HIDEPID_INVISIBLE = 2,
-};
-
 struct pid_namespace {
struct kref kref;
struct idr idr;
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 408b51d..c123e5ec 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -12,6 +12,12 @@
 struct proc_dir_entry;
 struct pid_namespace;
 
+enum { /* definitions for 'hidepid' mount option */
+   HIDEPID_OFF   = 0,
+   HIDEPID_NO_ACCESS = 1,
+   HIDEPID_INVISIBLE = 2,
+};
+
 struct proc_fs_info {
struct pid_namespace *pid_ns;
struct dentry *proc_self; /* For /proc/self/ */
-- 
2.7.4



[PATCH RFC v3 6/7] proc: support new 'pids=all|ptraceable' mount option

2017-11-09 Thread Djalal Harouni
This patch introduces the new 'pids' mount option, as it was discussed
and suggested by Andy Lutomirski [1].

* If 'pids=' is passed without 'newinstance' then it has no effect.

* If 'newinstance,pids=all' then all processes will be shown in proc.

* If 'newinstance,pids=ptraceable' then only ptraceable processes will be
shown.

* 'pids=' takes precedence over 'hidepid=' since 'hidepid=' can be
  ignored if "gid=" was set and the caller has that gid in its groups.
  We want to guarantee that LSMs have a security path there that cannot
  be disabled with "gid=".

This makes it possible to support lightweight sandboxes in Embedded Linux.

Later, the Yama LSM can be updated to check that processes are only
able to see their children inside /proc/, allowing tighter use cases
to be supported.

[1] https://lkml.org/lkml/2017/4/26/646
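
The option rules above can be modelled in userspace as a rough sketch. This
is not the kernel code: struct mount_opts, model_has_pid_permissions() and the
numeric values of PIDS_ALL and PIDS_PTRACEABLE are assumptions made for
illustration, and the may_ptrace argument stands in for the result of
ptrace_may_access().

```c
#include <stdbool.h>

/* HIDEPID_* values are taken from the series; PIDS_* values are assumed. */
enum { HIDEPID_OFF = 0, HIDEPID_NO_ACCESS = 1, HIDEPID_INVISIBLE = 2 };
enum { PIDS_ALL = 0, PIDS_PTRACEABLE = 1 };

/* Hypothetical stand-in for the per-mount options kept in proc_fs_info. */
struct mount_opts {
	int pids;           /* PIDS_ALL or PIDS_PTRACEABLE */
	int hide_pid;       /* one of HIDEPID_* */
	bool caller_in_gid; /* models in_group_p(gid) for the caller */
};

/*
 * Models the decision made by has_pid_permissions() in this patch:
 * 'hidepid' and 'gid' are only consulted when pids=all, otherwise the
 * ptrace check alone decides.
 */
static bool model_has_pid_permissions(const struct mount_opts *o,
				      int hide_pid_min, bool may_ptrace)
{
	if (o->pids == PIDS_ALL) {
		if (o->hide_pid < hide_pid_min)
			return true;
		if (o->caller_in_gid)
			return true;
	}
	return may_ptrace;
}
```

With pids=ptraceable, a caller that is a member of the "gid=" group but
cannot ptrace the target is still refused, which is the guarantee described
in the last bullet.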

Cc: Kees Cook <keesc...@chromium.org>
Cc: Greg Kroah-Hartman <gre...@linuxfoundation.org>
Suggested-by: Andy Lutomirski <l...@kernel.org>
Signed-off-by: Alexey Gladkov <gladkov.ale...@gmail.com>
Signed-off-by: Djalal Harouni <tix...@gmail.com>
---
 fs/proc/base.c  | 36 +---
 fs/proc/inode.c |  6 +-
 fs/proc/root.c  | 20 ++--
 include/linux/proc_fs.h | 30 ++
 4 files changed, 82 insertions(+), 10 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 54b527c..88b92bc 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -686,13 +686,24 @@ static bool has_pid_permissions(struct proc_fs_info *fs_info,
 struct task_struct *task,
 int hide_pid_min)
 {
-   int hide_pid = proc_fs_hide_pid(fs_info);
-   kgid_t gid = proc_fs_pid_gid(fs_info);
+   int pids = proc_fs_pids(fs_info);
+
+   /*
+* If 'pids=all' or if it was not set then lets fallback
+* to 'hidepid' and 'gid', if those are not enforced too, then
+* ptrace checks are skipped. Otherwise ptrace permission is
+* required for all other cases.
+*/
+   if (pids == PIDS_ALL) {
+   int hide_pid = proc_fs_hide_pid(fs_info);
+   kgid_t gid = proc_fs_pid_gid(fs_info);
+
+   if (hide_pid < hide_pid_min)
+   return true;
 
-   if (hide_pid < hide_pid_min)
-   return true;
-   if (in_group_p(gid))
-   return true;
+   if (in_group_p(gid))
+   return true;
+   }
return ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
 }
 
@@ -701,6 +712,7 @@ static int proc_pid_permission(struct inode *inode, int mask)
 {
struct proc_fs_info *fs_info = proc_sb(inode->i_sb);
int hide_pid = proc_fs_hide_pid(fs_info);
+   int pids = proc_fs_pids(fs_info);
struct task_struct *task;
bool has_perms;
 
@@ -711,7 +723,8 @@ static int proc_pid_permission(struct inode *inode, int mask)
put_task_struct(task);
 
if (!has_perms) {
-   if (hide_pid == HIDEPID_INVISIBLE) {
+   if (pids == PIDS_PTRACEABLE ||
+   hide_pid == HIDEPID_INVISIBLE) {
/*
 * Let's make getdents(), stat(), and open()
 * consistent with each other.  If a process
@@ -3140,6 +3153,7 @@ struct dentry *proc_pid_lookup(struct inode *dir, struct dentry * dentry, unsign
unsigned tgid;
struct proc_fs_info *fs_info = proc_sb(dir->i_sb);
struct pid_namespace *ns = fs_info->pid_ns;
+   int pids = proc_fs_pids(fs_info);
 
tgid = name_to_int(&dentry->d_name);
if (tgid == ~0U)
@@ -3153,7 +3167,15 @@ struct dentry *proc_pid_lookup(struct inode *dir, struct dentry * dentry, unsign
if (!task)
goto out;
 
+   /* Limit procfs to only ptraceable tasks */
+   if (pids != PIDS_ALL) {
+   cond_resched();
+   if (!has_pid_permissions(fs_info, task, HIDEPID_NO_ACCESS))
+   goto out_put_task;
+   }
+
result = proc_pid_instantiate(dir, dentry, task, NULL);
+out_put_task:
put_task_struct(task);
 out:
return ERR_PTR(result);
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index faec32a..2707d5f 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -108,8 +108,12 @@ static int proc_show_options(struct seq_file *seq, struct dentry *root)
int hide_pid = proc_fs_hide_pid(fs_info);
kgid_t pid_gid = proc_fs_pid_gid(fs_info);
 
-   if (proc_fs_newinstance(fs_info))
+   if (proc_fs_newinstance(fs_info)) {
+   int pids = proc_fs_pids(fs_info);
+
seq_printf(seq, ",newinstance");
+   seq_printf(seq, ",pids=%s", pids == PIDS_ALL ? "all" : "ptraceable");
+   }
 
if (!gid_eq(pid_gid, GLOBAL_ROOT_GID))

[PATCH RFC v3 4/7] proc: support mounting private procfs instances inside same pid namespace

2017-11-09 Thread Djalal Harouni
hanges of this patch:

* 'newinstance' mount option, which was also suggested by Andy Lutomirski [5].
When this option is passed we automatically create a private procfs instance.

This is not the default behaviour since we do not want to break userspace
and we do not want to provide different device IDs by default when
stat()ing inodes; I am not sure about all the use cases there [6].

* This patch also moves the 'hidepid' and 'gid' mount options from being
defined and used inside PID namespaces to their private proc_fs_info
struct, cleaning up both PID namespaces and procfs.

Use cases of the 'newinstance' mount option:

* We create a private procfs instance that is disconnected from the
shared or other procfs instances.

* "hidepid": instead of changing all other mirrored procfs mounts, it
will now apply only to the new private instance.

* "gid": instead of changing all other mirrored procfs mounts, it will
now apply only to the new private instance.

* The next patch, which introduces the "pids=ptraceable" mount option
that takes precedence over "hidepid", will only work when 'newinstance'
is set. Otherwise it is ignored.

This should allow, after real testing, a smooth transition later
to a procfs with private instances by default.

[1] https://lists.linuxfoundation.org/pipermail/ksummit-discuss/2017-January/004215.html
[2] http://www.openwall.com/lists/kernel-hardening/2017/10/05/5
[3] https://lwn.net/Articles/689539/
[4] http://lxr.free-electrons.com/source/Documentation/filesystems/devpts.txt?v=3.14
[5] https://lkml.org/lkml/2017/5/2/407
[6] https://lkml.org/lkml/2017/5/3/357
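
Because 'newinstance' has to be known before the rest of the options are
handled, the patch looks for it in a separate early pass over the mount
option string. A minimal userspace sketch of such a pass follows. The helper
name has_option() is hypothetical, and unlike the kernel code, which works on
a kstrdup() copy, this version scans the string in place.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true if 'name' appears as a complete entry in the
 * comma-separated option string 'options' (which may be NULL).
 */
static bool has_option(const char *options, const char *name)
{
	size_t name_len = strlen(name);
	const char *p = options;

	while (p && *p) {
		const char *end = strchr(p, ',');
		size_t len = end ? (size_t)(end - p) : strlen(p);

		/* Compare whole tokens only, so "newinstancex" does not match. */
		if (len == name_len && strncmp(p, name, len) == 0)
			return true;

		p = end ? end + 1 : NULL;
	}
	return false;
}
```

For example, has_option("hidepid=2,newinstance", "newinstance") is true,
while a string that contains only "hidepid=2,gid=5" yields false and would
keep the shared instance.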

Cc: Kees Cook <keesc...@chromium.org>
Cc: Greg Kroah-Hartman <gre...@linuxfoundation.org>
Suggested-by: Andy Lutomirski <l...@kernel.org>
Signed-off-by: Alexey Gladkov <gladkov.ale...@gmail.com>
Signed-off-by: Djalal Harouni <tix...@gmail.com>
---
 fs/proc/base.c|  4 +--
 fs/proc/inode.c   | 14 +---
 fs/proc/root.c| 78 ---
 include/linux/pid_namespace.h |  2 --
 include/linux/proc_fs.h   | 30 ++---
 5 files changed, 110 insertions(+), 18 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index f324c49..54b527c 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -700,7 +700,7 @@ static bool has_pid_permissions(struct proc_fs_info *fs_info,
 static int proc_pid_permission(struct inode *inode, int mask)
 {
struct proc_fs_info *fs_info = proc_sb(inode->i_sb);
-   struct pid_namespace *pid = fs_info->pid_ns;
+   int hide_pid = proc_fs_hide_pid(fs_info);
struct task_struct *task;
bool has_perms;
 
@@ -711,7 +711,7 @@ static int proc_pid_permission(struct inode *inode, int mask)
put_task_struct(task);
 
if (!has_perms) {
-   if (pid->hide_pid == HIDEPID_INVISIBLE) {
+   if (hide_pid == HIDEPID_INVISIBLE) {
/*
 * Let's make getdents(), stat(), and open()
 * consistent with each other.  If a process
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index bdd808d..faec32a 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -105,12 +105,16 @@ static int proc_show_options(struct seq_file *seq, struct dentry *root)
 {
struct super_block *sb = root->d_sb;
struct proc_fs_info *fs_info = proc_sb(sb);
-   struct pid_namespace *pid = fs_info->pid_ns;
+   int hide_pid = proc_fs_hide_pid(fs_info);
+   kgid_t pid_gid = proc_fs_pid_gid(fs_info);
 
-   if (!gid_eq(pid->pid_gid, GLOBAL_ROOT_GID))
-   seq_printf(seq, ",gid=%u", from_kgid_munged(&init_user_ns, pid->pid_gid));
-   if (pid->hide_pid != HIDEPID_OFF)
-   seq_printf(seq, ",hidepid=%u", pid->hide_pid);
+   if (proc_fs_newinstance(fs_info))
+   seq_printf(seq, ",newinstance");
+
+   if (!gid_eq(pid_gid, GLOBAL_ROOT_GID))
+   seq_printf(seq, ",gid=%u", from_kgid_munged(current_user_ns(),pid_gid));
+   if (hide_pid != HIDEPID_OFF)
+   seq_printf(seq, ",hidepid=%u", hide_pid);
 
return 0;
 }
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 48cc481..33ab965 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -28,15 +28,57 @@
 #include "internal.h"
 
 enum {
-   Opt_gid, Opt_hidepid, Opt_err,
+   Opt_gid, Opt_hidepid, Opt_newinstance, Opt_err,
 };
 
 static const match_table_t tokens = {
{Opt_hidepid, "hidepid=%u"},
{Opt_gid, "gid=%u"},
+   {Opt_newinstance, "newinstance"},
{Opt_err, NULL},
 };
 
+/* We only parse 'newinstance' option here */
+int proc_parse_early_options(char *options, struct proc_fs_info *fs_info)
+{
+   char *p, *opts, *orig;
+   substring_t args[MAX_OPT_ARGS];
+
+   if (!options)
+   return 0;
+
+   opts = kstrdup(options, GFP_KERNEL);
+   if (!opts)
+   return -ENOMEM;

[PATCH RFC v3 7/7] proc: flush dcache entries from all procfs instances

2017-11-09 Thread Djalal Harouni
Flush dcache entries of a task when it terminates. The task may have
shown up in multiple procfs mounts per pid namespace, and we need to
walk the mounts and invalidate any leftover entries.
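
The walk can be pictured with a small userspace model. It is only a sketch:
struct procfs_instance, drop_cached_pid() and flush_pid() are hypothetical
names, a fixed array of pids stands in for the dcache entries, and the
locking that the patch takes around the list walk is left out.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_CACHED 8

/* Hypothetical model of one procfs mount and its cached pid entries. */
struct procfs_instance {
	bool newinstance;             /* mounted with 'newinstance' */
	int cached[MAX_CACHED];       /* pids that currently have an entry */
	size_t ncached;
	struct procfs_instance *next; /* next mount of the same pid namespace */
};

/* Drop the entry for 'pid' from one instance; true if one was found. */
static bool drop_cached_pid(struct procfs_instance *inst, int pid)
{
	for (size_t i = 0; i < inst->ncached; i++) {
		if (inst->cached[i] == pid) {
			inst->cached[i] = inst->cached[--inst->ncached];
			return true;
		}
	}
	return false;
}

/*
 * Flush 'pid' from every private instance on the list and then from the
 * shared instance. Returns how many entries were dropped.
 */
static size_t flush_pid(struct procfs_instance *private_list,
			struct procfs_instance *shared, int pid)
{
	size_t flushed = 0;

	for (struct procfs_instance *i = private_list; i; i = i->next)
		if (i->newinstance && drop_cached_pid(i, pid))
			flushed++;

	if (shared && drop_cached_pid(shared, pid))
		flushed++;

	return flushed;
}
```

The order mirrors the patch: private instances first, then the shared mount
of the pid namespace.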

Cc: Kees Cook <keesc...@chromium.org>
Cc: Greg Kroah-Hartman <gre...@linuxfoundation.org>
Cc: Andy Lutomirski <l...@kernel.org>
Cc: Alexey Gladkov <gladkov.ale...@gmail.com>
Signed-off-by: Djalal Harouni <tix...@gmail.com>
---
 fs/proc/base.c| 27 +++-
 fs/proc/inode.c   |  9 +++-
 fs/proc/root.c| 10 +
 include/linux/pid_namespace.h | 49 +++
 include/linux/proc_fs.h   |  2 ++
 5 files changed, 91 insertions(+), 6 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 88b92bc..27e52aa 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3034,7 +3034,8 @@ static const struct inode_operations proc_tgid_base_inode_operations = {
.permission = proc_pid_permission,
 };
 
-static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
+static void proc_flush_task_mnt_root(struct dentry *mnt_root,
+pid_t pid, pid_t tgid)
 {
struct dentry *dentry, *leader, *dir;
char buf[PROC_NUMBUF];
@@ -3043,7 +3044,7 @@ static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
name.name = buf;
name.len = snprintf(buf, sizeof(buf), "%d", pid);
/* no ->d_hash() rejects on procfs */
-   dentry = d_hash_and_lookup(mnt->mnt_root, &name);
+   dentry = d_hash_and_lookup(mnt_root, &name);
if (dentry) {
d_invalidate(dentry);
dput(dentry);
@@ -3054,7 +3055,7 @@ static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
 
name.name = buf;
name.len = snprintf(buf, sizeof(buf), "%d", tgid);
-   leader = d_hash_and_lookup(mnt->mnt_root, &name);
+   leader = d_hash_and_lookup(mnt_root, &name);
if (!leader)
goto out;
 
@@ -3109,14 +3110,30 @@ void proc_flush_task(struct task_struct *task)
int i;
struct pid *pid, *tgid;
struct upid *upid;
+   struct proc_fs_info *fs_info_entry;
+   struct pid_namespace *pid_ns;
+   struct dentry *mnt_root;
 
pid = task_pid(task);
tgid = task_tgid(task);
 
for (i = 0; i <= pid->level; i++) {
upid = &pid->numbers[i];
-   proc_flush_task_mnt(upid->ns->proc_mnt, upid->nr,
-   tgid->numbers[i].nr);
+   pid_ns = upid->ns;
+
+   pidns_proc_lock_shared(pid_ns);
+   list_for_each_entry(fs_info_entry, &pid_ns->procfs_mounts,
+   pidns_entry) {
+   if (proc_fs_newinstance(fs_info_entry)) {
+   mnt_root = fs_info_entry->sb->s_root;
+   proc_flush_task_mnt_root(mnt_root, upid->nr,
+tgid->numbers[i].nr);
+   }
+   }
+   pidns_proc_unlock_shared(pid_ns);
+
+   mnt_root = pid_ns->proc_mnt->mnt_root;
+   proc_flush_task_mnt_root(mnt_root, upid->nr, tgid->numbers[i].nr);
}
 }
 
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 2707d5f..8fcf0d7 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -484,10 +484,17 @@ struct inode *proc_get_inode(struct super_block *sb, struct proc_dir_entry *de)
 int proc_fill_super(struct super_block *s, void *data, int silent)
 {
struct proc_fs_info *fs_info = proc_sb(s);
+   struct pid_namespace *ns = get_pid_ns(fs_info->pid_ns);
struct inode *root_inode;
int ret;
 
-   get_pid_ns(fs_info->pid_ns);
+   fs_info->sb = s;
+
+   if (proc_fs_newinstance(fs_info)) {
+   pidns_proc_lock(ns);
+   list_add_tail(&fs_info->pidns_entry, &ns->procfs_mounts);
+   pidns_proc_unlock(ns);
+   }
 
if (!proc_parse_options(data, fs_info))
return -EINVAL;
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 5cdff69..5503799 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -259,6 +259,13 @@ static void proc_kill_sb(struct super_block *sb)
dput(fs_info->proc_self);
if (fs_info->proc_thread_self)
dput(fs_info->proc_thread_self);
+
+   if (proc_fs_newinstance(fs_info)) {
+   pidns_proc_lock(ns);
+   list_del(&fs_info->pidns_entry);
+   pidns_proc_unlock(ns);
+   }
+
kill_anon_super(sb);
put_pid_ns(ns);
kfree(fs_info);
@@ -374,6 +381,9 @@ int pid_ns_prepare_proc(struct pid_namespace *ns)
return PTR_ERR(mnt);
 
ns->proc_mnt = mnt;
+   init_rwsem(&ns->rw_procfs_mnts);
+   INIT_LIST_HEAD(&ns->procfs_mounts);
+
return 0;
 }

[PATCH RFC v3 0/7] proc: modernize proc to support multiple private instances

2017-11-09 Thread Djalal Harouni
ved 'unshared' mount option and replaced it with 'limit_pids'
   which is attached to the current procfs mount.
   Suggested-by Andy Lutomirski <l...@kernel.org>
*) Do not fill dcache with pid entries that we can not ptrace.
*) Many bug fixes.


Djalal Harouni (7):
 [PATCH 1/7] proc: add proc_fs_info struct to store proc information
 [PATCH 2/7] proc: move /proc/{self|thread-self} dentries to proc_fs_info
 [PATCH 3/7] proc: add helpers to set and get proc hidepid and gid mount options
 [PATCH 4/7] proc: support mounting private procfs instances inside same pid namespace
 [PATCH 5/7] proc: move hidepid definitions to proc files
 [PATCH 6/7] proc: support new 'pids=all|ptraceable' mount option
 [PATCH 7/7] proc: flush dcache entries from all procfs instances


 fs/locks.c|   6 +-
 fs/proc/base.c| 103 ---
 fs/proc/inode.c   |  34 ++--
 fs/proc/internal.h|   2 +-
 fs/proc/root.c| 188 +++


Re: [PATCH v2 2/2] pidmap(2)

2017-09-25 Thread Djalal Harouni
Hi Alexey,

On Sun, Sep 24, 2017 at 9:08 PM, Alexey Dobriyan  wrote:
> From: Tatsiana Brouka 
>
> Implement system call for bulk retrieveing of pids in binary form.
>
> Using /proc is slower than necessary: 3 syscalls + another 3 for each thread +
> converting with atoi() + instantiating dentries and inodes.
>
> /proc may be not mounted especially in containers. Natural extension of
> hidepid=2 efforts is to not mount /proc at all.

Actually, I am not sure software will work if /proc is not mounted. The
last time I checked (years ago), glibc was doing extra checks during
initialization using /proc/self/* memory inodes, and it may fail. Also,
glibc implements fexecve() using /proc/self/..., so it depends on which
library is used and on the use case for cloud containers...

Also, for the natural extension of hidepid=2 where we only want pids
inside /proc without kernel data, we already have a clean patch on top of
the procfs modernization [1]; this is the result of the previous months.


>
> It could be used by programs like ps, top or CRIU. Speed increase will
> become more drastic once combined with bulk retrieval of process statistics.

Yes, the numbers are nice. It seems that you want to move from filesystem
syscalls on procfs to direct syscalls only; hmm, this does not help to fix
procfs. Tools like ps, top and others can be updated, but anyone can
*continue* to use open+read on procfs and access the data.

I think this will be a bit hard to fix from our side, since your patches
do it from the current context, whereas from procfs it will be from the
current + procfs mount context.

What if procfs is mounted with "ptracepids=true", the new "hidepid=" but
without the "gid=" interaction, and then you read from /proc//pidmap/* as
suggested by Andy? /proc//pidmap/{tasks|proc|children}. I am not sure about
the PIDMAP_IGNORE_KTHREADS case...


> Benchmark:
>
> N=1<<16 times
> ~130 processes (~250 task_structs) on a regular desktop system
> opendir + readdir + closedir /proc + the same for every 
> /proc/$PID/task
> (roughly what htop(1) does) vs pidmap
>
> /proc 16.80 ± 0.73%
> pidmap 0.06 ± 0.31%

Thanks!
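
For reference, the /proc side of the quoted benchmark (opendir + readdir +
closedir) can be sketched as below. This is not the benchmark program from
the pidmap patch. The function name count_pid_dirs() is made up for
illustration, and it only counts the numeric entries of one directory, so the
extra walk of every /proc/$PID/task is left out.

```c
#include <ctype.h>
#include <dirent.h>
#include <stddef.h>

/*
 * Count the entries of 'proc_path' whose names consist only of digits,
 * which is how pids show up in procfs. Returns -1 if the directory
 * cannot be opened (for example when /proc is not mounted).
 */
static long count_pid_dirs(const char *proc_path)
{
	DIR *dir = opendir(proc_path);
	struct dirent *de;
	long count = 0;

	if (!dir)
		return -1;

	while ((de = readdir(dir)) != NULL) {
		const char *s = de->d_name;

		if (*s == '\0')
			continue;
		while (isdigit((unsigned char)*s))
			s++;
		if (*s == '\0')
			count++;
	}

	closedir(dir);
	return count;
}
```

Each pid found this way costs at least a name-to-integer conversion in
userspace, which is part of the overhead the quoted description mentions.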


[1] https://github.com/legionus/linux/commit/993a2a5b9af95b0ac901ff41d32124b72ed676e3

P.S. For the procfs modernization, we are planning to send patches in the next few days.

-- 
tixxdz


Re: [PATCH 1/2] pidmap(2)

2017-09-06 Thread Djalal Harouni
Hi Alexey,

On Thu, Sep 7, 2017 at 4:04 AM, Andy Lutomirski  wrote:
> On Wed, Sep 6, 2017 at 2:04 AM, Alexey Dobriyan  wrote:
>> On 9/6/17, Randy Dunlap  wrote:
>>> On 09/05/17 15:53, Andrew Morton wrote:
[...]
>>>
>>> also, I expect that the tiny kernel people will want kconfig options for
>>> these syscalls.
>>
>> We'll add it but the question if it is a good idea. Ideally these system
>> calls should be mandatory and /proc optional.
>>
>> $ size kernel/pidmap.o fs/fdmap.o
>>    text    data     bss     dec     hex filename
>>     560       0       0     560     230 kernel/pidmap.o
>>     617       0       0     617     269 fs/fdmap.o
>
> After much discussion at LPC/KS last year, I thought the idea was to
> try to speed up /proc rather than replacing it outright.  The two
> specific ideas I recall were:
>
> 1. Add a syscall like readfileat() that you can use to, in a single
> operation, open, read, and close a /proc file (or other file).  This
> should vastly reduce locking and RCU overhead.
>
> 2. Add a /proc file that has a nice binary format for task info.  (nl_attr?)
>
> I don't see why pidmap() deserves to be significantly faster than getdents().
>
> Also, a pidmap() syscall like this inherently bypasses any security
> restrictions implied by the way that /proc is mounted.  It can respect
> hidepid, but hidepid (as a per-namespace concept) is an enormous turd
> that badly needs to be deprecated, and Djalal is working on exactly
> that.

Yes, as noted by Andy, Alexey Gladkov and I are working on modernizing
procfs [1] and on reducing/removing its ties to pid namespaces, which cause
a lot of problems now.

We just picked the task up again. This was the result of a discussion with
Andy some months ago on how to improve hidepid, but also on how to improve
procfs in general, so that we can add other mechanisms to hide or return
NULL on other /proc/_file_not_needed_by_containers_ or
/proc/_specific_module_files_ (everything that is not virtualized), or mount
only some specific view of the whole /proc API; this will also be used by
containers. This should also make things harder for attackers, since we are
planning to have backward compatible options for how to better treat some of
these files with regard to some namespaces.

The syscall, or readfileat() for doing it in one operation, is definitely a
nice addition. But in general it would be better to treat /proc as a
filesystem and not add other specific interfaces that may abstract it with
pidns, as is the situation now, which makes it hard to use from a userspace
perspective, especially in a security context.

Alexey, could you please Cc us in the future? Thank you very much!
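
To make the readfileat() idea concrete, here is what the same operation costs
today from userspace. This is only a sketch: read_file_at() is a hypothetical
helper name, not the proposed syscall, and it does a single read() so a large
file would come back truncated.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical userspace stand-in for the proposed readfileat():
 * three syscalls (openat, read, close) where the proposal wants one.
 * Returns the number of bytes read, or -1 on error.
 */
ssize_t read_file_at(int dirfd, const char *path, char *buf, size_t len)
{
	ssize_t n;
	int fd;

	assert(path != NULL && buf != NULL);

	fd = openat(dirfd, path, O_RDONLY | O_CLOEXEC);   /* syscall 1 */
	if (fd < 0)
		return -1;

	do {
		n = read(fd, buf, len);                   /* syscall 2 */
	} while (n < 0 && errno == EINTR);

	close(fd);                                        /* syscall 3 */
	return n;
}
```

Because it still goes through the filesystem, a call such as
read_file_at(AT_FDCWD, "/proc/self/stat", buf, sizeof(buf)) stays subject to
whatever restrictions the procfs mount imposes, which is the property argued
for above.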


[1] https://lkml.org/lkml/2017/4/25/282

-- 
tixxdz


Re: [PATCH v4 next 1/3] modules:capabilities: allow __request_module() to take a capability argument

2017-09-02 Thread Djalal Harouni
Hi Kees,

On Thu, Jun 1, 2017 at 9:10 PM, Kees Cook <keesc...@google.com> wrote:
> On Thu, Jun 1, 2017 at 7:56 AM, Djalal Harouni <tix...@gmail.com> wrote:
...
>
>> BTW Kees, also in next version I won't remove the
>> capable(CAP_NET_ADMIN) check from [1]
>> even if there is the new request_module_cap(), I would like it to be
>> in a different patches, this way we go incremental
>> and maybe it is better to merge what we have now ?  and follow up
>> later, and of course if other maintainers agree too!
>
> Yes, incremental. I would suggest first creating the API changes to
> move a basic require_cap test into the LSM (which would drop the
> open-coded capable() checks in the net code), and then add the
> autoload logic in the following patches. That way the "infrastructure"
> changes happen separately and do not change any behaviors, but moves
> the caps test down where its wanted in the LSM, before then augmenting
> the logic.
>
>> I just need a bit of free time to check again everything and will send
>> a v5 with all requested changes.
>
> Great, thank you!
>

Sorry, I was busy these last months. I picked it up again and will send v5
after the merge window.

Kees, I am looking for a way to integrate a test for it. Should we use
something like the example here [1], or maybe something else? And which
module should we use?

I still have not sorted this out; if anyone has suggestions, thank you in
advance!


[1] http://openwall.com/lists/kernel-hardening/2017/05/22/7

-- 
tixxdz


Re: [PATCH v4 next 1/3] modules:capabilities: allow __request_module() to take a capability argument

2017-06-01 Thread Djalal Harouni
On Tue, May 30, 2017 at 7:59 PM, Kees Cook  wrote:
[...]
>>> I see a few options:
>>>
>>> 1) keep what you have for v4, and hope other places don't use
>>> __request_module. (I'm not a fan of this.)
>>
>> Yes even if it is documented I wouldn't bet on it, though. :-)
>
> Okay, we seem to agree: we'll not use #1.
>
>>> 2) switch the logic on autoload==1 from OR to AND: both the specified
>>> caps _and_ CAP_SYS_MODULE are required. (This seems like it might make
>>> autoload==1 less useful.)
>>
>> That will restrict some userspace that works only with CAP_NET_ADMIN.
>
> Nor #2.
>
>>> 3) use the request_module_cap() outlined above, which requires that
>>> modules being loaded under a CAP_SYS_MODULE-aliased capability are at
>>> least restricted to a subset of kernel module names.
>>
>> This one tends to allow usability.
>
> Right, discussed below...
>
>>> 4) same as 3 but also insert autoload==2 level that switches from OR
>>> to AND (bumping existing ==2 to ==3).
>>
>> I wouldn't expose autoload to callers, I think it is better if it
>> stays a property of the module subsystem. But lets use the bump idea,
>> please see below.
>
> If we can't agree below, I think #4 would be a good way to allow for
> both states.

Ok!


>>> What do you think?
>>
>> Ok so given that we already have modules_autoload_mode=2 disabled,
>> maybe we go with 3)  like this ?
>>
>> int __request_module(bool wait, int required_cap, const char *prefix,
>> const char *name, ...);
>> #define request_module(mod...) \
>> __request_module(true, -1, NULL, mod)
>> #define request_module_cap(required_cap, prefix, mod...) \
>> __request_module(true, required_cap, prefix, mod)
>>
>> and we require allow_cap and prefix to be set.
>>
>> request_module_cap(CAP_NET_ADMIN, "netdev-", "%s", name) for
>> net/core/dev_ioctl.c:dev_load()
>>
>> request_module_cap(CAP_NET_ADMIN, "tcp_", "%s", name) for
>> net/ipv4/tcp_cong.c  functions.
>>
>>
>> Then
>> __request_module()
>>   -> security_kernel_module_request(module_name, required_cap, prefix)
>>  -> may_autoload_module(current, module_name, required_cap, prefix)
>>
>>
>> And update may_autoload_module() as below ? we hard code CAP_NET_ADMIN
>> and CAP_SYS_MODULE inside and make them the only capabilities needed
>> for a privileged auto-load operation.
>
> I still think making a specific exception for CAP_NET_ADMIN is not the
> right solution, instead allowing for non-CAP_SYS_MODULE caps when
> using a distinct prefix.

Alright! I would have loved to avoid the capabilities game, but these
patches also use them... so in the worst case the per-task setting can always
be used, "task->modules_autoload_mode=2", to block it if necessary.
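
As an illustration of the prefix idea agreed on here, the pairing of a
capability with a module-name prefix can be modelled in plain userspace C.
This is a sketch under assumptions, not the patch: build_module_name() and
effective_required_cap() are hypothetical names, and the capability numbers
are copied from <linux/capability.h> under SKETCH_ names so the example
builds anywhere.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Same numeric values as CAP_NET_ADMIN and CAP_SYS_MODULE. */
#define SKETCH_CAP_NET_ADMIN  12
#define SKETCH_CAP_SYS_MODULE 16

/*
 * Build "<prefix><formatted name>" into out, e.g. "netdev-" + "eth0".
 * Returns 0 on success, -1 if the result does not fit in len bytes.
 */
int build_module_name(char *out, size_t len, const char *prefix,
		      const char *fmt, ...)
{
	size_t plen = prefix ? strlen(prefix) : 0;
	va_list ap;
	int n;

	assert(out != NULL && fmt != NULL && len > 0);
	if (plen >= len)
		return -1;
	if (plen)
		memcpy(out, prefix, plen);

	va_start(ap, fmt);
	n = vsnprintf(out + plen, len - plen, fmt, ap);
	va_end(ap);

	if (n < 0 || (size_t)n >= len - plen)
		return -1;
	return 0;
}

/*
 * Which capability gates a privileged autoload: the caller's capability
 * is honoured only together with a non-empty prefix; an unprefixed
 * request falls back to CAP_SYS_MODULE.
 */
int effective_required_cap(int require_cap, const char *prefix)
{
	if (require_cap > 0 && prefix != NULL && *prefix != '\0')
		return require_cap;
	return SKETCH_CAP_SYS_MODULE;
}
```

With this shape, a caller holding only CAP_NET_ADMIN can ask for
"netdev-eth0" but has no way to name an arbitrary module, since the prefix is
fixed by the call site rather than by the format string.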


>> request_module_cap(CAP_SYS_MODULE, ...) or
>> request_module_cap(CAP_NET_ADMIN, ...) if the autoload should be a
>> privileged operation.
>>
>> Kees will this work ?
>>
>> Jessica,  Rusty,  Serge. What do you think ? I definitively think that
>> module_autoload should be contained only inside the module subsystem..
>
> I'd change it like this:
>
>> +int may_autoload_module(struct task_struct *task, char *kmod_name,
>> +   int require_cap, char *prefix)
>> +{
>> +   unsigned int autoload;
>> +   int module_require_cap = 0;
>
> I'd initialize this to module_require_cap = CAP_SYS_MODULE;

Ok, please see below.



>> +
>> +   if (require_cap > 0) {
>> +   if (prefix == NULL || *prefix == '\0')
>> +   return -EPERM;
>
> Since an unprefixed module load should only be CAP_SYS_MODULE, change
> the above "if" to:
>
> if (require_cap > 0 && prefix != NULL && *prefix != '\0')
>
>> +
>> +   /*
>> +* We only allow CAP_SYS_MODULE or CAP_NET_ADMIN for
>> +* 'netdev-%s' modules for backward compatibility.
>> +* Please do not overload capabilities.
>> +*/
>> +   if (require_cap == CAP_SYS_MODULE ||
>> +   require_cap == CAP_NET_ADMIN)
>> +   module_require_cap = require_cap;
>> +   else
>> +   return -EPERM;
>> +   }
>
> And then drop all these checks, leaving only:
>
> module_require_cap = require_cap;
>
>> +
>> +   /* Get max value of sysctl and task "modules_autoload_mode" */
>> +   autoload = max_t(unsigned int, modules_autoload_mode,
>> +task->modules_autoload_mode);
>> +
>> +   /*
>> +* If autoload is disabled then fail here and not bother at all
>> +*/
>> +   if (autoload == MODULES_AUTOLOAD_DISABLED)
>> +   return -EPERM;
>> +
>> +   /*
>> +* If caller require capabilities then we may not allow
>> +* automatic module loading. We should not bypass callers.
>> +* This allows to support networking code that uses CAP_NET_ADMIN
>> +* for some aliased 'netdev-%s' modules.
>> +*
>> +* Explicitly bump autoload here if necessary
>> +

Re: [kernel-hardening] [PATCH v4 next 0/3] modules: automatic module loading restrictions

2017-05-24 Thread Djalal Harouni
On Tue, May 23, 2017 at 9:48 AM, Solar Designer <so...@openwall.com> wrote:
>> >>> On Mon, May 22, 2017 at 2:08 PM, Solar Designer <so...@openwall.com> wrote:
>> >>> > On Mon, May 22, 2017 at 01:57:03PM +0200, Djalal Harouni wrote:
>> >>> >> *) When modules_autoload_mode is set to (2), automatic module loading
>> >>> >> is disabled for all. Once set, this value can not be changed.
>> >>> >
>> >>> > What purpose does this securelevel-like property ("Once set, this value
>> >>> > can not be changed.") serve here?  I think this mode 2 is needed, but
>> >>> > without this extra property, which is bypassable by e.g. explicitly
>> >>> > loaded kernel modules anyway (and that's OK).
>
> On Mon, May 22, 2017 at 04:07:56PM -0700, Kees Cook wrote:
>> I'm on the fence. For modules_disabled and Yama, it was tied to
>> CAP_SYS_ADMIN, basically designed to be a at-boot setting that could
>> not later be undone by an attacker gaining that privilege, keeping
>> them out of either kernel memory or existing user process memory.
>> Here, it's CAP_SYS_MODULE... it's hard to imagine the situation where
>> a CAP_SYS_MODULE-capable process could write to this sysctl but NOT
>> issue direct modprobe requests, but it's _possible_ via crazy symlink
>> games to trick capable processes into writing to sysctls. We've seen
>> this multiple times before, and it's a way for attackers to turn a
>> single privileged write into a privileged exec.
>
> OK, tricking a process via crazy symlink games is finally a potentially
> valid reason.  The question then becomes: are there perhaps so many
> other important sysctl's, disk files, etc. (which the vulnerable capable
> process could similarly be tricked into writing) so that specifically
> resetting modules_autoload_mode isn't particularly lucrative?  I think
> that the answer to that is usually yes.  Another related question: do we
> really want to inconsistently single out a handful of sysctl's for this
> kind of extra protection?  I think not.
>
> I agree there are some other settings where being unable to reset them
> makes sense, but I think this isn't one of those.
>

Alright, I already replied to Andy; since it was requested, I will drop it.
I definitely prefer that we have something merged and usable ;-)


>> I might turn the question around, though: why would we want to have it
>> changeable at this setting?
>
> Convenience for the sysadmin - being able to correct one's error (e.g.,
> wrong order of shell commands), respond to new findings (thought module
> autoloading was unneeded after some point, then found out some software
> relies on it), change one's mind, reuse a system differently than
> originally intended without a forced reboot.
>
>> I'm fine leaving that piece off, either way.
>
> I'm also fine with either decision.  I just thought I'd point out what
> looked weird to me.
>
> I think this is an important patch that should get in, but primarily
> for modules_autoload_mode=1, which many distros could make the default
> (and maybe the kernel eventually should?)

I do not think that desktop or interactive systems will set it.
Ubuntu has snaps for their app store, and they can use the per-task
flag; other vendors can too (Flatpak, etc.).


> For modules_autoload_mode=2, we already seem to have the equivalent of
> modprobe=/bin/true (or does it differ subtly, maybe in return values?),
> which I already use at startup on a GPU box like this (preloading
> modules so that the OpenCL backends wouldn't need the autoloading):
>
> nvidia-smi
> nvidia-modprobe -u -c=0
> #modprobe nvidia_uvm
> #modprobe fglrx
>
> sysctl -w kernel.modprobe=/bin/true
> sysctl -w kernel.hotplug=/bin/true
>
> but it's good to also have this supported more explicitly and more
> consistently through modules_autoload_mode=2 while we're at it.  So I
> support having this mode as well.  I just question the need to have it
> non-resettable.
>

Ok, yes. With mode=2 it is a clear interface: it logs what was denied, it
is consistent with the per-task flag that we want for sandboxes and
desktops, and it will work with CONFIG_STATIC_USERMODEHELPER. More
importantly, it avoids doing the usermode helper upcall. Even if one day
the kernel supports upcalls into namespaces, I'm pretty sure that no one
will like the idea of namespaced modprobe paths; modules are global
anyway. This lets us say we are safe by default from any future change
that may make an upcall into the wrong context: we just avoid it.
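
For illustration, the way the per-task flag and the global sysctl combine can
be sketched as below. The v4 patch quoted elsewhere in this thread takes the
maximum of the two values; everything else here (the SKETCH_ names, the
helper that says whether the upcall is skipped) is an assumption made for the
example, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the three documented modes: 0 allowed, 1 privileged, 2 disabled. */
enum sketch_autoload_mode {
	SKETCH_AUTOLOAD_ALLOWED    = 0,
	SKETCH_AUTOLOAD_PRIVILEGED = 1,
	SKETCH_AUTOLOAD_DISABLED   = 2,
};

/* The stricter of the global sysctl and the per-task value wins. */
unsigned int sketch_effective_mode(unsigned int sysctl_mode,
				   unsigned int task_mode)
{
	return sysctl_mode > task_mode ? sysctl_mode : task_mode;
}

/*
 * In disabled mode the request is refused before any usermode helper
 * would be executed, so no upcall happens at all.
 */
bool sketch_upcall_skipped(unsigned int sysctl_mode, unsigned int task_mode)
{
	return sketch_effective_mode(sysctl_mode, task_mode) ==
	       SKETCH_AUTOLOAD_DISABLED;
}
```

A sandboxed task with its own flag set to 2 is therefore blocked even when
the system-wide sysctl is left at 0, and a task cannot loosen a stricter
global setting.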


-- 
tixxdz


Re: [PATCH v4 next 1/3] modules:capabilities: allow __request_module() to take a capability argument

2017-05-24 Thread Djalal Harouni
On Tue, May 23, 2017 at 9:19 PM, Kees Cook <keesc...@google.com> wrote:
> On Tue, May 23, 2017 at 3:29 AM, Djalal Harouni <tix...@gmail.com> wrote:
[...]

>> I think if there is an interface request_module_capable() , then code
>> will use it. The DCCP code path did not check capabilities at all and
>> called request_module(), other code does the same.
>>
>> A new interface can be abused, the result of this: we may break
>> "modules_autoload_mode" in mode 0 and 1. In the long term code will
>> want to change may_autoload_module() to also allow mode 1 to load a
>> module with CAP_NET_ADMIN or other caps in its own userns, resulting
>> in "modules_autoload_mode == 0 == 1". Without userns in the game we
>> may just see request_module_capable(CAP_SYS_ADMIN, ...)  . There is
>> already some code maybe phonet sockets ? that require CAP_SYS_ADMIN to
>> get the appropriate protocol and no one will be able to review all
>> this code or track new patches with request_module_capable() callers.
>
> I'm having some trouble following what you're saying here, but if I
> understand, you're worried about getting the kernel into a state where
> autoload state 0 == 1. Autoload 0 is "business as usual", and autoload
> 1 is "CAP_SYS_MODULE required to be able to trigger a module auto-load
> operation, or CAP_NET_ADMIN for modules with a 'netdev-%s' alias."

Indeed.


> In the v4 patch, under autoload==1, CAP_NET_ADMIN is needed to load
> netdev- modules:
>
> if (no_module && capable(CAP_NET_ADMIN))
>         no_module = __request_module(true, CAP_NET_ADMIN,
>                                      "netdev-%s", name);
>
> and in the LSM hook, CAP_NET_ADMIN is passed as an allowable "alias"
> for the CAP_SYS_MODULE requirement:
>
> else if (modules_autoload_mode == MODULES_AUTOLOAD_PRIVILEGED) {
>         /* Check CAP_SYS_MODULE then allow_cap if valid */
>         if (capable(CAP_SYS_MODULE) ||
>             (allow_cap > 0 && capable(allow_cap)))
>                 return 0;
> }
>
> What I see is some needless double-checking. Since you're making
> changes to the request_module() API, it would be possible to have

That check is *not* a double check, and it is *really* needed in v4
because of how may_autoload_module() was implemented: it first checks
whether 'autoload' == 0 (ALLOWED), and if so it allows the operation
regardless of the capability. That's why I didn't want to touch the
current network logic, and assumed that the net code knows what it
should do.


> request_module_cap(), which could be checked instead of open-coding
> it:
>
> if (no_module)
>         no_module = request_module_cap(CAP_NET_ADMIN, "netdev-%s", name);
>
> If I'm understanding your objection correctly, it's that you want to
> ONLY ever provide this one-time alias for CAP_SYS_MODULE with the
> netdev-%s things, and you don't want to risk having other module
> loading start using request_module_cap() which would lead to
> CAP_SYS_MODULE aliases in other places?

Yes. We can't really track capabilities usage or new code.


> If the goal is to make sure that only privileged processes are
> autoloading, I don't think adding a well defined interface for
> cap-checks (request_module_cap()) would lead to a slippery slope. The
> worst case scenario (which would never happen) would be all
> request_module() users would convert to request_module_cap(). This

I am also concerned a bit with new code. In the documentation we
explicitly say CAP_SYS_MODULE, and new code should not break that
assumption.


> would mean that all module loading would require specific privileges.
> That seems in line with autoload==1. They would not be tied to
> CAP_SYS_MODULE, though, which is, I suspect, what you're concerned
> about.

Indeed, it is just easy to say "hey, it needs CAP_SYS_MODULE". The
capability usage in the module subsystem, more precisely for explicit
loading, is clean. CAP_SYS_MODULE is not overloaded; it has a clear
focus. As you say, we should be concerned if we blindly trust
callers and end up *aliasing* CAP_SYS_MODULE with some other cap...


> Even in the existing code, there is a sense about CAP_NET_ADMIN and
> CAP_SYS_MODULE having different privilege levels, in that
> CAP_NET_ADMIN can only load netdev-%s modules, but CAP_SYS_MODULE can
> load any module. What about refining request_module_cap() to _require_
> an explicit string prefix instead of an arbitrary format string? e.g.
> request_module_cap(CAP_NET_ADMIN, "netdev", "%s", name) which would
> make requests for ("netdev-%s", name)
>
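
A rough userspace sketch of the explicit-prefix idea quoted above: the caller passes the prefix separately and the helper always builds "<prefix>-<rest>", so a capability alias could be tied to a fixed prefix such as "netdev". The function name build_prefixed_alias() and its return convention are hypothetical; only the request_module_cap(CAP_NET_ADMIN, "netdev", "%s", name) call shape comes from the thread.

```c
#include <stdarg.h>
#include <stdio.h>

/*
 * Build "<prefix>-<formatted rest>" into buf.
 * Returns 0 on success, -1 if the prefix is empty or the result
 * does not fit into len bytes (including the trailing NUL).
 */
int build_prefixed_alias(char *buf, size_t len, const char *prefix,
			 const char *fmt, ...)
{
	va_list args;
	int n, m;

	/* The whole point is that a prefix is mandatory. */
	if (prefix == NULL || prefix[0] == '\0')
		return -1;

	n = snprintf(buf, len, "%s-", prefix);
	if (n < 0 || (size_t)n >= len)
		return -1;

	/* Append the caller's format after the fixed prefix. */
	va_start(args, fmt);
	m = vsnprintf(buf + n, len - (size_t)n, fmt, args);
	va_end(args);
	if (m < 0 || (size_t)m >= len - (size_t)n)
		return -1;

	return 0;
}
```

With this shape, build_prefixed_alias(buf, sizeof(buf), "netdev", "%s", name) can only ever yield a "netdev-..." alias, whatever the format string expands to.
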
> I see a few options:
>
> 1) keep what you have for v4, and hope other places don't u

Re: [RFC][PATCH 0/9] Make containers kernel objects

2017-05-23 Thread Djalal Harouni
On Tue, May 23, 2017 at 2:54 PM, Eric W. Biederman
 wrote:
> Jeff Layton  writes:
>
>> On Mon, 2017-05-22 at 14:04 -0500, Eric W. Biederman wrote:
>>> David Howells  writes:
>>>
>>> > Here are a set of patches to define a container object for the kernel and
>>> > to provide some methods to create and manipulate them.
>>> >
>>> > The reason I think this is necessary is that the kernel has no idea how to
>>> > direct upcalls to what userspace considers to be a container - current
>>> > Linux practice appears to make a "container" just an arbitrarily chosen
>>> > junction of namespaces, control groups and files, which may be changed
>>> > individually within the "container".
>>> >
>>>
>>> I think this might possibly be a useful abstraction for solving the
>>> keyring upcalls if it was something created implicitly.
>>>
>>> fork_into_container for use by keyring upcalls is currently a security
>>> vulnerability as it allows escaping all of a container's cgroups.  But
>>> you have that on your list of things to fix.  However you don't have
>>> seccomp and a few other things.
>>>
>>> Before we had kthreadd in the kernel, upcalls always had issues because
>>> the code to reset all of the userspace bits and make the forked
>>> task suitable for running upcalls was always missing some detail.  It is
>>> a very bug-prone kind of idiom that you are talking about.  It is doubly
>>> bug-prone because the wrongness is visible to userspace and as such
>>> might become a frozen KABI guarantee.
>>>
>>> Let me suggest a concrete alternative:
>>>
>>> - At the time of mount, observe the mounter's user namespace.
>>> - Find the mounter's pid namespace.
>>> - If the mounter's pid namespace is owned by the mounter's user namespace,
>>>   walk up the pid namespace tree to the first pid namespace owned by
>>>   that user namespace.
>>> - If the mounter's pid namespace is not owned by the mounter's user
>>>   namespace, fail the mount: it is going to need to make upcalls, and
>>>   that will not be possible.
>>> - Hold a reference to the pid namespace that was found.
>>>
>>> Then when an upcall needs to be made fork a child of the init process
>>> of the specified pid namespace.  Or fail if the init process of the
>>> pid namespace has died.
>>>
>>> That should always work and it does not require keeping expensive state
>>> where we did not have it previously.  Further because the semantics are
>>> fork a child of a particular pid namespace's init as features get added
>>> to the kernel this code remains well defined.
>>>
>>> For ordinary request-key upcalls we should be able to use the same rules
>>> and just not save/restore things in the kernel.
>>>
>>
>> OK, that does seem like a reasonable idea. Note that it's not just
>> request-key upcalls here that we're interested in, but anything that
>> we'd typically spawn from kthreadd otherwise.
>
> General user mode helper *Nod*.
>
>> That said, I worry a little about this. If the init process does a setns
>> at the wrong time, suddenly you're doing the upcall in different
>> namespaces than you intended.
>>
>> Might it be better to use the init process of the container as the
>> template like you suggest, but snapshot its "context" at a particular
>> point in time instead?
>>
>> knfsd could do this when it's started, for instance...
>
> The danger of a snapshot in time is that something important (like cgroup
> membership) might change.
>
> It might be necessary to have this be an opt-in.   Perhaps even to the
> point of starting a dedicated kthreadd.
>
> Right now I think we need to figure out what it will take to solve this
> in the kernel because I strongly suspect that solving this in userspace
> is a cop out and we really aren't providing enough information to
> userspace to run the helper in the proper context.  And I strongly
> suspect that providing enough information from the kernel will be
> roughly equivalent to solving this in the kernel.

Maybe it depends on the case; a general approach can be too difficult
to handle, especially from a security point of view. Maybe it is better
to identify which operations need which context, so that a userspace
service/proxy can act using kthreadd with the right context... at
least the shift to this model was made years ago in the mobile
industry.

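For reference, the mount-time namespace selection from the quoted list can be sketched in plain C with toy structures. This is only one reading of the steps: "first pid namespace owned by that user namespace" is taken to mean the outermost ancestor still owned by the mounter's user namespace. The struct layouts and the names upcall_ns_for_mount() and upcall_can_fork() are hypothetical, not kernel API.

```c
#include <stddef.h>

struct user_ns {
	int id;
};

struct pid_ns {
	struct pid_ns *parent;	/* NULL for the initial pid namespace */
	struct user_ns *owner;	/* user namespace owning this pid namespace */
	int init_alive;		/* nonzero while this namespace's init exists */
	int refcount;
};

/*
 * Mount time: choose the pid namespace that later upcalls are forked into.
 * Returns NULL when the mount should fail because upcalls cannot work.
 */
struct pid_ns *upcall_ns_for_mount(struct pid_ns *mounter_pid_ns,
				   struct user_ns *mounter_user_ns)
{
	struct pid_ns *ns = mounter_pid_ns;

	/* The mounter's pid namespace must be owned by its user namespace. */
	if (ns == NULL || ns->owner != mounter_user_ns)
		return NULL;

	/* Walk up while the parent is still owned by the same user namespace. */
	while (ns->parent != NULL && ns->parent->owner == mounter_user_ns)
		ns = ns->parent;

	/* Hold a reference to the namespace that was found. */
	ns->refcount++;
	return ns;
}

/* Upcall time: 0 if a child of that init can be forked, -1 if init died. */
int upcall_can_fork(const struct pid_ns *ns)
{
	return ns->init_alive ? 0 : -1;
}
```

The sketch keeps no state beyond the held reference, which matches the claim in the quoted text that no expensive state needs to be kept.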

-- 
tixxdz


Re: [RFC][PATCH 0/9] Make containers kernel objects

2017-05-23 Thread Djalal Harouni
On Tue, May 23, 2017 at 12:22 AM, Jeff Layton  wrote:
> On Mon, 2017-05-22 at 14:04 -0500, Eric W. Biederman wrote:
>> David Howells  writes:
>>
>> > Here are a set of patches to define a container object for the kernel and
>> > to provide some methods to create and manipulate them.
>> >
>> > The reason I think this is necessary is that the kernel has no idea how to
>> > direct upcalls to what userspace considers to be a container - current
>> > Linux practice appears to make a "container" just an arbitrarily chosen
>> > junction of namespaces, control groups and files, which may be changed
>> > individually within the "container".
>> >
>>
>> I think this might possibly be a useful abstraction for solving the
>> keyring upcalls if it was something created implicitly.
>>
>> fork_into_container for use by keyring upcalls is currently a security
>> vulnerability as it allows escaping all of a container's cgroups.  But
>> you have that on your list of things to fix.  However you don't have
>> seccomp and a few other things.
>>
>> Before we had kthreadd in the kernel, upcalls always had issues because
>> the code to reset all of the userspace bits and make the forked
>> task suitable for running upcalls was always missing some detail.  It is
>> a very bug-prone kind of idiom that you are talking about.  It is doubly
>> bug-prone because the wrongness is visible to userspace and as such
>> might become a frozen KABI guarantee.
>>
>> Let me suggest a concrete alternative:
>>
>> - At the time of mount, observe the mounter's user namespace.
>> - Find the mounter's pid namespace.
>> - If the mounter's pid namespace is owned by the mounter's user namespace,
>>   walk up the pid namespace tree to the first pid namespace owned by
>>   that user namespace.
>> - If the mounter's pid namespace is not owned by the mounter's user
>>   namespace, fail the mount: it is going to need to make upcalls, and
>>   that will not be possible.
>> - Hold a reference to the pid namespace that was found.
>>
>> Then when an upcall needs to be made fork a child of the init process
>> of the specified pid namespace.  Or fail if the init process of the
>> pid namespace has died.
>>
>> That should always work and it does not require keeping expensive state
>> where we did not have it previously.  Further because the semantics are
>> fork a child of a particular pid namespace's init as features get added
>> to the kernel this code remains well defined.
>>
>> For ordinary request-key upcalls we should be able to use the same rules
>> and just not save/restore things in the kernel.
>>
>
> OK, that does seem like a reasonable idea. Note that it's not just
> request-key upcalls here that we're interested in, but anything that
> we'd typically spawn from kthreadd otherwise.

Generalizing it will expose the kernel to exploits. Today containers
set up the mount namespace for images from the net and outdated
filesystems, and users just do it because it is easy. Having a kthread
running inside such contexts is not a good idea. Those are today's use
cases.


> That said, I worry a little about this. If the init process does a setns
> at the wrong time, suddenly you're doing the upcall in different
> namespaces than you intended.

That init process, or whatever process is inside, owns that context and files.

Maybe for some cases it is better to use userspace that you can talk
to through a standard kernel bus endpoint and request a resource as it
is done within modern apps. The application at the other end acts
using kthread helpers in the appropriate context.


-- 
tixxdz


Re: [kernel-hardening] [PATCH v4 next 0/3] modules: automatic module loading restrictions

2017-05-23 Thread Djalal Harouni
On Tue, May 23, 2017 at 1:38 AM, Andy Lutomirski <l...@kernel.org> wrote:
> On Mon, May 22, 2017 at 4:07 PM, Kees Cook <keesc...@chromium.org> wrote:
>> On Mon, May 22, 2017 at 12:55 PM, Djalal Harouni <tix...@gmail.com> wrote:
>>> On Mon, May 22, 2017 at 6:43 PM, Solar Designer <so...@openwall.com> wrote:
>>>> On Mon, May 22, 2017 at 03:49:15PM +0200, Djalal Harouni wrote:
>>>>> On Mon, May 22, 2017 at 2:08 PM, Solar Designer <so...@openwall.com> 
>>>>> wrote:
>>>>> > On Mon, May 22, 2017 at 01:57:03PM +0200, Djalal Harouni wrote:
>>>>> >> *) When modules_autoload_mode is set to (2), automatic module loading is
>>>>> >> disabled for all. Once set, this value can not be changed.
>>>>> >
>>>>> > What purpose does this securelevel-like property ("Once set, this value
>>>>> > can not be changed.") serve here?  I think this mode 2 is needed, but
>>>>> > without this extra property, which is bypassable by e.g. explicitly
>>>>> > loaded kernel modules anyway (and that's OK).
>>>>>
>>>>> My reasoning about "Once set, this value can not be changed" is mainly for:
>>>>>
>>>>> If you have some systems where modules are not updated for any given
>>>>> reason, then the only one who will be able to load a module is an
>>>>> administrator, basically this is a shortcut for:
>>>>>
>>>>> * Apps/services can run with CAP_NET_ADMIN but they are not allowed to
>>>>> auto-load 'netdev' modules.
>>>>>
>>>>> * Explicitly loading modules can be guarded by seccomp filters *per*
>>>>> app, so even if these apps have
>>>>>   CAP_SYS_MODULE they won't be able to explicitly load modules, one
>>>>> has to remount some sysctl /proc/ entries read-only here and remove
>>>>> CAP_SYS_ADMIN for all apps anyway.
>>>>>
>>>>> This mainly serves the purpose of these systems that do not receive
>>>>> updates, if I don't want to expose those kernel interfaces what should
>>>>> I do ? then if I want to unload old versions and replace them with new
>>>>> ones what operation should be allowed ? and only real root of the
>>>>> system can do it. Hence, the "Once set, this value can not be changed"
>>>>> is more of a shortcut, also the idea was put in my mind based on how
>>>>> "modules_disabled" is disabled forever, and some other interfaces. I
>>>>> would say: it is easy to handle a transition from 1) "hey this system
>>>>> is still up to date, some features should be exposed" to 2) "this
>>>>> system is not up to date anymore, only root should expose some
>>>>> features..."
>>>>>
>>>>> Hmm, I am not sure if this answers your question ? :-)
>>>>
>>>> This answers my question, but in a way that I summarize as "there's no
>>>> good reason to include this securelevel-like property".
>>>>
>>>
>>> Hmm, sorry I did forget to add in my previous comment that with such
>>> systems, CAP_SYS_MODULE can be used to reset the
>>> "modules_autoload_mode" sysctl back from mode 2 to mode 1, even if we
>>> disable it privileged tasks can be triggered to overwrite the sysctl
>>> flag and get it back unless /proc is read-only... that's one of the
>>> points, it should not be so easy to relax it.
>>
>> I'm on the fence. For modules_disabled and Yama, it was tied to
>> CAP_SYS_ADMIN, basically designed to be an at-boot setting that could
>> not later be undone by an attacker gaining that privilege, keeping
>> them out of either kernel memory or existing user process memory.
>> Here, it's CAP_SYS_MODULE... it's hard to imagine the situation where
>> a CAP_SYS_MODULE-capable process could write to this sysctl but NOT
>> issue direct modprobe requests, but it's _possible_ via crazy symlink
>> games to trick capable processes into writing to sysctls. We've seen
>> this multiple times before, and it's a way for attackers to turn a
>> single privileged write into a privileged exec.
>>
>> I might turn the question around, though: why would we want to have it
>> changeable at this setting?
>>
>> I'm fine leaving that piece off, either way.
>
> I think that having the un-resettable mode is unnecessary.  We should
> have an option that disables loading modules entirely and cannot be
> unset.  (That means no explicit loads and no implicit loads.)  Maybe
> we already have this.  Otherwise, tightening caps needed for implicit
> loads should just be a normal yes/no setting IMO.

Ok, so as requested by you and Alexander in the other email, I will
remove the un-resettable property; if there is no general agreement,
it is better to leave it for the future.

I would have preferred a plain "yes/no" setting (in systemd the module
protection and other directives are also only 'yes/no').  But 1)
backward compatibility for unprivileged loading, 2) capabilities: the
CAP_SYS_MODULE and CAP_NET_ADMIN use cases and exploits against
CAP_NET_ADMIN modules, and finally 3) separating and disabling the
implicit operation from the explicit one, for various reasons that are
noted in the commit log such as explicitly loading the good module
version, led to the conclusion that we need three modes!

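A small model of the sysctl write semantics debated above may help compare the two variants. It is a sketch under assumptions: the struct, the sticky_disable switch and autoload_sysctl_write() are hypothetical names, and the requirement that the writer holds CAP_SYS_MODULE follows what is said in this thread rather than any merged code.

```c
/* Mode values as discussed for modules_autoload_mode. */
enum {
	AUTOLOAD_ALLOWED = 0,
	AUTOLOAD_PRIVILEGED = 1,
	AUTOLOAD_DISABLED = 2,
};

struct autoload_sysctl {
	int mode;
	int sticky_disable;	/* 1: v4 behavior, mode 2 cannot be left again */
};

/* Returns 0 on success, -1 if the value is invalid or the change is refused. */
int autoload_sysctl_write(struct autoload_sysctl *s, int has_cap_sys_module,
			  int new_mode)
{
	if (new_mode < AUTOLOAD_ALLOWED || new_mode > AUTOLOAD_DISABLED)
		return -1;

	/* Only a CAP_SYS_MODULE holder may change the mode in this model. */
	if (!has_cap_sys_module)
		return -1;

	/* Securelevel-like variant: once disabled, stay disabled. */
	if (s->sticky_disable && s->mode == AUTOLOAD_DISABLED &&
	    new_mode != AUTOLOAD_DISABLED)
		return -1;

	s->mode = new_mode;
	return 0;
}
```

With sticky_disable set to 0 the model matches the plan stated above: mode 2 becomes an ordinary value that a sufficiently privileged writer can change back.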
Thanks!

-- 
tixxdz


Re: [PATCH v4 next 1/3] modules:capabilities: allow __request_module() to take a capability argument

2017-05-23 Thread Djalal Harouni
On Tue, May 23, 2017 at 12:20 AM, Kees Cook <keesc...@chromium.org> wrote:
> On Mon, May 22, 2017 at 4:57 AM, Djalal Harouni <tix...@gmail.com> wrote:
>> This is a preparation patch for the module auto-load restriction feature.
>>
>> In order to restrict module auto-load operations we need to check if the
>> caller has the CAP_SYS_MODULE capability. This aligns the security
>> checks of automatic module loading with the checks of the explicit
>> operations.
>>
>> However, "netdev-%s" modules are allowed to be loaded if
>> CAP_NET_ADMIN is set. Therefore, in order to not break this assumption,
>> and to allow userspace to load only "netdev-%s" modules with the
>> CAP_NET_ADMIN capability, which is considered a privileged operation,
>> we have two choices: 1) parse the "netdev-%s" alias and check the
>> capability or 2) hand the capability from request_module() to the
>> security_kernel_module_request() hook and let the capability
>> subsystem decide.
>>
>> After a discussion with Rusty Russell [1], the suggestion was to pass
>> the capability from request_module() to security_kernel_module_request()
>> for 'netdev-%s' modules that need CAP_NET_ADMIN.
>>
>> The patch does not update request_module(), it updates the internal
>> __request_module() that will take an extra "allow_cap" argument. If
>> positive, then automatic module load operation can be allowed.
>
> I find this refactor slightly confusing. I would expect to collapse
> the existing caps checks in net/core/dev_ioctl.c and
> net/ipv4/tcp_cong.c, and make this a "required cap" argument, and to
> add a new non-__ function instead of requiring callers use
> __request_module.
>
> request_module_capable(int cap_required, fmt, args);
>
> adjust __request_module() for the new arg, and when cap_required !=
> -1, perform a cap check.
>
> Then make request_module pass -1 to __request_module(), and change
> dev_ioctl.c (and tcp_cong.c) from:
>
> if (no_module && capable(CAP_NET_ADMIN))
>         no_module = request_module("netdev-%s", name);
> if (no_module && capable(CAP_SYS_MODULE))
>         request_module("%s", name);
>
> to:
>
> if (no_module)
>         no_module = request_module_capable(CAP_NET_ADMIN,
>                                            "netdev-%s", name);
> if (no_module)
>         no_module = request_module_capable(CAP_SYS_MODULE, "%s", name);
>
> that'll make the code cleaner, too.
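
The interface proposed in the quoted text can be sketched in user space to
make the call flow concrete. This is only a model under stated assumptions,
not the kernel code: held_caps, capable(), do_request_module(), last_request
and load_netdev_module() are hypothetical stand-ins, and the -EPERM return
value is an assumption. Only request_module_capable(), the "-1 means no
check" convention and the two call sites come from the suggestion above.

```c
#include <errno.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>

/* Capability numbers as defined in <linux/capability.h>. */
#define CAP_NET_ADMIN  12
#define CAP_SYS_MODULE 16

/* Hypothetical stand-in for the caller's capability set (a bitmask). */
unsigned long held_caps;

/* Hypothetical stand-in for the kernel's capable(). */
bool capable(int cap)
{
	return (held_caps >> cap) & 1UL;
}

/* Records the last alias that would have been handed to modprobe. */
char last_request[64];

/*
 * Model of the internal helper with the extra argument:
 * cap_required == -1 means "no capability check".
 * Returns 0 on success, -EPERM (assumed) when the check fails.
 */
int do_request_module(int cap_required, const char *fmt, va_list ap)
{
	if (cap_required != -1 && !capable(cap_required))
		return -EPERM;	/* denied before modprobe would run */
	vsnprintf(last_request, sizeof(last_request), fmt, ap);
	return 0;
}

/* The new non-__ wrapper suggested in the review. */
int request_module_capable(int cap_required, const char *fmt, ...)
{
	va_list ap;
	int ret;

	va_start(ap, fmt);
	ret = do_request_module(cap_required, fmt, ap);
	va_end(ap);
	return ret;
}

/* Call site after the refactor, following the quoted snippet. */
int load_netdev_module(const char *name)
{
	int no_module = 1;

	if (no_module)
		no_module = request_module_capable(CAP_NET_ADMIN,
						   "netdev-%s", name);
	if (no_module)
		no_module = request_module_capable(CAP_SYS_MODULE, "%s", name);
	return no_module;
}
```

With only CAP_NET_ADMIN held, the model requests the "netdev-" alias; with
only CAP_SYS_MODULE held, it falls through to the plain name.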

The refactoring in the patch is more for backward compatibility with
CAP_NET_ADMIN, as discussed here: https://lkml.org/lkml/2017/4/26/147

I think if there is a request_module_capable() interface, then code
will use it. The DCCP code path did not check capabilities at all and
called request_module(); other code does the same.

A new interface can be abused, and as a result we may break
"modules_autoload_mode" in modes 0 and 1. In the long term, code will
want to change may_autoload_module() to also allow mode 1 to load a
module with CAP_NET_ADMIN or other caps in its own userns, resulting
in "modules_autoload_mode == 0 == 1". Without userns in the game we
may just see request_module_capable(CAP_SYS_ADMIN, ...). There is
already some code (maybe phonet sockets?) that requires CAP_SYS_ADMIN to
get the appropriate protocol, and no one will be able to review all
this code or track new patches with request_module_capable() callers.

Kernel modules are global resources, and this patchset makes sure to
treat them that way. It aligns explicit and implicit module loading
operations with CAP_SYS_MODULE. The description uses PRIVILEGED with
regard to CAP_NET_ADMIN backward compatibility only. Capabilities
just got more useful thanks to the ambient caps, but please let's make
sure that inherited caps do not break "modules_autoload_mode=1" and
make it act like "modules_autoload_mode=0". We would end up handing the
permission checks from the module subsystem to the ones that are
requesting it. The module subsystem should not blindly trust
request_module() calls that come from outdated modules.

I don't mind refactoring the code so it is better, but IMO we should
avoid exposing or playing the capability game unless there is a good
reason. If CAP_SYS_MODULE is not set, then my devices should not
autoload *old* modules without me; that is easy to set and to track.
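
To make the mode semantics argued about here concrete, the following is a
minimal user-space model of the decision, assuming the behaviour described
in this series: mode 1 requires CAP_SYS_MODULE unless a positive "allow_cap"
was passed for the CAP_NET_ADMIN "netdev-%s" case, and mode 2 denies
everyone. Mode 0 is assumed to mean "no restriction", since its description
is cut off in the patch text. The enum and function names are hypothetical
and this is not the actual may_autoload_module() implementation.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical names for the three sysctl values discussed in the thread. */
enum autoload_mode {
	AUTOLOAD_ALLOWED    = 0,	/* assumed: no restriction */
	AUTOLOAD_PRIVILEGED = 1,	/* "Only privileged" */
	AUTOLOAD_DISABLED   = 2,	/* disabled for all */
};

/*
 * Model of the auto-load permission check.
 * has_cap_sys_module: the caller holds CAP_SYS_MODULE.
 * allow_cap: positive when the requester already verified the
 *            backward-compatible capability (CAP_NET_ADMIN for netdev-%s).
 * Returns 0 if the auto-load may proceed, -EPERM otherwise.
 */
int may_autoload_module_model(int mode, bool has_cap_sys_module, int allow_cap)
{
	switch (mode) {
	case AUTOLOAD_ALLOWED:
		return 0;
	case AUTOLOAD_PRIVILEGED:
		if (has_cap_sys_module || allow_cap > 0)
			return 0;
		return -EPERM;
	case AUTOLOAD_DISABLED:
	default:
		/* Unknown values are treated like "disabled" in this model. */
		return -EPERM;
	}
}
```

The concern raised above is visible in the middle case: every additional
capability accepted through allow_cap moves mode 1 closer to mode 0.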

I would also add that the per-task module auto-load flag uses the
same *may_autoload_module()* checks for one reason: consistency. Some
capability usage in the kernel is not consistent; these patches make
sure not to fall into that trap. If you set the global sysctl for your
secure server, or the per-task for containers and sandboxes for
cluster nodes, desktops or whate

Re: [kernel-hardening] [PATCH v4 next 0/3] modules: automatic module loading restrictions

2017-05-22 Thread Djalal Harouni
On Mon, May 22, 2017 at 6:43 PM, Solar Designer <so...@openwall.com> wrote:
> On Mon, May 22, 2017 at 03:49:15PM +0200, Djalal Harouni wrote:
>> On Mon, May 22, 2017 at 2:08 PM, Solar Designer <so...@openwall.com> wrote:
>> > On Mon, May 22, 2017 at 01:57:03PM +0200, Djalal Harouni wrote:
>> >> *) When modules_autoload_mode is set to (2), automatic module loading is
>> >> disabled for all. Once set, this value can not be changed.
>> >
>> > What purpose does this securelevel-like property ("Once set, this value
>> > can not be changed.") serve here?  I think this mode 2 is needed, but
>> > without this extra property, which is bypassable by e.g. explicitly
>> > loaded kernel modules anyway (and that's OK).
>>
>> My reasoning about "Once set, this value can not be changed" is mainly for:
>>
>> If you have some systems where modules are not updated for any given
>> reason, then the only one who will be able to load a module is an
>> administrator, basically this is a shortcut for:
>>
>> * Apps/services can run with CAP_NET_ADMIN but they are not allowed to
>> auto-load 'netdev' modules.
>>
>> * Explicitly loading modules can be guarded by seccomp filters *per*
>> app, so even if these apps have
>>   CAP_SYS_MODULE they won't be able to explicitly load modules, one
>> has to remount some sysctl /proc/ entries read-only here and remove
>> CAP_SYS_ADMIN for all apps anyway.
>>
>> This mainly serves the purpose of these systems that do not receive
>> updates, if I don't want to expose those kernel interfaces what should
>> I do ? then if I want to unload old versions and replace them with new
>> ones what operation should be allowed ? and only real root of the
>> system can do it. Hence, the "Once set, this value can not be changed"
>> is more of a shortcut, also the idea was put in my mind based on how
>> "modules_disabled" is disabled forever, and some other interfaces. I
>> would say: it is easy to handle a transition from 1) "hey this system
>> is still up to date, some features should be exposed" to 2) "this
>> system is not up to date anymore, only root should expose some
>> features..."
>>
>> Hmm, I am not sure if this answers your question ? :-)
>
> This answers my question, but in a way that I summarize as "there's no
> good reason to include this securelevel-like property".
>

Hmm, sorry, I forgot to add in my previous comment that on such
systems CAP_SYS_MODULE can be used to reset the
"modules_autoload_mode" sysctl from mode 2 back to mode 1. Even if we
disable it, privileged tasks can be made to overwrite the sysctl
flag and get it back unless /proc is read-only... that's one of the
points: it should not be so easy to relax it.
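
The "once set, this value can not be changed" property under discussion can
be modelled as a one-way write handler. This is a sketch in user space under
assumptions: autoload_mode_write() is a hypothetical name, the error codes
are guesses, and only the locking of mode 2 is modelled (transitions between
modes 0 and 1 are assumed to stay possible).

```c
#include <errno.h>

#define AUTOLOAD_MODE_MAX 2	/* highest mode discussed in this thread */

/*
 * Model of a securelevel-like sysctl write.
 * Returns 0 and updates *mode on success, -EINVAL for an out-of-range
 * value, and -EPERM (assumed) when trying to leave mode 2.
 */
int autoload_mode_write(int *mode, int requested)
{
	if (requested < 0 || requested > AUTOLOAD_MODE_MAX)
		return -EINVAL;
	/* Once the strictest mode is set, it can no longer be relaxed. */
	if (*mode == AUTOLOAD_MODE_MAX && requested != AUTOLOAD_MODE_MAX)
		return -EPERM;
	*mode = requested;
	return 0;
}
```

In this model a privileged task that is made to write "1" after the
administrator selected mode 2 gets an error instead of relaxing the setting.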



>> I definitively don't want to fall into "modules_disabled" trap where
>> is it too strict! "Once set, this value can not be changed" means for
>> some users do not set it otherwise the system is unusable...
>>
>> Maybe an extra "4" mode for that ? better get it right.
>
> I think you should simply exclude this property from mode 2.
>

Ok, maybe my comment above answers this?

What I was referring to here is to have one small window where it is
disabled for privileged tasks, kept separate from the securelevel-like
property of disabling it definitively. I don't have a strong opinion
here; having a usable system is important.


> The module autoloading restrictions aren't meant to reduce root's
> powers; they're only meant to protect processes from shooting themselves
> and the system in the foot inadvertently (confused deputy).
>
> modules_disabled may be different in that respect, although with the
> rest of the kernel lacking securelevel-like support the point is moot.
>
> We had working securelevel in 2.0.34 through 2.0.40 inclusive, but
> we've lost it in 2.1+ with cap-bound apparently never becoming as
> complete a replacement for it and having been lost/broken further in
> 2.6.25+.  I regret this, but that's a different story.  Like I say,
> module autoloading doesn't even fit in with those restrictions - it's
> about a totally different threat model.
>

Ok, thanks for the information. So yes, it seems we do not have such a
consistent mechanism, but this did not stop the Yama LSM and other sysctls
from implementing their own cases. Maybe it showed that it is not that
easy to have a generic securelevel mechanism, and that what we currently
have is more practical? I can't tell here. But we definitely want to
block privileged tasks from reverting the sysctl mode if the administrator
does not want automatic module loading.

Thanks!

-- 
tixxdz


Re: [kernel-hardening] [PATCH v4 next 0/3] modules: automatic module loading restrictions

2017-05-22 Thread Djalal Harouni
Hi Alexander,

On Mon, May 22, 2017 at 2:08 PM, Solar Designer <so...@openwall.com> wrote:
> Hi Djalal,
>
> Thank you for your work on this!
>
> On Mon, May 22, 2017 at 01:57:03PM +0200, Djalal Harouni wrote:
>> *) When modules_autoload_mode is set to (2), automatic module loading is
>> disabled for all. Once set, this value can not be changed.
>
> What purpose does this securelevel-like property ("Once set, this value
> can not be changed.") serve here?  I think this mode 2 is needed, but
> without this extra property, which is bypassable by e.g. explicitly
> loaded kernel modules anyway (and that's OK).

My reasoning about "Once set, this value can not be changed" is mainly this:

If you have some systems where modules are not updated for any given
reason, then the only one who will be able to load a module is an
administrator. Basically this is a shortcut for:

* Apps/services can run with CAP_NET_ADMIN but they are not allowed to
  auto-load 'netdev' modules.

* Explicitly loading modules can be guarded by seccomp filters *per*
  app, so even if these apps have CAP_SYS_MODULE they won't be able to
  explicitly load modules. One has to remount some sysctl /proc/ entries
  read-only here and remove CAP_SYS_ADMIN for all apps anyway.
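
A per-app seccomp guard of the kind mentioned in the second point could look
roughly like the sketch below. It uses the classic BPF seccomp interface via
prctl(2); the function name deny_module_loading() is hypothetical, and the
choice to return EPERM rather than kill the task is an assumption. For
brevity the filter does not inspect seccomp_data.arch, which a production
filter should do to cover compat syscall ABIs.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

/*
 * Install a seccomp filter that makes init_module(2) and finit_module(2)
 * fail with EPERM for the calling task and its future children.
 * Returns 0 on success, -1 with errno set on failure.
 */
int deny_module_loading(void)
{
	struct sock_filter filter[] = {
		/* Load the syscall number. */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		/* init_module: jump two instructions ahead to the deny rule. */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_init_module, 2, 0),
		/* finit_module: jump one instruction ahead to the deny rule. */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_finit_module, 1, 0),
		/* Everything else is allowed. */
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
		/* Deny rule: fail the syscall with EPERM. */
		BPF_STMT(BPF_RET | BPF_K,
			 SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
	};
	struct sock_fprog prog = {
		.len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
		.filter = filter,
	};

	/* Required so that an unprivileged task may install a filter. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
		return -1;
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

A service manager or sandbox would call this before exec'ing the app. Note
that it only covers explicit loading; automatic loading triggered through
other syscalls is what the sysctl in this series is about.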

This mainly serves the purpose of these systems that do not receive
updates: if I don't want to expose those kernel interfaces, what should
I do? Then, if I want to unload old versions and replace them with new
ones, what operation should be allowed? Only the real root of the
system can do it. Hence, the "Once set, this value can not be changed"
is more of a shortcut. The idea also came to mind based on how
"modules_disabled" is disabled forever, and some other interfaces. I
would say: it is easy to handle a transition from 1) "hey this system
is still up to date, some features should be exposed" to 2) "this
system is not up to date anymore, only root should expose some
features..."

Hmm, I am not sure if this answers your question? :-)

I definitely don't want to fall into the "modules_disabled" trap where
it is too strict! "Once set, this value can not be changed" means that
some users will not set it, otherwise the system is unusable...

Maybe an extra "4" mode for that? Better get it right.

Thanks!

-- 
tixxdz


[PATCH v4 next 2/3] modules:capabilities: automatic module loading restriction

2017-05-22 Thread Djalal Harouni
"dccp: fix freeing skb too early for IPV6_RECVPKTINFO".

Before:
$ lsmod | grep dccp
$ strace ./dccp_trigger
...
socket(AF_INET6, SOCK_DCCP, IPPROTO_IP) = 3
...
$ lsmod | grep dccp
dccp_ipv6  24576  5
dccp_ipv4  24576  5 dccp_ipv6
dccp  102400  2 dccp_ipv6,dccp_ipv4

After:
Only privileged:
# echo 1 > /proc/sys/kernel/modules_autoload_mode
$ lsmod | grep dccp
$ strace ./dccp_trigger
...
socket(AF_INET6, SOCK_DCCP, IPPROTO_IP) = -1 ESOCKTNOSUPPORT (Socket type not supported)
...
$ lsmod | grep dccp
$ dmesg
...
[  175.945063] module: automatic module loading of net-pf-10-proto-0-type-6 by "dccp_trigger"[1390] was denied
[  175.947952] module: automatic module loading of net-pf-10-proto-0 by "dccp_trigger"[1390] was denied
[  175.956061] module: automatic module loading of net-pf-10-proto-0-type-6 by "dccp_trigger"[1390] was denied
[  175.959733] module: automatic module loading of net-pf-10-proto-0 by "dccp_trigger"[1390] was denied

$ sudo strace ./dccp_trigger
...
socket(AF_INET6, SOCK_DCCP, IPPROTO_IP) = 3
...
$ lsmod | grep dccp
dccp_ipv6  24576  6
dccp_ipv4  24576  5 dccp_ipv6
dccp  102400  2 dccp_ipv6,dccp_ipv4

Disable automatic module loading:
$ lsmod | grep dccp
$ su - root
# echo 2 > /proc/sys/kernel/modules_autoload_mode
# strace ./dccp_trigger
...
socket(AF_INET6, SOCK_DCCP, IPPROTO_IP) = -1 ESOCKTNOSUPPORT (Socket type not supported)
...
$ lsmod | grep dccp
$ dmesg
...
[  126.596545] module: automatic module loading of net-pf-10-proto-0-type-6 by "dccp_trigger"[1291] was denied
[  126.598800] module: automatic module loading of net-pf-10-proto-0 by "dccp_trigger"[1291] was denied
[  126.601264] module: automatic module loading of net-pf-10-proto-0-type-6 by "dccp_trigger"[1291] was denied
[  126.602839] module: automatic module loading of net-pf-10-proto-0 by "dccp_trigger"[1291] was denied

As an example, this blocks abuses: DCCP can still be explicitly loaded by
an administrator using modprobe, while at the same time automatic module
loading is disabled forever.
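
The source of the dccp_trigger test program is not part of this message.
Judging only from the strace lines above, its relevant step is a single
socket(2) call, which can be sketched as follows. The function name
try_dccp_socket() is hypothetical; the fallback value 6 for SOCK_DCCP matches
the "type-6" suffix of the aliases in the dmesg output.

```c
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SOCK_DCCP
#define SOCK_DCCP 6	/* matches "net-pf-10-proto-0-type-6" in the logs */
#endif

/*
 * Ask the kernel for an IPv6 DCCP socket. On a kernel where DCCP is built
 * as a module and not yet loaded, this is the request that triggers the
 * automatic module load.
 * Returns 0 if the socket could be created, otherwise the errno value
 * reported by socket(2).
 */
int try_dccp_socket(void)
{
	int fd = socket(AF_INET6, SOCK_DCCP, IPPROTO_IP);

	if (fd < 0)
		return errno;
	close(fd);
	return 0;
}
```

Called from main() by an unprivileged user, the function would be expected to
return 0 in the "Before" case and ESOCKTNOSUPPORT in the restricted cases
shown in the transcripts.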

[1] http://www.openwall.com/lists/oss-security/2017/02/22/3
[2] http://www.openwall.com/lists/oss-security/2017/03/29/2
[3] https://github.com/xairy/kernel-exploits/tree/master/CVE-2017-6074

Cc: Ben Hutchings <ben.hutchi...@codethink.co.uk>
Cc: Rusty Russell <ru...@rustcorp.com.au>
Cc: James Morris <james.l.mor...@oracle.com>
Cc: Serge Hallyn <se...@hallyn.com>
Cc: Andy Lutomirski <l...@kernel.org>
Suggested-by: Kees Cook <keesc...@chromium.org>
Signed-off-by: Djalal Harouni <tix...@gmail.com>
---
 Documentation/sysctl/kernel.txt | 51 +
 include/linux/module.h  | 19 ++-
 include/linux/security.h|  3 ++-
 kernel/module.c | 42 +
 kernel/sysctl.c | 40 
 security/commoncap.c| 24 +++
 6 files changed, 177 insertions(+), 2 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index bac23c1..3cc6592 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -43,6 +43,7 @@ show up in /proc/sys/kernel:
 - l2cr[ PPC only ]
 - modprobe==> Documentation/debugging-modules.txt
 - modules_disabled
+- modules_autoload_mode
 - msg_next_id[ sysv ipc ]
 - msgmax
 - msgmnb
@@ -411,6 +412,56 @@ to false.  Generally used with the "kexec_load_disabled" toggle.
 
 ==
 
+modules_autoload_mode:
+
+A sysctl to control if modules auto-load feature is allowed or not.
+This sysctl complements "modules_disabled" which is for all module
+operations where this flag applies only to automatic module loading.
+Automatic module loading happens when programs request a kernel
+feature that is implemented by an unloaded module, the kernel
+automatically runs the program pointed by "modprobe" sysctl in order
+to load the corresponding module.
+
+Historically, the kernel was always able to automatically load modules
+if they are not blacklisted. This is one of the most important and
+transparent operations of Linux, it allows to provide numerous other
+features as they are needed which is crucial for a better user experience.
+However, as Linux is popular now and used for different appliances some
+of these may need to control such operations. For such systems, recent
+needs showed that in some cases allowing to control automatic module
+loading is as important as the operation itself. Restricting unprivileged
+programs or attackers that abuse this feature to load unused modules or
+modules that contain bugs is a significant security measure.
+
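+The three modes that "modules_autoload_mode" support allow to provide
+restrictions on automatic module loading without breaking user
+experience.
+
+When modules_autoload_mode is set to (0), the d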

[PATCH v4 next 3/3] modules:capabilities: add a per-task modules auto-load mode

2017-05-22 Thread Djalal Harouni
proto-0 by 
"dccp_trigger"[1873] was denied

As shown, this blocks automatic module loading per task. The system stays
usable: only selected sandboxed apps or containers are restricted from
triggering automatic module loading, while other parts of the system, such
as the desktop, can continue to work as before.
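
For illustration only, and not part of the patch: the diff below adds a
"ModulesAutoloadMode" line to /proc/PID/status, so a sandbox tool could
check a task's mode roughly as follows. The helper names
parse_autoload_mode() and read_own_autoload_mode() are hypothetical, and on
a kernel without this series the field is absent and the sketch returns -1.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return the value of the "ModulesAutoloadMode:" field found in a
 * /proc/PID/status style buffer, or -1 if the field is absent or
 * malformed (for example on a kernel without this series).
 */
int parse_autoload_mode(const char *status)
{
	static const char key[] = "ModulesAutoloadMode:";
	const char *line = status;

	while (line && *line) {
		if (strncmp(line, key, sizeof(key) - 1) == 0) {
			int mode;

			/* %d skips the tab or spaces after the colon */
			if (sscanf(line + sizeof(key) - 1, "%d", &mode) == 1)
				return mode;
			return -1;
		}
		line = strchr(line, '\n');
		if (line)
			line++;	/* step past the newline to the next line */
	}
	return -1;
}

/* Read /proc/self/status and report the calling task's mode. */
int read_own_autoload_mode(void)
{
	char buf[8192];
	size_t n;
	FILE *f = fopen("/proc/self/status", "r");

	if (!f)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';
	return parse_autoload_mode(buf);
}
```

Parsing is kept separate from the file read so that the field handling can
be exercised on a plain string.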

[1] http://www.openwall.com/lists/oss-security/2017/02/22/3
[2] http://www.openwall.com/lists/oss-security/2017/03/29/2
[3] https://github.com/xairy/kernel-exploits/tree/master/CVE-2017-6074

Cc: Ben Hutchings <ben.hutchi...@codethink.co.uk>
Cc: Rusty Russell <ru...@rustcorp.com.au>
Cc: James Morris <james.l.mor...@oracle.com>
Cc: Serge Hallyn <se...@hallyn.com>
Cc: Andy Lutomirski <l...@kernel.org>
Cc: Kees Cook <keesc...@chromium.org>
Signed-off-by: Djalal Harouni <tix...@gmail.com>
---
 Documentation/filesystems/proc.txt |   3 +
 Documentation/userspace-api/index.rst  |   1 +
 .../userspace-api/modules_autoload_mode.rst| 115 +
 fs/proc/array.c|   6 ++
 include/linux/init_task.h  |   8 ++
 include/linux/module.h |  26 -
 include/linux/sched.h  |   5 +
 include/uapi/linux/prctl.h |   8 ++
 kernel/module.c|  61 ++-
 security/commoncap.c   |  38 ++-
 10 files changed, 263 insertions(+), 8 deletions(-)
 create mode 100644 Documentation/userspace-api/modules_autoload_mode.rst

diff --git a/Documentation/filesystems/proc.txt 
b/Documentation/filesystems/proc.txt
index adba21b..58127f0 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -194,6 +194,7 @@ read the file /proc/PID/status:
   CapBnd: 
   NoNewPrivs: 0
   Seccomp:0
+  ModulesAutoloadMode:0
   voluntary_ctxt_switches:0
   nonvoluntary_ctxt_switches: 1
 
@@ -267,6 +268,8 @@ Table 1-2: Contents of the status files (as of 4.8)
  CapBnd  bitmap of capabilities bounding set
  NoNewPrivs  no_new_privs, like prctl(PR_GET_NO_NEW_PRIV, ...)
  Seccomp seccomp mode, like prctl(PR_GET_SECCOMP, ...)
+ ModulesAutoloadMode modules auto-load mode, like
+ prctl(PR_GET_MODULES_AUTOLOAD_MODE, ...)
  Cpus_allowedmask of CPUs on which this process may run
  Cpus_allowed_list   Same as previous, but in "list format"
  Mems_allowedmask of memory nodes allowed to this process
diff --git a/Documentation/userspace-api/index.rst 
b/Documentation/userspace-api/index.rst
index 7b2eb1b..bfd51b7 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -17,6 +17,7 @@ place where this information is gathered.
:maxdepth: 2
 
no_new_privs
+   modules_autoload_mode
seccomp_filter
unshare
 
diff --git a/Documentation/userspace-api/modules_autoload_mode.rst 
b/Documentation/userspace-api/modules_autoload_mode.rst
new file mode 100644
index 000..7355b00
--- /dev/null
+++ b/Documentation/userspace-api/modules_autoload_mode.rst
@@ -0,0 +1,115 @@
+======================================
+Per-task module auto-load restrictions
+======================================
+
+
+Introduction
+============
+
+Usually a request to a kernel feature that is implemented by a module
+that is not loaded may trigger automatic module loading feature, allowing
+to transparently satisfy userspace, and provide numerous other features
+as they are needed. In this case an implicit kernel module load
+operation happens.
+
+In most cases to load or unload a kernel module, an explicit operation
+happens where programs are required to have ``CAP_SYS_MODULE`` capability
+to perform so. However, with implicit module loading, no capabilities are
+required, or only ``CAP_NET_ADMIN`` in rare cases where the module has the
+'netdev-%s' alias. Historically this was always the case as automatic
+module loading is one of the most important and transparent operations
+of Linux, users expect that their programs just work, yet, recent cases
+showed that this can be abused by unprivileged users or attackers to load
+modules that were not updated, or modules that contain bugs and
+vulnerabilities.
+
+Currently most of Linux code is in a form of modules, hence, allowing to
+control automatic module loading in some cases is as important as the
+operation itself, especially in the context where Linux is used in
+different appliances.
+
+Restricting automatic module loading allows administrators to have the
+appropriate time to update or deny module autoloading in advance. In a
+container or sandbox world where apps can be moved from one context to
+another, the ability to restrict some containers or apps to load extra
+kernel modules will prevent exposing some kernel interfaces that may not
+receive the same care as some other parts of the core. The DCCP vulnerability
+CVE-2017-6074 that can

[PATCH v4 next 1/3] modules:capabilities: allow __request_module() to take a capability argument

2017-05-22 Thread Djalal Harouni
This is a preparation patch for the module auto-load restriction feature.

In order to restrict module auto-load operations we need to check whether
the caller has the CAP_SYS_MODULE capability. This aligns the security
checks of automatic module loading with those of the explicit operations.

However, "netdev-%s" modules are allowed to be loaded if CAP_NET_ADMIN is
set. To keep this behaviour, and to let userspace load only "netdev-%s"
modules with the CAP_NET_ADMIN capability, which is considered a privileged
operation, we have two choices: 1) parse the "netdev-%s" alias and check the
capability, or 2) hand the capability from request_module() to the
security_kernel_module_request() hook and let the capability subsystem
decide.

After a discussion with Rusty Russell [1], the suggestion was to pass
the capability from request_module() to security_kernel_module_request()
for 'netdev-%s' modules that need CAP_NET_ADMIN.

The patch does not update request_module(); it updates the internal
__request_module(), which takes an extra "allow_cap" argument. If it is
positive, it names a capability that can allow the automatic module load
operation.

Only networking code, which is the exception here, will call
__request_module() directly, so we do not break userspace and CAP_NET_ADMIN
can continue to load 'netdev-%s' modules. Other kernel code should continue
to use request_module(), which calls security_kernel_module_request() and
will check for the CAP_SYS_MODULE capability in the next patch. This allows
more control over who can trigger automatic module loading.

This patch updates security_kernel_module_request() to take the
'allow_cap' argument, and updates SELinux, which is currently the only user
of the security_kernel_module_request() hook.
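
As a rough userspace model of the intended decision once the restriction
from the next patch is active (a sketch, not the kernel code): struct caller,
has_cap() and autoload_permitted() are hypothetical names, and returning
-EPERM on denial is an assumption. Only the capability numbers come from
include/uapi/linux/capability.h.

```c
#include <errno.h>
#include <stdbool.h>

/* Capability numbers as in include/uapi/linux/capability.h. */
#define CAP_NET_ADMIN	12
#define CAP_SYS_MODULE	16

/* Hypothetical stand-in for the caller's capability set, one bit each. */
struct caller {
	unsigned long long caps;
};

static bool has_cap(const struct caller *c, int cap)
{
	return cap >= 0 && cap < 64 && (c->caps & (1ULL << cap)) != 0;
}

/*
 * Model of the check: 0 allows the auto-load request, -EPERM denies it
 * (the error code is an assumption). allow_cap > 0 names one extra
 * capability that is sufficient; request_module() passes -1, meaning
 * there is no exception and only CAP_SYS_MODULE will do.
 */
int autoload_permitted(const struct caller *c, int allow_cap)
{
	if (has_cap(c, CAP_SYS_MODULE))
		return 0;
	if (allow_cap > 0 && has_cap(c, allow_cap))
		return 0;
	return -EPERM;
}
```

With this shape the alias string is never parsed: the networking caller
states up front which capability is acceptable, and every other caller
passes -1.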

Based on patch by Rusty Russell:
https://lkml.org/lkml/2017/4/26/735

Cc: Serge Hallyn <se...@hallyn.com>
Cc: Andy Lutomirski <l...@kernel.org>
Suggested-by: Rusty Russell <ru...@rustcorp.com.au>
Suggested-by: Kees Cook <keesc...@chromium.org>
Signed-off-by: Djalal Harouni <tix...@gmail.com>

[1] https://lkml.org/lkml/2017/4/24/7
---
 include/linux/kmod.h  | 15 ---
 include/linux/lsm_hooks.h |  4 +++-
 include/linux/security.h  |  4 ++--
 kernel/kmod.c | 15 +--
 net/core/dev_ioctl.c  | 10 +-
 security/security.c   |  4 ++--
 security/selinux/hooks.c  |  2 +-
 7 files changed, 38 insertions(+), 16 deletions(-)

diff --git a/include/linux/kmod.h b/include/linux/kmod.h
index c4e441e..a314432 100644
--- a/include/linux/kmod.h
+++ b/include/linux/kmod.h
@@ -32,18 +32,19 @@
 extern char modprobe_path[]; /* for sysctl */
 /* modprobe exit status on success, -ve on error.  Return value
  * usually useless though. */
-extern __printf(2, 3)
-int __request_module(bool wait, const char *name, ...);
-#define request_module(mod...) __request_module(true, mod)
-#define request_module_nowait(mod...) __request_module(false, mod)
+extern __printf(3, 4)
+int __request_module(bool wait, int allow_cap, const char *name, ...);
 #define try_then_request_module(x, mod...) \
-   ((x) ?: (__request_module(true, mod), (x)))
+   ((x) ?: (__request_module(true, -1, mod), (x)))
 #else
-static inline int request_module(const char *name, ...) { return -ENOSYS; }
-static inline int request_module_nowait(const char *name, ...) { return 
-ENOSYS; }
+static inline __printf(3, 4)
+int __request_module(bool wait, int allow_cap, const char *name, ...)
+{ return -ENOSYS; }
 #define try_then_request_module(x, mod...) (x)
 #endif
 
+#define request_module(mod...) __request_module(true, -1, mod)
+#define request_module_nowait(mod...) __request_module(false, -1, mod)
 
 struct cred;
 struct file;
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index f7914d9..7688f79 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -578,6 +578,8 @@
  * Ability to trigger the kernel to automatically upcall to userspace for
  * userspace to load a kernel module with the given name.
  * @kmod_name name of the module requested by the kernel
+ * @allow_cap capability that allows to automatically load a kernel
+ * module.
  * Return 0 if successful.
  * @kernel_read_file:
  * Read a file specified by userspace.
@@ -1516,7 +1518,7 @@ union security_list_options {
void (*cred_transfer)(struct cred *new, const struct cred *old);
int (*kernel_act_as)(struct cred *new, u32 secid);
int (*kernel_create_files_as)(struct cred *new, struct inode *inode);
-   int (*kernel_module_request)(char *kmod_name);
+   int (*kernel_module_request)(char *kmod_name, int allow_cap);
int (*kernel_read_file)(struct file *file, enum kernel_read_file_id id);
int (*kernel_post_read_file)(struct file *file, char *buf, loff_t size,
 enum kernel_read_file_id id);
diff --git a/include/linux/security.h b/include/linux/security.h
index 549cb82..2f4c9d3 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -325,7 +325,7 @

[PATCH v4 next 0/3] modules: automatic module loading restrictions

2017-05-22 Thread Djalal Harouni
sktop.


Finally, we already have a use case for the prctl() interface: enforcing it
on some systemd services [7]. We also plan to use it for our containers and
sandboxes. That pull request will be updated if this feature is merged. We
will provide "ProtectKernelModulesMode=strict" as a new directive that users
can set to make sure that, if their services/apps are compromised, they
won't be able to abuse the module auto-load operation.


# Changes since v3:
*) Renamed the sysctl from "modules_autoload" to "modules_autoload_mode"
   and the prctl() operation flag to "PR_{SET|GET}_MODULES_AUTOLOAD_MODE"
   as it was requested.

   Suggested-by: Ben Hutchings <ben.hutchi...@codethink.co.uk>


*) Updated __request_module() to take the capability that may allow to
   auto-load a module with the appropriate alias. This way we never
   parse aliases as it was requested by Rusty Russell. Security and
   SELinux hooks were updated too.

   Suggested-by: Rusty Russell <ru...@rustcorp.com.au>
   https://lkml.org/lkml/2017/4/24/7


*) Updated the code so that, to set
   prctl(PR_SET_MODULES_AUTOLOAD_MODE, 1, 0, 0, 0), the task must first
   call prctl(PR_SET_NO_NEW_PRIVS, 1) or run with CAP_SYS_ADMIN privileges
   in its namespace. If neither is true, -EACCES will be returned.

   Suggested-by: Andy Lutomirski <l...@amacapital.net>
   https://lkml.org/lkml/2017/4/22/22


*) Remove task initialization logic and other cleanups
   Suggested-by: Kees Cook <keesc...@chromium.org>


*) Other code and documentation cleanups.
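
For illustration, the permission rule for PR_SET_MODULES_AUTOLOAD_MODE
described in the list above can be modelled in plain C as below. This is a
sketch under assumptions, not the patch code: struct task_model and
model_set_autoload_mode() are hypothetical names, and the model ignores
validation of the mode value and of the unused prctl() arguments.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-task state, reduced to what the rule needs. */
struct task_model {
	bool no_new_privs;	/* prctl(PR_SET_NO_NEW_PRIVS, 1) was called */
	bool cap_sys_admin;	/* CAP_SYS_ADMIN in the task's namespace */
	int autoload_mode;	/* current per-task auto-load mode */
};

/*
 * Apply the rule from the changelog: the mode may only be set by a task
 * that has no_new_privs set or holds CAP_SYS_ADMIN; otherwise the request
 * fails with -EACCES and the mode is left untouched.
 */
int model_set_autoload_mode(struct task_model *t, int mode)
{
	if (!t->no_new_privs && !t->cap_sys_admin)
		return -EACCES;
	t->autoload_mode = mode;
	return 0;
}
```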
   

# Changes since v2:
*) Implemented as a core kernel feature inside capabilities subsystem
*) Renamed sysctl to "modules_autoload" to align with "modules_disabled"

   Suggested-by: Kees Cook <keesc...@chromium.org>

*) Improved documentation.
*) Removed unused code.


# Changes since v1:
*) Renamed module to ModAutoRestrict
*) Improved documentation to explicitly refer to module autoloading.
*) Switched to use the new task_security_alloc() hook.
*) Switched from rhash tables to use task->security since it is in
   linux-security/next branch now.
*) Check all parameters passed to prctl() syscall.
*) Many other bug fixes and documentation improvements.


Patches (3) Djalal Harouni:
 (1/3) modules:capabilities: allow __request_module() to take a capability 
argument
 (2/3) modules:capabilities: automatic module loading restriction
 (3/3) modules:capabilities: add a per-task modules auto-load mode

 Documentation/filesystems/proc.txt |   3 +
 Documentation/sysctl/kernel.txt|  51 +
 Documentation/userspace-api/index.rst  |   1 +
 .../userspace-api/modules_autoload_mode.rst| 115 +
 fs/proc/array.c|   6 ++
 include/linux/init_task.h  |   8 ++
 include/linux/kmod.h   |  15 +--
 include/linux/lsm_hooks.h  |   4 +-
 include/linux/module.h |  41 +++-
 include/linux/sched.h  |   5 +
 include/linux/security.h   |   7 +-
 include/uapi/linux/prctl.h |   8 ++
 kernel/kmod.c  |  15 ++-
 kernel/module.c|  93 +
 kernel/sysctl.c|  40 +++
 net/core/dev_ioctl.c   |  10 +-
 security/commoncap.c   |  60 +++
 security/security.c|   4 +-
 security/selinux/hooks.c   |   2 +-
 19 files changed, 470 insertions(+), 18 deletions(-)


References:
[1] http://www.openwall.com/lists/kernel-hardening/2017/02/02/21
[2] http://www.openwall.com/lists/kernel-hardening/2017/04/09/1
[3] https://lkml.org/lkml/2017/4/19/1086
[4] http://www.openwall.com/lists/oss-security/2017/02/22/3
[5] http://www.openwall.com/lists/oss-security/2017/03/29/2
[6] https://github.com/xairy/kernel-exploits/tree/master/CVE-2017-6074
[7] https://github.com/systemd/systemd/pull/5736


Re: [RFC][PATCH 0/9] VFS: Introduce mount context

2017-05-08 Thread Djalal Harouni
On Wed, May 3, 2017 at 6:04 PM, David Howells  wrote:
>
> Here are a set of patches to create a mount context prior to setting up a
> new mount, populating it with the parsed options/binary data and then
> effecting the mount.
>
> This allows namespaces and other information to be conveyed through the
> mount procedure.  It also allows extra error information to be returned
> (so many things can go wrong during a mount that a small integer isn't
> really sufficient to convey the issue).
>
> This also allows Miklós Szeredi's idea of doing:
>
> fd = fsopen("nfs");
> write(fd, "option=val", ...);
> fsmount(fd, "/mnt");


This may help to clarify the boundary between what you can do with a
vfsmount (bind) and with the filesystem. In containers, orchestration
tools, etc., bind mounts are treated dynamically: there is an
assumption on GitHub, where developers and users expect that they can
dynamically add/move mounts between namespaces. However, this won't
work with userns, so maybe this will help... My other suggestion:
clear documentation and code comments will really help! I posted and
used some patches for UID shifting within the VFS layer a year ago, and
it seems that they really need something like this!

I'm not sure where I read about netlink, but at least it should
take userspace capabilities and namespace privacy/context into account...

> that he presented at LSF-2017 to be implemented (see the relevant patches
> in the series), to which I can add:
>
> read(fd, error_buffer, ...);
>
> to read back any error message.  I didn't use netlink as that would make it
> depend on CONFIG_NET and would introduce network namespacing issues.
>
> I've implemented mount context handling for procfs and nfs.
>
> Further developments:
>
>  (*) Implement mount context support in more filesystems, ext4 being next
>  on my list.
>
>  (*) Move the walk-from-root stuff that nfs has to generic code so that you
>  can do something akin to:
>
> mount /dev/sda1:/foo/bar /mnt
>
>  See nfs_follow_remote_path() and mount_subtree().  This is slightly
>  tricky in NFS as we have to prevent referral loops.
>
>  (*) Move the pid_ns pointer from struct mount_context to struct
>  proc_mount_context as I'm not sure it's necessary for anything other
>  than procfs.

FWIW the RFC "proc: support private proc instances per pidnamespace"
[1], which I still have to clean up, will hide pid_ns under the procfs
filesystem, so maybe that's a good reason to move it and then get rid of it.

Thanks!


[1] https://lkml.org/lkml/2017/4/25/282

-- 
tixxdz

