[PATCH RFC v3 0/7] proc: modernize proc to support multiple private instances

2017-11-09 Thread Djalal Harouni
Hi list,


Preface:

This is RFC v3 to modernize procfs and make it able to support multiple
private instances within the same pid namespace.

I have been working on this with Alexey Gladkov and Andy Lutomirski.

RFC v1 is here:
https://lkml.org/lkml/2017/3/30/670

RFC v2 is here:
https://lkml.org/lkml/2017/4/25/282

This RFC v3 can be applied on top of next-20171109

This RFC was tested on Ubuntu/Debian, and Alexey tested it on altlinux.
It does not work on Fedora due to a bug during boot with dracut; I did
not have time to investigate it further. I will make sure to fix it in
the next iteration. We decided to send it anyway to get more feedback on
the direction, and we will continue to improve it.

RFC v3 handles all previous comments from Andy Lutomirski, thank you for
all the feedback.


Procfs modernization:
---------------------
Historically procfs was always tied to pid namespaces: during pid
namespace creation we internally create a procfs mount for it. However,
this has the effect that all new procfs mounts are just a mirror of the
internal one. Any change, any mount option update, any future feature
will propagate to all other procfs mounts that are in the same pid
namespace.

This may have solved several use cases at the time. However, today we
face new requirements, and making procfs able to support new private
instances inside the same pid namespace seems a major point. If we want
to introduce new features and security mechanisms, we first have to make
sure that we do not break existing use cases. Supporting private procfs
instances will allow us to add new features and behaviour without
propagating them to all other procfs mounts.


Today procfs is more of a burden, especially for some embedded, IoT,
sandbox and container use cases. In user space we are over-mounting null
or inaccessible files on top of procfs entries to hide files and
information. If we want to hide pids we have to create PID namespaces,
because otherwise changing a mount option value in one mount will
propagate to all other proc mounts. If we want to introduce new
features, they will propagate to all other mounts too, resulting in
either useful new functionality or breakage. We also have to note that
userspace should not have to work around procfs; the kernel should just
provide a sane, simple interface.


In this regard several developers and maintainers pointed out that
there are problems with procfs and it has to be modernized:

"Here's another one: split up and modernize /proc." by Andy Lutomirski [1]

Discussion about kernel pointer leaks:
"And yes, as Kees and Daniel mentioned, it's definitely not just dmesg.
In fact, the primary things tend to be /proc and /sys, not dmesg
itself." By Linus Torvalds [2]

Many other areas in the kernel and its filesystems have been updated to
support private instances; devpts is one major example [3]. The aim
here is to modernize procfs without breaking userspace and without
affecting the shared procfs mount. Later, new features will apply to the
private instances and, after months of further testing, maybe this can be
made the default, especially for IoT.

We want the possibility to do:

  mount -t proc -onewinstance,newfeature none /proc

newfeature: we are planning new features for procfs later; for now this
RFC only introduces the "pids=all|ptraceable" mount option.

This allows us to absorb changes and make improvements without breaking
existing use cases.
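
As a rough sketch of what the mount command above could look like when a
supervisor issues it directly through mount(2): this assumes a kernel
carrying this RFC, since "newinstance" and "pids=" are the option names
proposed here and are not understood elsewhere. The helper names
build_proc_opts() and mount_private_proc() are hypothetical and only for
illustration.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/*
 * Hypothetical helper: build the option string for a private procfs
 * mount. "newinstance" and "pids=all|ptraceable" are the options
 * proposed in this RFC (assumption: the kernel has these patches).
 * Returns 0 on success or a negative errno value.
 */
int build_proc_opts(char *buf, size_t len, const char *pids_mode)
{
	int n;

	/* Only the two values named in the RFC are accepted here. */
	if (strcmp(pids_mode, "all") != 0 &&
	    strcmp(pids_mode, "ptraceable") != 0)
		return -EINVAL;

	n = snprintf(buf, len, "newinstance,pids=%s", pids_mode);
	if (n < 0 || (size_t)n >= len)
		return -ENAMETOOLONG;

	return 0;
}

/*
 * Hypothetical helper: mount a private procfs instance on target,
 * the equivalent of "mount -t proc -onewinstance,pids=... none target".
 * Needs CAP_SYS_ADMIN. Returns 0 on success or a negative errno value.
 */
int mount_private_proc(const char *target, const char *pids_mode)
{
	char opts[64];
	int ret;

	ret = build_proc_opts(opts, sizeof(opts), pids_mode);
	if (ret)
		return ret;

	if (mount("none", target, "proc", 0, opts) < 0)
		return -errno;

	return 0;
}
```

On a kernel without these patches the mount(2) call would be expected to
fail on the unknown options, so a caller should be prepared to fall back
to a plain shared procfs mount.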


This will be used for:

1) Embedded systems and IoT: usually we have one supervisor for
apps and some lightweight sandbox support. However, if we create
pid namespaces we have to manage all the processes inside them too,
whereas our goal is to run a bunch of apps, each one inside
its own mount namespace. Maybe we will use network namespaces for VLAN
setups, but right now we only want mount namespaces, without all the
other complexity. We want procfs to behave more like a real file system
and block access to inodes that belong to other users. 'hidepid=' will
not work since it is a shared mount option.


2) Containers, sandboxes and private instances of file systems - devpts case
Historically, many file systems inside the Linux kernel, when instantiated,
were just a mirror of an already created and mounted filesystem. This was the
case for the devpts filesystem; it seems that at the time the requirements
were to optimize things, reuse the same memory, etc. This design used to
work, but not anymore with today’s containers, IoT, hostile environments and
all the privacy challenges that Linux faces.

In that regard, devpts was updated so that each new mount is a totally
independent file system by the following patches:
“devpts: Make each mount of devpts an independent filesystem” by
Eric W. Biederman [3] [4]


3) Linux Security Modules have multiple ptrace paths inside some
subsystems; however, inside procfs the implementation does not guarantee
that the ptrace() check which triggers the security_ptrace_check() hook
will always run.
