Re: [BREAKAGE] Since 4.18, kernel sets SB_I_NODEV implicitly on userns mounts, breaking systemd-nspawn

2018-12-24 Thread Greg KH
On Sun, Dec 23, 2018 at 12:54:11PM +0200, Thomas Backlund wrote:
> On 23-12-2018 at 01:28, Linus Torvalds wrote:
> > On Sat, Dec 22, 2018 at 3:07 PM Christian Brauner wrote:
> > > 
> > > However, for this case should I resend the revert?
> > 
> > Since I was pointed at the original email thread, I just picked it up
> > from there directly. It still applied cleanly, nothing had changed in
> > that area.
> > 
> >  Linus
> > 
> 
> This should also be picked up for 4.19 lts
> 
> Greg, it's now upstream as:
> 
> From 94f82008ce30e2624537d240d64ce718255e0b80 Mon Sep 17 00:00:00 2001
> From: Christian Brauner 
> Date: Thu, 5 Jul 2018 17:51:20 +0200
> Subject: Revert "vfs: Allow userns root to call mknod on owned filesystems."

Now queued up, thanks.

greg k-h


Re: [BREAKAGE] Since 4.18, kernel sets SB_I_NODEV implicitly on userns mounts, breaking systemd-nspawn

2018-12-23 Thread Thomas Backlund

On 23-12-2018 at 01:28, Linus Torvalds wrote:
> On Sat, Dec 22, 2018 at 3:07 PM Christian Brauner wrote:
> >
> > However, for this case should I resend the revert?
>
> Since I was pointed at the original email thread, I just picked it up
> from there directly. It still applied cleanly, nothing had changed in
> that area.
>
>  Linus



This should also be picked up for 4.19 lts

Greg, it's now upstream as:

From 94f82008ce30e2624537d240d64ce718255e0b80 Mon Sep 17 00:00:00 2001
From: Christian Brauner 
Date: Thu, 5 Jul 2018 17:51:20 +0200
Subject: Revert "vfs: Allow userns root to call mknod on owned filesystems."

--
Thomas



Re: [BREAKAGE] Since 4.18, kernel sets SB_I_NODEV implicitly on userns mounts, breaking systemd-nspawn

2018-12-22 Thread Ellie Reeves

Hi,
I would like to thank you all for reacting to this issue so quickly, and 
I am really sorry for sending the message several times. I thought there
was a problem with the way it was formatted or some such, hence why I 
sent it several times, because none of the messages seemed to get through.

So yeah, real sorry about that bit, and thanking you all

Gabriel C wrote:
> On Sun, Dec 23, 2018 at 00:02, Linus Torvalds wrote:
> >
> > On Sat, Dec 22, 2018 at 2:49 PM Christian Brauner wrote:
> > >
> > > To be fair, no one apart from me was pointing out that it actually
> > > breaks people including systemd folks
> > > even though I was bringing it up with them. I even tried to fix all of
> > > userspace after this got NACKED
> >
> > Seriously, the "we don't break user space" is the #1 rule in the
> > kernel, and people should _know_ it's the #1 rule.
> >
> > If somebody ignores that rule, it needs to be escalated to me.
> > Immediately. Because I need to know.
>
> I do that usually but I didn't see Christian's revert at the time and I
> never hit that issue.
> Just saw that now because of the unusual [BREAKAGE] prefix.
>
> > I need to know so that I can override the bogus NAK, and so that we
> > can fix the breakage ASAP. The absolute last thing we need is some
> > other user space then starting to rely on the new behavior, which just
> > compounds the problem and makes it a *much* bigger problem.
>
> Yes and you are right ..
> https://github.com/lxc/lxc/pull/2438
>
> I've added a comment there about 4.20.0.
>
> BR,
>
> Gabriel



Re: [BREAKAGE] Since 4.18, kernel sets SB_I_NODEV implicitly on userns mounts, breaking systemd-nspawn

2018-12-22 Thread Gabriel C
On Sun, Dec 23, 2018 at 00:02, Linus Torvalds wrote:
>
> On Sat, Dec 22, 2018 at 2:49 PM Christian Brauner wrote:
> >
> > To be fair, no one apart from me was pointing out that it actually
> > breaks people including systemd folks
> > even though I was bringing it up with them. I even tried to fix all of
> > userspace after this got NACKED
>
> Seriously, the "we don't break user space" is the #1 rule in the
> kernel, and people should _know_ it's the #1 rule.
>
> If somebody ignores that rule, it needs to be escalated to me.
> Immediately. Because I need to know.
>

I do that usually but I didn't see Christian's revert at the time and I
never hit that issue.
Just saw that now because of the unusual [BREAKAGE] prefix.

> I need to know so that I can override the bogus NAK, and so that we
> can fix the breakage ASAP. The absolute last thing we need is some
> other user space then starting to rely on the new behavior, which just
> compounds the problem and makes it a *much* bigger problem.
>

Yes and you are right ..
https://github.com/lxc/lxc/pull/2438

I've added a comment there about 4.20.0.

BR,

Gabriel


Re: [BREAKAGE] Since 4.18, kernel sets SB_I_NODEV implicitly on userns mounts, breaking systemd-nspawn

2018-12-22 Thread Linus Torvalds
On Sat, Dec 22, 2018 at 3:07 PM Christian Brauner wrote:
>
> However, for this case should I resend the revert?

Since I was pointed at the original email thread, I just picked it up
from there directly. It still applied cleanly, nothing had changed in
that area.

Linus


Re: [BREAKAGE] Since 4.18, kernel sets SB_I_NODEV implicitly on userns mounts, breaking systemd-nspawn

2018-12-22 Thread Christian Brauner
On Sun, Dec 23, 2018 at 12:02 AM Linus Torvalds wrote:
>
> On Sat, Dec 22, 2018 at 2:49 PM Christian Brauner wrote:
> >
> > To be fair, no one apart from me was pointing out that it actually
> > breaks people including systemd folks
> > even though I was bringing it up with them. I even tried to fix all of
> > userspace after this got NACKED
>
> Seriously, the "we don't break user space" is the #1 rule in the
> kernel, and people should _know_ it's the #1 rule.
>
> If somebody ignores that rule, it needs to be escalated to me.
> Immediately. Because I need to know.

Fair enough. I usually try to be very conservative when sending patches
directly your way and Eric is otherwise very much on top of not regressing
userspace and I trust him.

However, for this case should I resend the revert?

Christian

>
> I need to know so that I can override the bogus NAK, and so that we
> can fix the breakage ASAP. The absolute last thing we need is some
> other user space then starting to rely on the new behavior, which just
> compounds the problem and makes it a *much* bigger problem.
>
> But I also need to know so that I can then make sure I know not to
> trust the person who broke rule #1.
>
> This is not some odd corner case for the kernel. This is literally the
> rule we have lived with for *decades*.
>
> So please escalate to me whenever you feel a kernel developer doesn't
> follow the first rule. Because the code that broke things *will* be
> reverted (*).
>
> Linus
>
> (*) Yes, there are exceptions. We have had situations where some
> interface was simply just a huge security issue or had some other
> fundamental issue. And we've had cases where the breakage was just so
> old that it was no longer fixable. So even rule #1 can sometimes have
> things that hold it back. But it is *very* rare. Certainly nothing
> like this.


Re: [BREAKAGE] Since 4.18, kernel sets SB_I_NODEV implicitly on userns mounts, breaking systemd-nspawn

2018-12-22 Thread Linus Torvalds
On Sat, Dec 22, 2018 at 2:49 PM Christian Brauner wrote:
>
> To be fair, no one apart from me was pointing out that it actually
> breaks people including systemd folks
> even though I was bringing it up with them. I even tried to fix all of
> userspace after this got NACKED

Seriously, the "we don't break user space" is the #1 rule in the
kernel, and people should _know_ it's the #1 rule.

If somebody ignores that rule, it needs to be escalated to me.
Immediately. Because I need to know.

I need to know so that I can override the bogus NAK, and so that we
can fix the breakage ASAP. The absolute last thing we need is some
other user space then starting to rely on the new behavior, which just
compounds the problem and makes it a *much* bigger problem.

But I also need to know so that I can then make sure I know not to
trust the person who broke rule #1.

This is not some odd corner case for the kernel. This is literally the
rule we have lived with for *decades*.

So please escalate to me whenever you feel a kernel developer doesn't
follow the first rule. Because the code that broke things *will* be
reverted (*).

Linus

(*) Yes, there are exceptions. We have had situations where some
interface was simply just a huge security issue or had some other
fundamental issue. And we've had cases where the breakage was just so
old that it was no longer fixable. So even rule #1 can sometimes have
things that hold it back. But it is *very* rare. Certainly nothing
like this.


Re: [BREAKAGE] Since 4.18, kernel sets SB_I_NODEV implicitly on userns mounts, breaking systemd-nspawn

2018-12-22 Thread Christian Brauner
On Sat, Dec 22, 2018 at 11:20 PM Linus Torvalds wrote:
>
> Eric, this is entirely unacceptable.

I would like to point out that I sent a revert for this in *July*,
before any kernel with this change
was released, for the exact same reason. But I was ignored and no one
came to back up the argument:
- https://lists.linuxfoundation.org/pipermail/containers/2018-July/039182.html
- https://lists.linuxfoundation.org/pipermail/containers/2018-July/039183.html

To be fair, no one apart from me was pointing out that it actually
breaks people including systemd folks
even though I was bringing it up with them. I even tried to fix all of
userspace after this got NACKED
( https://github.com/systemd/systemd/pull/9483 ).

Christian

>
> On Sat, Dec 22, 2018 at 12:58 PM Gabriel C  wrote:
> >
> > Added some people to CC that might want to see this..
>
> Thanks.
>
> > > Here's an email that was sent to lkml about the subject:
> > >
> > > https://lkml.org/lkml/2018/7/5/742
> > >
> > > I link also this, quoting the last of it:
> > >
> > > https://lkml.org/lkml/2018/7/5/701
> > >
> > > It has never been the case that mknod on a device node will guarantee
> > > that you even can open the device node.  The applications that regress
> > > are broken.  It doesn't mean we shouldn't be bug compatible, but we darn
> > > well should document very clearly the bugs we are being bug compatible 
> > > with.
>
> Yeah, this is complete garbage.
>
> We have very clear rules in the kernel: if some change breaks existing
> setups, it is ABSOLUTELY NEVER the application that is broken.
>
> It is the kernel.
>
> There is absolutely zero gray areas here. Eric, your behavior is
> entirely out of line, and now we apparently have a regression that
> goes back to June that I was not told about because of your incorrect
> stance.
>
> Eric, I want to make this 1000% clear: there are no user space bugs.
> If it used to work, then user space was clearly doing the right thing.
> The fact that you tried to several times claim it was buggy user space
> is a serious breach of trust. You KNOW this is the case.
>
> Seriously. There are no excuses.
>
> That commit is now reverted in my tree, and furthermore I will not
> take any pull requests from you until you have made it clear that you
> comprehend this very fundamental issue.
>
> Why did it take so long for this issue to be elevated to me?
>
> Linus


Re: [BREAKAGE] Since 4.18, kernel sets SB_I_NODEV implicitly on userns mounts, breaking systemd-nspawn

2018-12-22 Thread Linus Torvalds
Eric, this is entirely unacceptable.

On Sat, Dec 22, 2018 at 12:58 PM Gabriel C  wrote:
>
> Added some people to CC that might want to see this..

Thanks.

> > Here's an email that was sent to lkml about the subject:
> >
> > https://lkml.org/lkml/2018/7/5/742
> >
> > I link also this, quoting the last of it:
> >
> > https://lkml.org/lkml/2018/7/5/701
> >
> > It has never been the case that mknod on a device node will guarantee
> > that you even can open the device node.  The applications that regress
> > are broken.  It doesn't mean we shouldn't be bug compatible, but we darn
> > well should document very clearly the bugs we are being bug compatible with.

Yeah, this is complete garbage.

We have very clear rules in the kernel: if some change breaks existing
setups, it is ABSOLUTELY NEVER the application that is broken.

It is the kernel.

There is absolutely zero gray areas here. Eric, your behavior is
entirely out of line, and now we apparently have a regression that
goes back to June that I was not told about because of your incorrect
stance.

Eric, I want to make this 1000% clear: there are no user space bugs.
If it used to work, then user space was clearly doing the right thing.
The fact that you tried to several times claim it was buggy user space
is a serious breach of trust. You KNOW this is the case.

Seriously. There are no excuses.

That commit is now reverted in my tree, and furthermore I will not
take any pull requests from you until you have made it clear that you
comprehend this very fundamental issue.

Why did it take so long for this issue to be elevated to me?

   Linus


Re: [BREAKAGE] Since 4.18, kernel sets SB_I_NODEV implicitly on userns mounts, breaking systemd-nspawn

2018-12-22 Thread Gabriel C
Added some people to CC that might want to see this..

On Sat, Dec 22, 2018 at 19:14, Ellie Reeves wrote:
>
> Hi,
> first off, allow me to express that this is my first time ever writing
> on such a mailing list, and that if something is unclear or you would
> need more information, just let me know.
> I write to this list in the hope of seeing this change reverted. The Linux
> kernel always said it would avoid breaking user space as much as
> possible, and yet this is what happens. I was hence very much surprised
> when my perfectly working containers on systemd-nspawn, which makes use
> of userns by default, stopped working from one day to the next, till I
> identified the problem as being kernel >= 4.18. This container is in
> production, hence the annoyance it was. From one day to the next the
> container started failing with strange problems:
>
> * nginx, dovecot, postgresql, and postfix complained about getting
> permission denied on /dev/null even though it appeared perfectly normal
> to me, the correct permissions, all that
> * /var was also acting very strangely, getting a lot of permission
> denied or operation not supported messages.
> * I could not delete a file that my user had the right to create, write
> to and read in /var, I needed root
>
> Here is the pull request that was made to systemd, along with a small
> amount of talk around the issue:
>
> https://github.com/systemd/systemd/pull/9483
>
> It was ultimately decided among the systemd folks to bail out of the
> issue, as shown in the news entry for systemd 240:
>
>    * KERNEL API BREAKAGE: Linux kernel 4.18 changed behaviour regarding
>      mknod() handling in user namespaces. Previously mknod() would always
>      fail with EPERM in user namespaces. Since 4.18 mknod() will succeed
>      but device nodes generated that way cannot be opened, and attempts to
>      open them result in EPERM. This breaks the "graceful fallback" logic
>      in systemd's PrivateDevices= sand-boxing option. This option is
>      implemented defensively, so that when systemd detects it runs in a
>      restricted environment (such as a user namespace, or an environment
>      where mknod() is blocked through seccomp or absence of CAP_SYS_MKNOD)
>      where device nodes cannot be created the effect of PrivateDevices= is
>      bypassed (following the logic that 2nd-level sand-boxing is not
>      essential if the system systemd runs in is itself already sand-boxed
>      as a whole). This logic breaks with 4.18 in container managers where
>      user namespacing is used: suddenly PrivateDevices= succeeds setting
>      up a private /dev/ file system containing device nodes, but when
>      these are opened they don't work.
>
>      At this point it is recommended that container managers utilizing
>      user namespaces that intend to run systemd in the payload explicitly
>      block mknod() with seccomp or similar, so that the graceful fallback
>      logic works again.
>
>      We are very sorry for the breakage and the requirement to change
>      container configurations for newer kernels. It's purely caused by an
>      incompatible kernel change. The relevant kernel developers have been
>      notified about this userspace breakage quickly, but they chose to
>      ignore it.
>
> Here's an email that was sent to lkml about the subject:
>
> https://lkml.org/lkml/2018/7/5/742
>
> I link also this, quoting the last of it:
>
> https://lkml.org/lkml/2018/7/5/701
>
> It has never been the case that mknod on a device node will guarantee
> that you even can open the device node.  The applications that regress
> are broken.  It doesn't mean we shouldn't be bug compatible, but we darn
> well should document very clearly the bugs we are being bug compatible with.
>
> I'm of the opinion that it is a kernel bug, and I quote someone from the
> systemd irc channel:
>
> ewb said applications were broken. But the rule is, if userspace breaks,
> it's a bug. The kernel *has* to revert it. And honestly, this change
> doesn't make much sense. You can set nodev yourself but then you know
> mknod will not allow you to open the object. Here, the kernel does it
> without your knowledge
>
> Also, it seems that if this change is reverted, things that were fixed
> to work around the issue this breakage caused will not be broken again,
> they should simply go back to their previous way of working. I
> understand there may be a security reason why this change was made in the
> first place, but it is not so big a problem, is it? I can mknod
> arbitrary devices in userns and open them as userns root. But my point
> is, several things broke. My *working* stuff was broken from one day to
> the next.
>
> I am not trying to pick a fight. I want to understand the reasoning
> behind this change in 

[BREAKAGE] Since 4.18, kernel sets SB_I_NODEV implicitly on userns mounts, breaking systemd-nspawn

2018-12-22 Thread Ellie Reeves

Hi,
first off, allow me to express that this is my first time ever writing 
on such a mailing list, and that if something is unclear or you would 
need more information, just let me know.
I write to this list in the hope of seeing this change reverted. The Linux
kernel always said it would avoid breaking user space as much as
possible, and yet this is what happens. I was hence very much surprised
when my perfectly working containers on systemd-nspawn, which makes use
of userns by default, stopped working from one day to the next, till I
identified the problem as being kernel >= 4.18. This container is in
production, hence the annoyance it was. From one day to the next the
container started failing with strange problems:


* nginx, dovecot, postgresql, and postfix complained about getting 
permission denied on /dev/null even though it appeared perfectly normal 
to me, the correct permissions, all that
* /var was also acting very strangely, getting a lot of permission 
denied or operation not supported messages.
* I could not delete a file that my user had the right to create, write 
to and read in /var, I needed root


Here is the pull request that was made to systemd, along with a small 
amount of talk around the issue:


https://github.com/systemd/systemd/pull/9483

It was ultimately decided among the systemd folks to bail out of the 
issue, as shown in the news entry for systemd 240:


    * KERNEL API BREAKAGE: Linux kernel 4.18 changed behaviour regarding
      mknod() handling in user namespaces. Previously mknod() would always
      fail with EPERM in user namespaces. Since 4.18 mknod() will succeed
      but device nodes generated that way cannot be opened, and attempts to
      open them result in EPERM. This breaks the "graceful fallback" logic
      in systemd's PrivateDevices= sand-boxing option. This option is
      implemented defensively, so that when systemd detects it runs in a
      restricted environment (such as a user namespace, or an environment
      where mknod() is blocked through seccomp or absence of CAP_SYS_MKNOD)
      where device nodes cannot be created the effect of PrivateDevices= is
      bypassed (following the logic that 2nd-level sand-boxing is not
      essential if the system systemd runs in is itself already sand-boxed
      as a whole). This logic breaks with 4.18 in container managers where
      user namespacing is used: suddenly PrivateDevices= succeeds setting
      up a private /dev/ file system containing device nodes, but when
      these are opened they don't work.

      At this point it is recommended that container managers utilizing
      user namespaces that intend to run systemd in the payload explicitly
      block mknod() with seccomp or similar, so that the graceful fallback
      logic works again.

      We are very sorry for the breakage and the requirement to change
      container configurations for newer kernels. It's purely caused by an
      incompatible kernel change. The relevant kernel developers have been
      notified about this userspace breakage quickly, but they chose to
      ignore it.
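
The three outcomes that news entry distinguishes (mknod() refused, mknod()
succeeding but the node not openable, and both working) can be told apart
with a small probe. The following is a minimal Python sketch, not code from
systemd or the kernel. The name `probe_device_nodes` and its injectable
`mknod`/`opener` parameters are hypothetical, chosen only so the logic can be
exercised without privileges. It treats both EPERM and EACCES on open as
"unusable", which is an assumption, since the entry mentions EPERM while the
report above mentions "permission denied".

```python
import errno
import os
import stat


def probe_device_nodes(directory, mknod=os.mknod, opener=os.open):
    """Classify device-node support in `directory` (hypothetical helper).

    Returns:
      "blocked"  - mknod() itself is refused (the pre-4.18 userns behaviour
                   described in the news entry)
      "unusable" - the node is created but cannot be opened (the 4.18
                   behaviour described in the news entry)
      "usable"   - the node is created and can be opened
    """
    path = os.path.join(directory, "probe-null")
    # 1:3 is the conventional major:minor pair of /dev/null on Linux.
    device = os.makedev(1, 3)
    try:
        mknod(path, stat.S_IFCHR | 0o600, device)
    except PermissionError:
        # Covers both EPERM and EACCES.
        return "blocked"
    try:
        fd = opener(path, os.O_RDWR)
    except OSError as exc:
        # Assumption: either errno means the node exists but is not openable.
        if exc.errno in (errno.EPERM, errno.EACCES):
            return "unusable"
        raise
    else:
        os.close(fd)
        return "usable"
    finally:
        # Best-effort cleanup of the probe node.
        try:
            os.unlink(path)
        except FileNotFoundError:
            pass
```

A fallback of the kind the entry describes would skip its private /dev/ setup
when the result is "blocked"; the breakage corresponds to the "unusable" case,
where the first step no longer fails and the fallback never triggers.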

Here's an email that was sent to lkml about the subject:

https://lkml.org/lkml/2018/7/5/742

I link also this, quoting the last of it:

https://lkml.org/lkml/2018/7/5/701

It has never been the case that mknod on a device node will guarantee 
that you even can open the device node.  The applications that regress 
are broken.  It doesn't mean we shouldn't be bug compatible, but we darn 
well should document very clearly the bugs we are being bug compatible with.


I'm of the opinion that it is a kernel bug, and I quote someone from the
systemd irc channel:


ewb said applications were broken. But the rule is, if userspace breaks,
it's a bug. The kernel *has* to revert it. And honestly, this change
doesn't make much sense. You can set nodev yourself but then you know 
mknod will not allow you to open the object. Here, the kernel does it 
without your knowledge
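
The quote contrasts a nodev option that an administrator sets with a
restriction the kernel applies on its own. The explicit option can be checked
from user space, as in this minimal Python sketch. `mounted_nodev` is a
hypothetical helper name; `os.statvfs` and `os.ST_NODEV` are standard library.
Whether the implicit SB_I_NODEV flag is reported through this interface is not
stated anywhere in this thread; the assumption here is that it is not, which
would fit the report that /dev/null "appeared perfectly normal".

```python
import os


def mounted_nodev(path):
    """Return True if the filesystem holding `path` reports the nodev
    mount flag (hypothetical helper; covers only the explicit option)."""
    return bool(os.statvfs(path).f_flag & os.ST_NODEV)
```

For example, `mounted_nodev("/dev")` reports whether /dev was mounted with an
explicit nodev option.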


Also, it seems that if this change is reverted, things that were fixed
to work around the issue this breakage caused will not be broken again,
they should simply go back to their previous way of working. I
understand there may be a security reason why this change was made in the
first place, but it is not so big a problem, is it? I can mknod
arbitrary devices in userns and open them as userns root. But my point
is, several things broke. My *working* stuff was broken from one day to
the next.


I am not trying to pick a fight. I want to understand the reasoning
behind this change in the first place, and I'm simply making an attempt
at getting it reverted, because it is true that I don't much fancy
blocking the mknod() syscall in every template unit on every machine we
administer here, and that staying on kernel < 4.18 is not a good
solution either.


I would also like to be personally CC'ed the comments or answers posted 
to this mailing list in response to this message.


Thanks