Re: [systemd-devel] Docker vs PrivateTmp

2015-02-02 Thread Lennart Poettering
On Fri, 30.01.15 11:02, Alexander Larsson (al...@redhat.com) wrote:

> I think the problem is that the docker daemon makes
> /var/lib/docker/devicemapper private in the host namespace to handle
> some scalability issues we found in the kernel. This causes problems not
> with docker containers (because they unmount all other mounts as per the
> above), but with other namespace-using apps. For instance, if a service
> with PrivateTmp is launched, it will inherit the existing mounts
> in /var/lib/docker/devicemapper at the point of startup, but when these
> are eventually unmounted in the host namespace this is not propagated
> into the service (due to it being a private mount, not a slave mount).
>
> We could try making this slave instead, but I don't know if that then
> fixes the scalability issues we had, because they were related to
> stupidities in the kernel wrt propagating mounts. If it doesn't work,
> then we have to put docker-daemon in its own namespace.

The daemon should first create its own namespace, and then detach
propagation, not the other way round. This really isn't stupidity in
the kernel, but in docker's userspace...

Lennart

-- 
Lennart Poettering, Red Hat
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Docker vs PrivateTmp

2015-02-02 Thread Alexander Larsson
On Mon, 2015-02-02 at 12:12 +0100, Lennart Poettering wrote:
> The daemon should first create its own namespace, and then detach
> propagation, not the other way round. This really isn't stupidity in
> the kernel, but in docker's userspace...

The stupidity was the O(n^4) algorithm in the kernel when it was
duplicating all vfsmounts that could possibly be propagated, and then
immediately freeing them when they did not propagate, which interacted
poorly with some lame kernel O(n^2) allocator behaviour.

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
 Alexander Larsson                                            Red Hat, Inc
   al...@redhat.com                            alexander.lars...@gmail.com
He's an oversexed shark-wrestling rock star from the 'hood. She's a 
high-kicking cigar-chomping former first lady with the power to see 
death. They fight crime! 



Re: [systemd-devel] Docker vs PrivateTmp

2015-01-30 Thread Alexander Larsson
On Fri, 2015-01-23 at 11:31 -0500, Daniel J Walsh wrote:
> On 01/22/2015 10:02 PM, Lennart Poettering wrote:
>> On Sat, 17.01.15 23:02, Lars Kellogg-Stedman (l...@redhat.com) wrote:
>>
>>> See the `devicemapper` mountpoint created by Docker for the container:
>>>
>>>     # grep devicemapper/mnt /proc/mounts
>>>     /dev/mapper/docker-253:6-98310-e68df3f45d6151259ce84a0e467a3117840084e99ef3bbc654b33f08d2d6dd62 /var/lib/docker/devicemapper/mnt/e68df3f45d6151259ce84a0e467a3117840084e99ef3bbc654b33f08d2d6dd62 ext4 rw,context=system_u:object_r:svirt_sandbox_file_t:s0:c261,c1018,relatime,discard,stripe=16,data=ordered 0 0
>>
>> I am not sure why docker makes these mounts visible in the host
>> namespace at all. This smells like a bug.

They need to at least be visible to the docker daemon, because it needs
to look into them to do diffs between images when e.g. committing. They
don't necessarily have to be in the host namespace though; they could be
in a different namespace owned only by the docker daemon. I wanted to do
that, but for reasons that escape me at the moment that was problematic
and I never got to it.

>>> Watch Docker fail to destroy the container because it is unable to remove
>>> the mountpoint directory:
>>>
>>>     Jan 17 22:43:03 pk115wp-lkellogg docker-1.4.1-dev[18239]:
>>>     time=2015-01-17T22:43:03-05:00 level=error msg=Handler for DELETE
>>>     /containers/{name:.*} returned error: Cannot destroy container e68df3f45d61:
>>>     Driver devicemapper failed to remove root filesystem
>>>     e68df3f45d6151259ce84a0e467a3117840084e99ef3bbc654b33f08d2d6dd62: Device is Busy
>>
>> This smells as if Docker incorrectly sets the mount propagation bits
>> on its own mounts.
>>
>> It would be good to check /proc/self/mountinfo inside and outside of
>> docker's own namespace, and to see how the propagation bits are set
>> for the individual mounts. It's a bit hard to read, but the
>> interesting bits are in the 7th column of that file.
>>
>> In general: docker should do the equivalent of mount --make-rslave /
>> as the first thing after opening its mount namespace, so that from that
>> point on mounts and especially *un*mounts propagate from the host into
>> the container, but not vice versa.
>>
>> If they do not invoke that, then the propagation will stay at
>> shared, which means the mounts will appear in the host and vice
>> versa, which is certainly undesired.
>>
>> Also, they should not use mount --make-rprivate /, as that means
>> anything the host mounted will stay mounted in the container forever,
>> which is a problem.
>>
>> Also, they really need to make this recursive, so that all mount
>> points they have access to are detached from the host!

It has been a while since I looked at this, but I believe that the docker
containers run as MS_PRIVATE, and they explicitly unmount all the host
filesystems except the ones specifically mounted in as volumes.

I think the problem is that the docker daemon makes
/var/lib/docker/devicemapper private in the host namespace to handle
some scalability issues we found in the kernel. This causes problems not
with docker containers (because they unmount all other mounts as per the
above), but with other namespace-using apps. For instance, if a service
with PrivateTmp is launched, it will inherit the existing mounts
in /var/lib/docker/devicemapper at the point of startup, but when these
are eventually unmounted in the host namespace this is not propagated
into the service (due to it being a private mount, not a slave mount).

We could try making this slave instead, but I don't know if that then
fixes the scalability issues we had, because they were related to
stupidities in the kernel wrt propagating mounts. If it doesn't work,
then we have to put docker-daemon in its own namespace.

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
 Alexander Larsson                                            Red Hat, Inc
   al...@redhat.com                            alexander.lars...@gmail.com
He's an impetuous amnesiac hairdresser who dotes on his loving old ma. 
She's an elegant Bolivian single mother from out of town. They fight 
crime! 



Re: [systemd-devel] Docker vs PrivateTmp

2015-01-23 Thread Lennart Poettering
On Fri, 23.01.15 11:31, Daniel J Walsh (dwa...@redhat.com) wrote:

You just sent a full quote without any comment of yours?

 


Lennart

-- 
Lennart Poettering, Red Hat


Re: [systemd-devel] Docker vs PrivateTmp

2015-01-23 Thread Daniel J Walsh
Yes I was trying to get a comment from Alex, since he did the original
patch.

On 01/23/2015 12:26 PM, Lennart Poettering wrote:
> On Fri, 23.01.15 11:31, Daniel J Walsh (dwa...@redhat.com) wrote:
>
> You just sent a full quote without any comment of yours?



Re: [systemd-devel] Docker vs PrivateTmp

2015-01-23 Thread Daniel J Walsh

On 01/22/2015 10:02 PM, Lennart Poettering wrote:




Re: [systemd-devel] Docker vs PrivateTmp

2015-01-22 Thread Lennart Poettering
On Sun, 18.01.15 20:50, Colin Walters (walt...@verbum.org) wrote:

> On Sat, Jan 17, 2015, at 11:02 PM, Lars Kellogg-Stedman wrote:
>> Hello all,
>>
>> With systemd 216 on Fedora 21 (kernel 3.17.8), I have run into an odd
>> behavior concerning the PrivateTmp directive, and I am looking for
>> help identifying this as:
>>
>> - Everything Is Working As Designed, Citizen
>> - A bug in Docker (some mount flag is being set incorrectly?)
>
> This should be fixed by:
> http://pkgs.fedoraproject.org/cgit/docker-io.git/commit/?id=6c9e373ee06cb1aee07d3cae426c46002663010d
>
> i.e. having docker.service use MountFlags=private, so its mounts
> aren't visible to other processes.

MountFlags=private also disables *un*mount propagation from the host
into the service, which means that file systems that were mounted in the
host when the service was started will stay mounted in the service
forever, which will keep the backing device busy forever.

MountFlags=private is hence pretty useless in real life. Never use it.

MountFlags=shared is also pointless, since it is the implied default.

Which means MountFlags=slave is really the only option that makes
sense to ever add to a unit file.
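
The difference between the three settings can be compared with a toy model. The Python sketch below is only an illustration under simplifying assumptions: the kernel tracks propagation per mount through peer groups, not per namespace, and the `Namespace` class, its methods, and the path ending in `ID` are all made up for this example. It shows which copies of a mount go away when the host unmounts it.

```python
class Namespace:
    """Toy mount namespace: a set of mount points plus a propagation mode."""

    def __init__(self, mounts=(), parent=None, mode=None):
        self.mounts = set(mounts)
        self.parent = parent
        self.mode = mode          # "shared", "slave" or "private" (None for the host)
        self.children = []

    def unshare(self, mode):
        """Clone this namespace; the child starts with a copy of all mounts."""
        child = Namespace(self.mounts, parent=self, mode=mode)
        self.children.append(child)
        return child

    def _apply(self, op, path, origin=None):
        getattr(self.mounts, op)(path)      # op is "add" or "discard"
        # Events flow down to children that still receive from their parent.
        for child in self.children:
            if child is not origin and child.mode in ("shared", "slave"):
                child._apply(op, path, origin=self)
        # Events flow up only out of a shared child.
        if (self.parent is not None and self.parent is not origin
                and self.mode == "shared"):
            self.parent._apply(op, path, origin=self)

    def mount(self, path):
        self._apply("add", path)

    def umount(self, path):
        self._apply("discard", path)


# Hypothetical container mount, present in the host when the services start.
CONTAINER = "/var/lib/docker/devicemapper/mnt/ID"
host = Namespace({"/", CONTAINER})
services = {m: host.unshare(m) for m in ("shared", "slave", "private")}

host.umount(CONTAINER)            # the container is stopped on the host
for mode, ns in services.items():
    print(mode, "still holds the mount:", CONTAINER in ns.mounts)
```

In this model only the `private` namespace keeps the stale mount after the host unmounts it, and only the `shared` namespace leaks its own mounts back into the host, which matches the argument above for `slave`.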

Lennart

-- 
Lennart Poettering, Red Hat


Re: [systemd-devel] Docker vs PrivateTmp

2015-01-22 Thread Lennart Poettering
On Sat, 17.01.15 23:02, Lars Kellogg-Stedman (l...@redhat.com) wrote:

> See the `devicemapper` mountpoint created by Docker for the container:
>
>     # grep devicemapper/mnt /proc/mounts
>     /dev/mapper/docker-253:6-98310-e68df3f45d6151259ce84a0e467a3117840084e99ef3bbc654b33f08d2d6dd62 /var/lib/docker/devicemapper/mnt/e68df3f45d6151259ce84a0e467a3117840084e99ef3bbc654b33f08d2d6dd62 ext4 rw,context=system_u:object_r:svirt_sandbox_file_t:s0:c261,c1018,relatime,discard,stripe=16,data=ordered 0 0

I am not sure why docker makes these mounts visible in the host
namespace at all. This smells like a bug.

> Watch Docker fail to destroy the container because it is unable to remove the
> mountpoint directory:
>
>     Jan 17 22:43:03 pk115wp-lkellogg docker-1.4.1-dev[18239]:
>     time=2015-01-17T22:43:03-05:00 level=error msg=Handler for DELETE
>     /containers/{name:.*} returned error: Cannot destroy container e68df3f45d61:
>     Driver devicemapper failed to remove root filesystem
>     e68df3f45d6151259ce84a0e467a3117840084e99ef3bbc654b33f08d2d6dd62: Device is Busy

This smells as if Docker incorrectly sets the mount propagation bits
on its own mounts.

It would be good to check /proc/self/mountinfo inside and outside of
docker's own namespace, and to see how the propagation bits are set
for the individual mounts. It's a bit hard to read, but the
interesting bits are in the 7th column of that file.
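
A minimal sketch of that check in Python, assuming the field layout documented in proc(5): the optional fields (`shared:N`, `master:N`, `propagate_from:N`, `unbindable`) start at the seventh field and run up to a lone `-` separator. The names `propagation_of` and `report` are illustrative only. A mount with no optional fields at all is private.

```python
def propagation_of(mountinfo_line):
    """Return (mount_point, propagation) for one mountinfo line."""
    fields = mountinfo_line.split()
    sep = fields.index("-", 6)        # a lone "-" ends the optional fields
    optional = fields[6:sep]          # optional fields start at the 7th column
    tags = {field.split(":")[0] for field in optional}
    if "unbindable" in tags:
        kind = "unbindable"
    elif "shared" in tags and "master" in tags:
        kind = "shared+slave"
    elif "shared" in tags:
        kind = "shared"
    elif "master" in tags:
        kind = "slave"
    else:
        kind = "private"
    return fields[4], kind            # the 5th field is the mount point


def report(path="/proc/self/mountinfo"):
    """Print the propagation type of every mount listed in the given file."""
    with open(path) as f:
        for line in f:
            if line.strip():
                mount_point, kind = propagation_of(line)
                print(f"{kind:12} {mount_point}")
```

Calling `report()` shows the caller's own namespace; passing `/proc/PID/mountinfo` for a process inside the service or container shows that process's view, so the two can be compared. Applied to the devicemapper entry quoted elsewhere in this thread, which has nothing between `rw,relatime` and the `-`, `propagation_of` returns `private`.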

In general: docker should do the equivalent of mount --make-rslave /
as the first thing after opening its mount namespace, so that from that
point on mounts and especially *un*mounts propagate from the host into
the container, but not vice versa.

If they do not invoke that, then the propagation will stay at
shared, which means the mounts will appear in the host and vice
versa, which is certainly undesired.

Also, they should not use mount --make-rprivate /, as that means
anything the host mounted will stay mounted in the container forever,
which is a problem.

Also, they really need to make this recursive, so that all mount
points they have access to are detached from the host!

Lennart

-- 
Lennart Poettering, Red Hat


Re: [systemd-devel] Docker vs PrivateTmp

2015-01-19 Thread Daniel J Walsh

On 01/19/2015 12:27 AM, Lars Kellogg-Stedman wrote:
> On Sun, Jan 18, 2015 at 11:38:12PM -0500, Lars Kellogg-Stedman wrote:
>> I think we actually want MountFlags=slave, which will permit mounts
>> from the global namespace to propagate into the service namespace
>> without permitting propagation in the other direction.  It seems like
>> this would be the Least Surprising behavior.
>
> ...which would be the default if docker.service were itself using
> PrivateTmp=true, because from systemd.exec:
>
>     Note that the file system namespace related options (PrivateTmp=,
>     PrivateDevices=, ProtectSystem=, ProtectHome=, ReadOnlyDirectories=,
>     InaccessibleDirectories= and ReadWriteDirectories=) require that mount
>     and unmount propagation from the unit's file system namespace is
>     disabled, and hence downgrade shared to slave.
>
> So either explicitly setting MountFlags=slave, or setting
> PrivateTmp=true if that doesn't cause any issues of which I am not
> aware.



Vincent, what do you think about MountFlags=slave?


Re: [systemd-devel] Docker vs PrivateTmp

2015-01-19 Thread Vincent Batts

On 19/01/15 08:39 -0500, Daniel J Walsh wrote:



> Vincent, what do you think about MountFlags=slave?


'slave' sounds like the correct subtree mount. We were targeting
'MountFlags' to make use of unsharing the mount namespace.

vb




Re: [systemd-devel] Docker vs PrivateTmp

2015-01-19 Thread Lars Kellogg-Stedman
On Sat, Jan 17, 2015 at 11:02:01PM -0500, Lars Kellogg-Stedman wrote:
> The TL;DR is that restarting a service with PrivateTmp=true appears to
> preserve references to any mounts in the parent mount namespace that
> were active at the time the service was started.  If these mounts are
> later unmounted in the parent namespace, the reference persists in the
> child mount namespace, which means among other things that the
> mountpoint cannot be deleted (Device or resource busy)...

While I think we've probably identified the solution, I'm still trying
to understand how we get into this situation in the first place.

With neither `MountFlags` nor `PrivateTmp` specified in my docker.service,
starting a container results in the following mount visible in the global mount
namespace:

global# grep /mnt /proc/self/mountinfo
685 433 253:22 / /var/lib/docker/devicemapper/mnt/297bf7ae64bd5cf552b45b098b22df85a49deeadb2d71b330e2f866dac95a448 rw,relatime - ext4 /dev/mapper/docker-253:6-98310-297bf7ae64bd5cf552b45b098b22df85a49deeadb2d71b330e2f866dac95a448 rw,context=system_u:object_r:svirt_sandbox_file_t:s0:c138,c268,discard,stripe=16,data=ordered

If I create a new mount namespace (as a child of the global namespace) with
`unshare -m`, I can as expected see the same mount:

unshare# grep /mnt /proc/self/mountinfo
805 804 253:22 / /var/lib/docker/devicemapper/mnt/297bf7ae64bd5cf552b45b098b22df85a49deeadb2d71b330e2f866dac95a448 rw,relatime - ext4 /dev/mapper/docker-253:6-98310-297bf7ae64bd5cf552b45b098b22df85a49deeadb2d71b330e2f866dac95a448 rw,context=system_u:object_r:svirt_sandbox_file_t:s0:c138,c268,discard,stripe=16,data=ordered

If I attempt to stop that container, the mount disappears from the global
namespace:

global# grep /mnt /proc/self/mountinfo
global#

But is still visible in the mount namespace I created with unshare:

unshare# grep /mnt /proc/self/mountinfo
805 804 253:22 / /var/lib/docker/devicemapper/mnt/297bf7ae64bd5cf552b45b098b22df85a49deeadb2d71b330e2f866dac95a448 rw,relatime - ext4 /dev/mapper/docker-253:6-98310-297bf7ae64bd5cf552b45b098b22df85a49deeadb2d71b330e2f866dac95a448 rw,context=system_u:object_r:svirt_sandbox_file_t:s0:c138,c268,discard,stripe=16,data=ordered

What is causing this behavior? I have tried to replicate it by hand through a
combination of mount and unshare, and the only way I can get a mount to persist
in the unshare namespace after being unmounted in the global namespace is by
explicitly calling `mount --make-rprivate /` *inside* the unshare namespace,
which is obviously not happening in the above Docker example.

Thanks,

-- 
Lars Kellogg-Stedman l...@redhat.com | larsks @ {freenode,twitter,github}
Cloud Engineering / OpenStack  | http://blog.oddbit.com/





Re: [systemd-devel] Docker vs PrivateTmp

2015-01-19 Thread Andrei Borzenkov
On Mon, 19 Jan 2015 11:33:42 -0500
Lars Kellogg-Stedman l...@redhat.com wrote:

> What is causing this behavior? I have tried to replicate it by hand through a
> combination of mount and unshare, and the only way I can get a mount to persist
> in the unshare namespace after being unmounted in the global namespace is by
> explicitly calling `mount --make-rprivate /` *inside* the unshare namespace,
> which is obviously not happening in the above Docker example.
 

It obviously happens. Your mount is private (it does not have any of the
shared/master/.. flags). Maybe docker does it?




Re: [systemd-devel] Docker vs PrivateTmp

2015-01-18 Thread Lars Kellogg-Stedman
On Sun, Jan 18, 2015 at 08:50:35PM -0500, Colin Walters wrote:
> On Sat, Jan 17, 2015, at 11:02 PM, Lars Kellogg-Stedman wrote:
>> Hello all,
>>
>> With systemd 216 on Fedora 21 (kernel 3.17.8), I have run into an odd
>> behavior concerning the PrivateTmp directive, and I am looking for
>> help identifying this as:
>>
>> - Everything Is Working As Designed, Citizen
>> - A bug in Docker (some mount flag is being set incorrectly?)
>
> This should be fixed by:
> http://pkgs.fedoraproject.org/cgit/docker-io.git/commit/?id=6c9e373ee06cb1aee07d3cae426c46002663010d
>
> i.e. having docker.service use MountFlags=private, so its mounts
> aren't visible to other processes.

Colin,

Thanks for the pointer.

It seems as if using MountFlags=private is going to cause a new set of
problems:

Imagine that I am a system administrator using Docker to containerize
services.  I want to set up a webserver container on my Docker
host, so I mount the web content from a remote server:

mount my-fancy-server:/vol/content /content

And then expose that as a Docker volume:

docker run -v /content:/content webserver

This will fail mysteriously, because with MountFlags=private, the
mount of my-fancy-server:/vol/content on /content won't be visible to
Docker containers.  I will spend fruitless hours trying to figure out
why such a seemingly simple operation is failing.

I think we actually want MountFlags=slave, which will permit mounts
from the global namespace to propagate into the service namespace
without permitting propagation in the other direction.  It seems like
this would be the Least Surprising behavior.

-- 
Lars Kellogg-Stedman l...@redhat.com | larsks @ {freenode,twitter,github}
Cloud Engineering / OpenStack  | http://blog.oddbit.com/





Re: [systemd-devel] Docker vs PrivateTmp

2015-01-18 Thread Colin Walters
On Sat, Jan 17, 2015, at 11:02 PM, Lars Kellogg-Stedman wrote:
> Hello all,
>
> With systemd 216 on Fedora 21 (kernel 3.17.8), I have run into an odd
> behavior concerning the PrivateTmp directive, and I am looking for
> help identifying this as:
>
> - Everything Is Working As Designed, Citizen
> - A bug in Docker (some mount flag is being set incorrectly?)

This should be fixed by:
http://pkgs.fedoraproject.org/cgit/docker-io.git/commit/?id=6c9e373ee06cb1aee07d3cae426c46002663010d

i.e. having docker.service use MountFlags=private, so its mounts
aren't visible to other processes.


Re: [systemd-devel] Docker vs PrivateTmp

2015-01-18 Thread Lars Kellogg-Stedman
On Sun, Jan 18, 2015 at 11:38:12PM -0500, Lars Kellogg-Stedman wrote:
> I think we actually want MountFlags=slave, which will permit mounts
> from the global namespace to propagate into the service namespace
> without permitting propagation in the other direction.  It seems like
> this would be the Least Surprising behavior.

...which would be the default if docker.service were itself using
PrivateTmp=true, because from systemd.exec:

Note that the file system namespace related options (PrivateTmp=,
PrivateDevices=, ProtectSystem=, ProtectHome=, ReadOnlyDirectories=,
InaccessibleDirectories= and ReadWriteDirectories=) require that mount
and unmount propagation from the unit's file system namespace is
disabled, and hence downgrade shared to slave.

So either explicitly setting MountFlags=slave, or setting
PrivateTmp=true if that doesn't cause any issues of which I am not
aware.

-- 
Lars Kellogg-Stedman l...@redhat.com | larsks @ {freenode,twitter,github}
Cloud Engineering / OpenStack  | http://blog.oddbit.com/





Re: [systemd-devel] Docker vs PrivateTmp

2015-01-18 Thread Lokesh Mandvekar
On Sun, Jan 18, 2015 at 11:38:12PM -0500, Lars Kellogg-Stedman wrote:
 On Sun, Jan 18, 2015 at 08:50:35PM -0500, Colin Walters wrote:
  On Sat, Jan 17, 2015, at 11:02 PM, Lars Kellogg-Stedman wrote:
   Hello all,
   
   With systemd 216 on Fedora 21 (kernel 3.17.8), I have run into an odd
   behavior concerning the PrivateTmp directive, and I am looking for
   help identifying this as:
   
   - Everything Is Working As Designed, Citizen
   - A bug in Docker (some mount flag is being set incorrectly?)
  
  This should be fixed by:
  http://pkgs.fedoraproject.org/cgit/docker-io.git/commit/?id=6c9e373ee06cb1aee07d3cae426c46002663010d
  
  i.e. having docker.service use MountFlags=private, so its mounts
  aren't visible to other processes.
 
 Colin,
 
 Thanks for the pointer.
 
 It seems as if using MountFlags=private is going to cause a new set of
 problems:
 
 Imagine that I am a system administrator using Docker to containerize
 services.  I want to set up a webserver container on my Docker
 host, so I mount the web content from a remote server:
 
 mount my-fancy-server:/vol/content /content
 
 And then expose that as a Docker volume:
 
 docker run -v /content:/content webserver
 
 This will fail mysteriously, because with MountFlags=private, the
 mount of my-fancy-server:/vol/content on /content won't be visible to
 Docker containers.  I will spend fruitless hours trying to figure out
 why such a seemingly simple operation is failing.
 
 I think we actually want MountFlags=slave, which will permit mounts
 from the global namespace to propagate into the service namespace
 without permitting propagation in the other direction.  It seems like
 this would be the Least Surprising behavior.

Copying dwalsh


-- 
Lokesh
Freenode, OFTC: lsm5
GPG: 0xC7C3A0DD




[systemd-devel] Docker vs PrivateTmp

2015-01-17 Thread Lars Kellogg-Stedman
Hello all,

With systemd 216 on Fedora 21 (kernel 3.17.8), I have run into an odd
behavior concerning the PrivateTmp directive, and I am looking for
help identifying this as:

- Everything Is Working As Designed, Citizen
- A bug in Docker (some mount flag is being set incorrectly?)
- A bug in systemd's PrivateTmp behavior
- Something Completely Different

The TL;DR is that restarting a service with PrivateTmp=true appears to
preserve references to any mounts in the parent mount namespace that
were active at the time the service was started.  If these mounts are
later unmounted in the parent namespace, the reference persists in the
child mount namespace, which means among other things that the
mountpoint cannot be deleted (Device or resource busy).

This seems to be approximately the same issue described in
https://bugzilla.redhat.com/show_bug.cgi?id=851970, but that bug is
two years old and closed.

Here's how I encountered the problem:

Assuming that your Docker is configured to use the `devicemapper`
storage driver, start a Docker container.  Any container will do, e.g.:

# cid=$(docker run -d larsks/thttpd)

See the `devicemapper` mountpoint created by Docker for the container:

# grep devicemapper/mnt /proc/mounts

/dev/mapper/docker-253:6-98310-e68df3f45d6151259ce84a0e467a3117840084e99ef3bbc654b33f08d2d6dd62 /var/lib/docker/devicemapper/mnt/e68df3f45d6151259ce84a0e467a3117840084e99ef3bbc654b33f08d2d6dd62 ext4 rw,context=system_u:object_r:svirt_sandbox_file_t:s0:c261,c1018,relatime,discard,stripe=16,data=ordered 0 0

Now restart a service -- any service! -- that has PrivateTmp=true:

# systemctl restart systemd-machined

Get the PID for that service:

# systemctl status systemd-machined | grep PID
 Main PID: 18698 (systemd-machine

And see that the Docker devicemapper mount is visible inside the
mount namespace for this process:

# grep devicemapper/mnt /proc/18698/mounts

/dev/mapper/docker-253:6-98310-e68df3f45d6151259ce84a0e467a3117840084e99ef3bbc654b33f08d2d6dd62 /var/lib/docker/devicemapper/mnt/e68df3f45d6151259ce84a0e467a3117840084e99ef3bbc654b33f08d2d6dd62 ext4 rw,context=system_u:object_r:svirt_sandbox_file_t:s0:c261,c1018,relatime,discard,stripe=16,data=ordered 0 0

Attempt to destroy the container:

# docker rm -f $cid

Watch Docker fail to destroy the container because it is unable to remove the 
mountpoint directory:

Jan 17 22:43:03 pk115wp-lkellogg docker-1.4.1-dev[18239]:
time=2015-01-17T22:43:03-05:00 level=error msg=Handler for DELETE
/containers/{name:.*} returned error: Cannot destroy container e68df3f45d61:
Driver devicemapper failed to remove root filesystem
e68df3f45d6151259ce84a0e467a3117840084e99ef3bbc654b33f08d2d6dd62: Device is
Busy

Because while that mount is gone from the global namespace:

# grep devicemapper/mnt /proc/mounts

It still exists inside the mount namespace for the service we restarted:

# grep devicemapper/mnt /proc/18698/mounts

/dev/mapper/docker-253:6-98310-e68df3f45d6151259ce84a0e467a3117840084e99ef3bbc654b33f08d2d6dd62 /var/lib/docker/devicemapper/mnt/e68df3f45d6151259ce84a0e467a3117840084e99ef3bbc654b33f08d2d6dd62 ext4 rw,context=system_u:object_r:svirt_sandbox_file_t:s0:c261,c1018,relatime,discard,stripe=16,data=ordered 0 0
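
One way to check how such a mount is set up to propagate is to read the
optional fields in /proc/PID/mountinfo, which sit between the mount
options and the lone "-" separator. This is only a sketch: the function
name mount_propagation is hypothetical, and it prints "private" when a
mount carries no propagation tags at all:

```shell
# Print the propagation tags of every mount whose mount point contains
# the substring $1. $2 is the mountinfo file to read
# (assumed default: /proc/self/mountinfo).
mount_propagation() {
    awk -v pat="$1" '
        index($5, pat) {
            tags = ""
            # Optional fields start at field 7 and end at the lone "-".
            for (i = 7; i <= NF && $i != "-"; i++)
                tags = tags (tags == "" ? "" : " ") $i
            print $5, (tags == "" ? "private" : tags)
        }
    ' "${2:-/proc/self/mountinfo}"
}
```

For the service above that would be something like
"mount_propagation devicemapper/mnt /proc/18698/mountinfo". A shared:N
tag means the mount is shared, master:N means it is a slave, and no tag
means it is private.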

The only solution is to restart the service holding these references:

   # systemctl restart systemd-machined

Now the mountpoint can be deleted.
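
If you don't already know which service is holding the reference, a
rough way to find out is to look for the mount in every process's
mount table. A sketch, with a made-up function name (mount_holders)
and an optional second argument for the proc root that exists only to
make it easy to test:

```shell
# List PIDs whose mount table still contains the substring $1.
# $2 is the proc root (assumed default: /proc); entries that cannot be
# read are silently skipped.
mount_holders() {
    for m in "${2:-/proc}"/[0-9]*/mounts; do
        if grep -qF -- "$1" "$m" 2>/dev/null; then
            pid=${m%/mounts}
            echo "${pid##*/}"
        fi
    done
}
```

Run as root, "mount_holders devicemapper/mnt" prints one PID per line;
"systemctl status PID" should then show which unit each one belongs to.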

Thanks,

-- 
Lars Kellogg-Stedman l...@redhat.com | larsks @ {freenode,twitter,github}
Cloud Engineering / OpenStack  | http://blog.oddbit.com/


