Re: [systemd-devel] Docker vs PrivateTmp
On Fri, 30.01.15 11:02, Alexander Larsson (al...@redhat.com) wrote:

> I think the problem is that the docker daemon makes /var/lib/docker/devicemapper private in the host namespace to handle some scalability issues we found in the kernel. This causes problems not with docker containers (because they unmount all other mounts as per the above), but with other namespace-using apps. For instance, if a service with PrivateTmp is launched, it will inherit the existing mounts in /var/lib/docker/devicemapper at the point of startup, but when these are eventually unmounted in the host namespace this is not propagated into the service (due to it being a private mount, not a slave mount).
>
> We could try making this slave instead, but I don't know if that then fixes the scalability issues we had, because they were related to stupidities in the kernel wrt propagating mounts. If it doesn't work, then we have to put docker-daemon in its own namespace.

The daemon should first create its own namespace, and then detach propagation, not the other way round.

This really isn't stupidity in the kernel, but in docker's userspace...

Lennart

--
Lennart Poettering, Red Hat
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel
Re: [systemd-devel] Docker vs PrivateTmp
On Mon, 2015-02-02 at 12:12 +0100, Lennart Poettering wrote:
> On Fri, 30.01.15 11:02, Alexander Larsson (al...@redhat.com) wrote:
>
> > I think the problem is that the docker daemon makes /var/lib/docker/devicemapper private in the host namespace to handle some scalability issues we found in the kernel. This causes problems not with docker containers (because they unmount all other mounts as per the above), but with other namespace-using apps. For instance, if a service with PrivateTmp is launched, it will inherit the existing mounts in /var/lib/docker/devicemapper at the point of startup, but when these are eventually unmounted in the host namespace this is not propagated into the service (due to it being a private mount, not a slave mount).
> >
> > We could try making this slave instead, but I don't know if that then fixes the scalability issues we had, because they were related to stupidities in the kernel wrt propagating mounts. If it doesn't work, then we have to put docker-daemon in its own namespace.
>
> The daemon should first create its own namespace, and then detach propagation, not the other way round.
>
> This really isn't stupidity in the kernel, but in docker's userspace...

The stupidity was the O(n^4) algorithm in the kernel when it was duplicating all vfsmounts that could possibly be propagated, and then immediately freeing them when they did not propagate, which interacted poorly with some lame kernel O(n^2) allocator behaviour.

--
Alexander Larsson, Red Hat, Inc
al...@redhat.com alexander.lars...@gmail.com
Re: [systemd-devel] Docker vs PrivateTmp
On Fri, 2015-01-23 at 11:31 -0500, Daniel J Walsh wrote:
> On 01/22/2015 10:02 PM, Lennart Poettering wrote:
> > On Sat, 17.01.15 23:02, Lars Kellogg-Stedman (l...@redhat.com) wrote:
> >
> > > See the `devicemapper` mountpoint created by Docker for the container:
> > >
> > >     # grep devicemapper/mnt /proc/mounts
> > >     /dev/mapper/docker-253:6-98310-e68df3f45d6151259ce84a0e467a3117840084e99ef3bbc654b33f08d2d6dd62 /var/lib/docker/devicemapper/mnt/e68df3f45d6151259ce84a0e467a3117840084e99ef3bbc654b33f08d2d6dd62 ext4 rw,context=system_u:object_r:svirt_sandbox_file_t:s0:c261,c1018,relatime,discard,stripe=16,data=ordered 0 0
> >
> > I am not sure why docker makes these mounts visible in the host namespace at all. This smells like a bug.

They need to at least be visible to the docker daemon, because it needs to look into them to do diffs between images when e.g. committing. They don't necessarily have to be in the host namespace though; they could be in a different namespace owned only by the docker daemon. I wanted to do that, but for reasons that escape me at the moment that was problematic and I never got to it.

> > > Watch Docker fail to destroy the container because it is unable to remove the mountpoint directory:
> > >
> > >     Jan 17 22:43:03 pk115wp-lkellogg docker-1.4.1-dev[18239]: time=2015-01-17T22:43:03-05:00 level=error msg=Handler for DELETE /containers/{name:.*} returned error: Cannot destroy container e68df3f45d61: Driver devicemapper failed to remove root filesystem e68df3f45d6151259ce84a0e467a3117840084e99ef3bbc654b33f08d2d6dd62: Device is Busy
> >
> > This smells as if Docker incorrectly sets the mount propagation bits on its own mounts. It would be good to check /proc/self/mountinfo inside and outside of docker's own namespace, and to check how the propagation bits are set for the individual mounts. It's a bit hard to read, but the interesting bits are in the 7th column of that file.
> >
> > In general: docker should do the equivalent of mount --make-rslave / as the first thing after opening its mount namespace, so that from that point on mounts and especially *un*mounts propagate from the host into the container, but not vice versa. If they do not invoke that, then the propagation will stay at shared, which means the mounts will appear in the host and vice versa, which is certainly undesired. Also, they should not use mount --make-rprivate /, as that means anything the host mounted will stay mounted in the container forever, which is a problem. Also, they really need to make this recursive, so that all mount points they have access to are detached from the host!

It has been a while since I looked at this, but I believe that the docker containers run as MS_PRIVATE, and they explicitly unmount all the host filesystems except the ones specifically mounted in as volumes.

I think the problem is that the docker daemon makes /var/lib/docker/devicemapper private in the host namespace to handle some scalability issues we found in the kernel. This causes problems not with docker containers (because they unmount all other mounts as per the above), but with other namespace-using apps. For instance, if a service with PrivateTmp is launched, it will inherit the existing mounts in /var/lib/docker/devicemapper at the point of startup, but when these are eventually unmounted in the host namespace this is not propagated into the service (due to it being a private mount, not a slave mount).

We could try making this slave instead, but I don't know if that then fixes the scalability issues we had, because they were related to stupidities in the kernel wrt propagating mounts. If it doesn't work, then we have to put docker-daemon in its own namespace.

--
Alexander Larsson, Red Hat, Inc
al...@redhat.com alexander.lars...@gmail.com
Re: [systemd-devel] Docker vs PrivateTmp
On Fri, 23.01.15 11:31, Daniel J Walsh (dwa...@redhat.com) wrote:

You just sent a full quote without any comment of yours?

> On 01/22/2015 10:02 PM, Lennart Poettering wrote:
> > On Sat, 17.01.15 23:02, Lars Kellogg-Stedman (l...@redhat.com) wrote:
> >
> > > See the `devicemapper` mountpoint created by Docker for the container:
> > >
> > >     # grep devicemapper/mnt /proc/mounts
> > >     /dev/mapper/docker-253:6-98310-e68df3f45d6151259ce84a0e467a3117840084e99ef3bbc654b33f08d2d6dd62 /var/lib/docker/devicemapper/mnt/e68df3f45d6151259ce84a0e467a3117840084e99ef3bbc654b33f08d2d6dd62 ext4 rw,context=system_u:object_r:svirt_sandbox_file_t:s0:c261,c1018,relatime,discard,stripe=16,data=ordered 0 0
> >
> > I am not sure why docker makes these mounts visible in the host namespace at all. This smells like a bug.
> >
> > > Watch Docker fail to destroy the container because it is unable to remove the mountpoint directory:
> > >
> > >     Jan 17 22:43:03 pk115wp-lkellogg docker-1.4.1-dev[18239]: time=2015-01-17T22:43:03-05:00 level=error msg=Handler for DELETE /containers/{name:.*} returned error: Cannot destroy container e68df3f45d61: Driver devicemapper failed to remove root filesystem e68df3f45d6151259ce84a0e467a3117840084e99ef3bbc654b33f08d2d6dd62: Device is Busy
> >
> > This smells as if Docker incorrectly sets the mount propagation bits on its own mounts. It would be good to check /proc/self/mountinfo inside and outside of docker's own namespace, and to check how the propagation bits are set for the individual mounts. It's a bit hard to read, but the interesting bits are in the 7th column of that file.
> >
> > In general: docker should do the equivalent of mount --make-rslave / as the first thing after opening its mount namespace, so that from that point on mounts and especially *un*mounts propagate from the host into the container, but not vice versa. If they do not invoke that, then the propagation will stay at shared, which means the mounts will appear in the host and vice versa, which is certainly undesired. Also, they should not use mount --make-rprivate /, as that means anything the host mounted will stay mounted in the container forever, which is a problem. Also, they really need to make this recursive, so that all mount points they have access to are detached from the host!
> >
> > Lennart

Lennart

--
Lennart Poettering, Red Hat
Re: [systemd-devel] Docker vs PrivateTmp
Yes, I was trying to get a comment from Alex, since he did the original patch.

On 01/23/2015 12:26 PM, Lennart Poettering wrote:
> On Fri, 23.01.15 11:31, Daniel J Walsh (dwa...@redhat.com) wrote:
>
> You just sent a full quote without any comment of yours?
>
> > On 01/22/2015 10:02 PM, Lennart Poettering wrote:
> > > On Sat, 17.01.15 23:02, Lars Kellogg-Stedman (l...@redhat.com) wrote:
> > >
> > > > See the `devicemapper` mountpoint created by Docker for the container:
> > > >
> > > >     # grep devicemapper/mnt /proc/mounts
> > > >     /dev/mapper/docker-253:6-98310-e68df3f45d6151259ce84a0e467a3117840084e99ef3bbc654b33f08d2d6dd62 /var/lib/docker/devicemapper/mnt/e68df3f45d6151259ce84a0e467a3117840084e99ef3bbc654b33f08d2d6dd62 ext4 rw,context=system_u:object_r:svirt_sandbox_file_t:s0:c261,c1018,relatime,discard,stripe=16,data=ordered 0 0
> > >
> > > I am not sure why docker makes these mounts visible in the host namespace at all. This smells like a bug.
> > >
> > > > Watch Docker fail to destroy the container because it is unable to remove the mountpoint directory:
> > > >
> > > >     Jan 17 22:43:03 pk115wp-lkellogg docker-1.4.1-dev[18239]: time=2015-01-17T22:43:03-05:00 level=error msg=Handler for DELETE /containers/{name:.*} returned error: Cannot destroy container e68df3f45d61: Driver devicemapper failed to remove root filesystem e68df3f45d6151259ce84a0e467a3117840084e99ef3bbc654b33f08d2d6dd62: Device is Busy
> > >
> > > This smells as if Docker incorrectly sets the mount propagation bits on its own mounts. It would be good to check /proc/self/mountinfo inside and outside of docker's own namespace, and to check how the propagation bits are set for the individual mounts. It's a bit hard to read, but the interesting bits are in the 7th column of that file.
> > >
> > > In general: docker should do the equivalent of mount --make-rslave / as the first thing after opening its mount namespace, so that from that point on mounts and especially *un*mounts propagate from the host into the container, but not vice versa. If they do not invoke that, then the propagation will stay at shared, which means the mounts will appear in the host and vice versa, which is certainly undesired. Also, they should not use mount --make-rprivate /, as that means anything the host mounted will stay mounted in the container forever, which is a problem. Also, they really need to make this recursive, so that all mount points they have access to are detached from the host!
> > >
> > > Lennart
>
> Lennart
Re: [systemd-devel] Docker vs PrivateTmp
On 01/22/2015 10:02 PM, Lennart Poettering wrote:
> On Sat, 17.01.15 23:02, Lars Kellogg-Stedman (l...@redhat.com) wrote:
>
> > See the `devicemapper` mountpoint created by Docker for the container:
> >
> >     # grep devicemapper/mnt /proc/mounts
> >     /dev/mapper/docker-253:6-98310-e68df3f45d6151259ce84a0e467a3117840084e99ef3bbc654b33f08d2d6dd62 /var/lib/docker/devicemapper/mnt/e68df3f45d6151259ce84a0e467a3117840084e99ef3bbc654b33f08d2d6dd62 ext4 rw,context=system_u:object_r:svirt_sandbox_file_t:s0:c261,c1018,relatime,discard,stripe=16,data=ordered 0 0
>
> I am not sure why docker makes these mounts visible in the host namespace at all. This smells like a bug.
>
> > Watch Docker fail to destroy the container because it is unable to remove the mountpoint directory:
> >
> >     Jan 17 22:43:03 pk115wp-lkellogg docker-1.4.1-dev[18239]: time=2015-01-17T22:43:03-05:00 level=error msg=Handler for DELETE /containers/{name:.*} returned error: Cannot destroy container e68df3f45d61: Driver devicemapper failed to remove root filesystem e68df3f45d6151259ce84a0e467a3117840084e99ef3bbc654b33f08d2d6dd62: Device is Busy
>
> This smells as if Docker incorrectly sets the mount propagation bits on its own mounts. It would be good to check /proc/self/mountinfo inside and outside of docker's own namespace, and to check how the propagation bits are set for the individual mounts. It's a bit hard to read, but the interesting bits are in the 7th column of that file.
>
> In general: docker should do the equivalent of mount --make-rslave / as the first thing after opening its mount namespace, so that from that point on mounts and especially *un*mounts propagate from the host into the container, but not vice versa. If they do not invoke that, then the propagation will stay at shared, which means the mounts will appear in the host and vice versa, which is certainly undesired. Also, they should not use mount --make-rprivate /, as that means anything the host mounted will stay mounted in the container forever, which is a problem. Also, they really need to make this recursive, so that all mount points they have access to are detached from the host!
>
> Lennart
Re: [systemd-devel] Docker vs PrivateTmp
On Sun, 18.01.15 20:50, Colin Walters (walt...@verbum.org) wrote:

> On Sat, Jan 17, 2015, at 11:02 PM, Lars Kellogg-Stedman wrote:
> > Hello all,
> >
> > With systemd 216 on Fedora 21 (kernel 3.17.8), I have run into an odd behavior concerning the PrivateTmp directive, and I am looking for help identifying this as:
> >
> > - Everything Is Working As Designed, Citizen
> > - A bug in Docker (some mount flag is being set incorrectly?)
>
> This should be fixed by:
> http://pkgs.fedoraproject.org/cgit/docker-io.git/commit/?id=6c9e373ee06cb1aee07d3cae426c46002663010d
> i.e. having docker.service use MountFlags=private, so its mounts aren't visible to other processes.

MountFlags=private also disables *un*mount propagation from the host into the service, which means file systems once mounted in the host when a service was started will stay mounted forever in the service, which will keep the backing device busy forever. MountFlags=private is hence pretty useless in real life. Never use it.

MountFlags=shared is also pointless, since it is the implied default.

Which means MountFlags=slave is really the only option that makes sense to ever add to a unit file.

Lennart

--
Lennart Poettering, Red Hat
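The MountFlags=slave recommendation above could be applied to a unit through a drop-in file. The following is only a minimal sketch: the helper name `write_mountflags_dropin` and the file name `mountflags.conf` are made up for illustration, and whether the setting is right for docker.service depends on the propagation issues discussed elsewhere in this thread.

```shell
#!/bin/sh
# write_mountflags_dropin UNIT [DROPIN_ROOT]
# Hypothetical helper: writes a drop-in that sets MountFlags=slave for UNIT.
# DROPIN_ROOT defaults to /etc/systemd/system, where systemd reads
# administrator drop-ins from <unit>.d/*.conf directories.
write_mountflags_dropin() {
    unit=$1
    root=${2:-/etc/systemd/system}
    dir="$root/$unit.d"

    mkdir -p "$dir" || return 1
    # "mountflags.conf" is an arbitrary name chosen for this sketch.
    printf '[Service]\nMountFlags=slave\n' > "$dir/mountflags.conf"
}
```

After something like `write_mountflags_dropin docker.service`, a `systemctl daemon-reload` followed by a restart of the unit would be needed before the new setting applies.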
Re: [systemd-devel] Docker vs PrivateTmp
On Sat, 17.01.15 23:02, Lars Kellogg-Stedman (l...@redhat.com) wrote:

> See the `devicemapper` mountpoint created by Docker for the container:
>
>     # grep devicemapper/mnt /proc/mounts
>     /dev/mapper/docker-253:6-98310-e68df3f45d6151259ce84a0e467a3117840084e99ef3bbc654b33f08d2d6dd62 /var/lib/docker/devicemapper/mnt/e68df3f45d6151259ce84a0e467a3117840084e99ef3bbc654b33f08d2d6dd62 ext4 rw,context=system_u:object_r:svirt_sandbox_file_t:s0:c261,c1018,relatime,discard,stripe=16,data=ordered 0 0

I am not sure why docker makes these mounts visible in the host namespace at all. This smells like a bug.

> Watch Docker fail to destroy the container because it is unable to remove the mountpoint directory:
>
>     Jan 17 22:43:03 pk115wp-lkellogg docker-1.4.1-dev[18239]: time=2015-01-17T22:43:03-05:00 level=error msg=Handler for DELETE /containers/{name:.*} returned error: Cannot destroy container e68df3f45d61: Driver devicemapper failed to remove root filesystem e68df3f45d6151259ce84a0e467a3117840084e99ef3bbc654b33f08d2d6dd62: Device is Busy

This smells as if Docker incorrectly sets the mount propagation bits on its own mounts. It would be good to check /proc/self/mountinfo inside and outside of docker's own namespace, and to check how the propagation bits are set for the individual mounts. It's a bit hard to read, but the interesting bits are in the 7th column of that file.

In general: docker should do the equivalent of mount --make-rslave / as the first thing after opening its mount namespace, so that from that point on mounts and especially *un*mounts propagate from the host into the container, but not vice versa. If they do not invoke that, then the propagation will stay at shared, which means the mounts will appear in the host and vice versa, which is certainly undesired. Also, they should not use mount --make-rprivate /, as that means anything the host mounted will stay mounted in the container forever, which is a problem. Also, they really need to make this recursive, so that all mount points they have access to are detached from the host!

Lennart

--
Lennart Poettering, Red Hat
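The suggestion above to read the propagation bits out of /proc/self/mountinfo can be sketched as a small filter. This is an illustration only, with `mount_propagation` being a made-up name: it relies on the documented mountinfo layout, where the optional fields start at field 7 and run up to a lone "-" separator, and where `shared:N` marks a shared mount and `master:N` a slave mount. A mount with neither tag is private.

```shell
#!/bin/sh
# mount_propagation: read mountinfo lines on stdin and print
# "<mount point> <propagation>" for each of them.
# The function name is hypothetical; the field layout is that of
# /proc/<pid>/mountinfo (optional fields from field 7 up to "-").
mount_propagation() {
    awk '{
        prop = ""
        for (i = 7; i <= NF && $i != "-"; i++) {
            tag = ""
            if ($i ~ /^shared:/)         tag = "shared"
            else if ($i ~ /^master:/)    tag = "slave"
            else if ($i == "unbindable") tag = "unbindable"
            if (tag != "") prop = (prop == "" ? tag : prop "," tag)
        }
        # No propagation tag at all means the mount is private.
        if (prop == "") prop = "private"
        print $5, prop
    }'
}
```

It could be run as `mount_propagation < /proc/self/mountinfo` on the host and again against `/proc/<pid>/mountinfo` of a process inside the namespace of interest, to compare how the same mount point is flagged on each side.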
Re: [systemd-devel] Docker vs PrivateTmp
On 01/19/2015 12:27 AM, Lars Kellogg-Stedman wrote:
> On Sun, Jan 18, 2015 at 11:38:12PM -0500, Lars Kellogg-Stedman wrote:
> > I think we actually want MountFlags=slave, which will permit mounts from the global namespace to propagate into the service namespace without permitting propagation in the other direction. It seems like this would be the Least Surprising behavior.
>
> ...which would be the default if docker.service were itself using PrivateTmp=true, because from systemd.exec:
>
>     Note that the file system namespace related options (PrivateTmp=, PrivateDevices=, ProtectSystem=, ProtectHome=, ReadOnlyDirectories=, InaccessibleDirectories= and ReadWriteDirectories=) require that mount and unmount propagation from the unit's file system namespace is disabled, and hence downgrade shared to slave.
>
> So either explicitly setting MountFlags=slave, or setting PrivateTmp=true if that doesn't cause any issues of which I am not aware.

Vincent, what do you think about MountFlags=slave?
Re: [systemd-devel] Docker vs PrivateTmp
On 19/01/15 08:39 -0500, Daniel J Walsh wrote:
> On 01/19/2015 12:27 AM, Lars Kellogg-Stedman wrote:
> > On Sun, Jan 18, 2015 at 11:38:12PM -0500, Lars Kellogg-Stedman wrote:
> > > I think we actually want MountFlags=slave, which will permit mounts from the global namespace to propagate into the service namespace without permitting propagation in the other direction. It seems like this would be the Least Surprising behavior.
> >
> > ...which would be the default if docker.service were itself using PrivateTmp=true, because from systemd.exec:
> >
> >     Note that the file system namespace related options (PrivateTmp=, PrivateDevices=, ProtectSystem=, ProtectHome=, ReadOnlyDirectories=, InaccessibleDirectories= and ReadWriteDirectories=) require that mount and unmount propagation from the unit's file system namespace is disabled, and hence downgrade shared to slave.
> >
> > So either explicitly setting MountFlags=slave, or setting PrivateTmp=true if that doesn't cause any issues of which I am not aware.
>
> Vincent what do you think about MountFlags=slave?

'slave' sounds like the correct subtree mount. We were targeting 'MountFlags' to make use of unsharing the mount namespace.

vb
Re: [systemd-devel] Docker vs PrivateTmp
On Sat, Jan 17, 2015 at 11:02:01PM -0500, Lars Kellogg-Stedman wrote:
> The TL;DR is that restarting a service with PrivateTmp=true appears to preserve references to any mounts in the parent mount namespace that were active at the time the service was started. If these mounts are later unmounted in the parent namespace, the reference persists in the child mount namespace, which means among other things that the mountpoint cannot be deleted (Device or resource busy)...

While I think we've probably identified the solution, I'm still trying to understand how we get into this situation in the first place.

With neither `MountFlags` nor `PrivateTmp` specified in my docker.service, starting a container results in the following mount visible in the global mount namespace:

    global# grep /mnt /proc/self/mountinfo
    685 433 253:22 / /var/lib/docker/devicemapper/mnt/297bf7ae64bd5cf552b45b098b22df85a49deeadb2d71b330e2f866dac95a448 rw,relatime - ext4 /dev/mapper/docker-253:6-98310-297bf7ae64bd5cf552b45b098b22df85a49deeadb2d71b330e2f866dac95a448 rw,context=system_u:object_r:svirt_sandbox_file_t:s0:c138,c268,discard,stripe=16,data=ordered

If I create a new mount namespace (as a child of the global namespace) with `unshare -m`, I can as expected see the same mount:

    unshare# grep /mnt /proc/self/mountinfo
    805 804 253:22 / /var/lib/docker/devicemapper/mnt/297bf7ae64bd5cf552b45b098b22df85a49deeadb2d71b330e2f866dac95a448 rw,relatime - ext4 /dev/mapper/docker-253:6-98310-297bf7ae64bd5cf552b45b098b22df85a49deeadb2d71b330e2f866dac95a448 rw,context=system_u:object_r:svirt_sandbox_file_t:s0:c138,c268,discard,stripe=16,data=ordered

If I attempt to stop that container, the mount disappears from the global namespace:

    global# grep /mnt /proc/self/mountinfo
    global#

But is still visible in the mount namespace I created with unshare:

    unshare# grep /mnt /proc/self/mountinfo
    805 804 253:22 / /var/lib/docker/devicemapper/mnt/297bf7ae64bd5cf552b45b098b22df85a49deeadb2d71b330e2f866dac95a448 rw,relatime - ext4 /dev/mapper/docker-253:6-98310-297bf7ae64bd5cf552b45b098b22df85a49deeadb2d71b330e2f866dac95a448 rw,context=system_u:object_r:svirt_sandbox_file_t:s0:c138,c268,discard,stripe=16,data=ordered

What is causing this behavior? I have tried to replicate it by hand through a combination of mount and unshare, and the only way I can get a mount to persist in the unshare namespace after being unmounted in the global namespace is by explicitly calling `mount --make-rprivate /` *inside* the unshare namespace, which is obviously not happening in the above Docker example.

Thanks,

--
Lars Kellogg-Stedman l...@redhat.com | larsks @ {freenode,twitter,github}
Cloud Engineering / OpenStack | http://blog.oddbit.com/
Re: [systemd-devel] Docker vs PrivateTmp
On Mon, 19 Jan 2015 11:33:42 -0500, Lars Kellogg-Stedman l...@redhat.com wrote:

> On Sat, Jan 17, 2015 at 11:02:01PM -0500, Lars Kellogg-Stedman wrote:
> > The TL;DR is that restarting a service with PrivateTmp=true appears to preserve references to any mounts in the parent mount namespace that were active at the time the service was started. If these mounts are later unmounted in the parent namespace, the reference persists in the child mount namespace, which means among other things that the mountpoint cannot be deleted (Device or resource busy)...
>
> While I think we've probably identified the solution, I'm still trying to understand how we get into this situation in the first place.
>
> With neither `MountFlags` nor `PrivateTmp` specified in my docker.service, starting a container results in the following mount visible in the global mount namespace:
>
>     global# grep /mnt /proc/self/mountinfo
>     685 433 253:22 / /var/lib/docker/devicemapper/mnt/297bf7ae64bd5cf552b45b098b22df85a49deeadb2d71b330e2f866dac95a448 rw,relatime - ext4 /dev/mapper/docker-253:6-98310-297bf7ae64bd5cf552b45b098b22df85a49deeadb2d71b330e2f866dac95a448 rw,context=system_u:object_r:svirt_sandbox_file_t:s0:c138,c268,discard,stripe=16,data=ordered
>
> If I create a new mount namespace (as a child of the global namespace) with `unshare -m`, I can as expected see the same mount:
>
>     unshare# grep /mnt /proc/self/mountinfo
>     805 804 253:22 / /var/lib/docker/devicemapper/mnt/297bf7ae64bd5cf552b45b098b22df85a49deeadb2d71b330e2f866dac95a448 rw,relatime - ext4 /dev/mapper/docker-253:6-98310-297bf7ae64bd5cf552b45b098b22df85a49deeadb2d71b330e2f866dac95a448 rw,context=system_u:object_r:svirt_sandbox_file_t:s0:c138,c268,discard,stripe=16,data=ordered
>
> If I attempt to stop that container, the mount disappears from the global namespace:
>
>     global# grep /mnt /proc/self/mountinfo
>     global#
>
> But is still visible in the mount namespace I created with unshare:
>
>     unshare# grep /mnt /proc/self/mountinfo
>     805 804 253:22 / /var/lib/docker/devicemapper/mnt/297bf7ae64bd5cf552b45b098b22df85a49deeadb2d71b330e2f866dac95a448 rw,relatime - ext4 /dev/mapper/docker-253:6-98310-297bf7ae64bd5cf552b45b098b22df85a49deeadb2d71b330e2f866dac95a448 rw,context=system_u:object_r:svirt_sandbox_file_t:s0:c138,c268,discard,stripe=16,data=ordered
>
> What is causing this behavior? I have tried to replicate it by hand through a combination of mount and unshare, and the only way I can get a mount to persist in the unshare namespace after being unmounted in the global namespace is by explicitly calling `mount --make-rprivate /` *inside* the unshare namespace, which is obviously not happening in the above Docker example.

It obviously happens. Your mount is private (it does not have any of the shared/master/.. flags). Maybe docker does it?
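The comparison made by hand above (the mount is gone from the global namespace but still listed in the unshared one) can be automated in a rough way by diffing two mountinfo snapshots on the mount point field. This is a sketch under assumptions: `stale_mounts` is a made-up name, and mounts that legitimately exist only inside the namespace (for example private /tmp mounts set up by PrivateTmp) would show up in the output as well.

```shell
#!/bin/sh
# stale_mounts HOST_MOUNTINFO NS_MOUNTINFO
# Hypothetical helper: prints every mount point (field 5) that is listed
# in NS_MOUNTINFO but not in HOST_MOUNTINFO.
stale_mounts() {
    awk '
        # First file: remember every mount point seen on the host side.
        FILENAME == ARGV[1] { seen[$5] = 1; next }
        # Second file: report mount points the host no longer has.
        !($5 in seen)       { print $5 }
    ' "$1" "$2"
}
```

For the case in this thread, something like `stale_mounts /proc/self/mountinfo /proc/<pid>/mountinfo`, run from the global namespace with the PID of a process in the other namespace, would be expected to list the leftover devicemapper mount point.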
Re: [systemd-devel] Docker vs PrivateTmp
On Sun, Jan 18, 2015 at 08:50:35PM -0500, Colin Walters wrote:
> On Sat, Jan 17, 2015, at 11:02 PM, Lars Kellogg-Stedman wrote:
> > Hello all,
> >
> > With systemd 216 on Fedora 21 (kernel 3.17.8), I have run into an odd behavior concerning the PrivateTmp directive, and I am looking for help identifying this as:
> >
> > - Everything Is Working As Designed, Citizen
> > - A bug in Docker (some mount flag is being set incorrectly?)
>
> This should be fixed by:
> http://pkgs.fedoraproject.org/cgit/docker-io.git/commit/?id=6c9e373ee06cb1aee07d3cae426c46002663010d
> i.e. having docker.service use MountFlags=private, so its mounts aren't visible to other processes.

Colin,

Thanks for the pointer. It seems as if using MountFlags=private is going to cause a new set of problems:

Imagine that I am a system administrator using Docker to containerize services. I want to set up a webserver container on my Docker host, so I mount the web content from a remote server:

    mount my-fancy-server:/vol/content /content

And then expose that as a Docker volume:

    docker run -v /content:/content webserver

This will fail mysteriously, because with MountFlags=private, the mount of my-fancy-server:/vol/content on /content won't be visible to Docker containers. I will spend fruitless hours trying to figure out why such a seemingly simple operation is failing.

I think we actually want MountFlags=slave, which will permit mounts from the global namespace to propagate into the service namespace without permitting propagation in the other direction. It seems like this would be the Least Surprising behavior.

--
Lars Kellogg-Stedman l...@redhat.com | larsks @ {freenode,twitter,github}
Cloud Engineering / OpenStack | http://blog.oddbit.com/
Re: [systemd-devel] Docker vs PrivateTmp
On Sat, Jan 17, 2015, at 11:02 PM, Lars Kellogg-Stedman wrote:
> Hello all,
>
> With systemd 216 on Fedora 21 (kernel 3.17.8), I have run into an odd behavior concerning the PrivateTmp directive, and I am looking for help identifying this as:
>
> - Everything Is Working As Designed, Citizen
> - A bug in Docker (some mount flag is being set incorrectly?)

This should be fixed by:
http://pkgs.fedoraproject.org/cgit/docker-io.git/commit/?id=6c9e373ee06cb1aee07d3cae426c46002663010d
i.e. having docker.service use MountFlags=private, so its mounts aren't visible to other processes.
Re: [systemd-devel] Docker vs PrivateTmp
On Sun, Jan 18, 2015 at 11:38:12PM -0500, Lars Kellogg-Stedman wrote:
> I think we actually want MountFlags=slave, which will permit mounts from the global namespace to propagate into the service namespace without permitting propagation in the other direction. It seems like this would be the Least Surprising behavior.

...which would be the default if docker.service were itself using PrivateTmp=true, because from systemd.exec:

    Note that the file system namespace related options (PrivateTmp=, PrivateDevices=, ProtectSystem=, ProtectHome=, ReadOnlyDirectories=, InaccessibleDirectories= and ReadWriteDirectories=) require that mount and unmount propagation from the unit's file system namespace is disabled, and hence downgrade shared to slave.

So either explicitly setting MountFlags=slave, or setting PrivateTmp=true if that doesn't cause any issues of which I am not aware.

--
Lars Kellogg-Stedman l...@redhat.com | larsks @ {freenode,twitter,github}
Cloud Engineering / OpenStack | http://blog.oddbit.com/
Re: [systemd-devel] Docker vs PrivateTmp
On Sun, Jan 18, 2015 at 11:38:12PM -0500, Lars Kellogg-Stedman wrote:
> On Sun, Jan 18, 2015 at 08:50:35PM -0500, Colin Walters wrote:
> > On Sat, Jan 17, 2015, at 11:02 PM, Lars Kellogg-Stedman wrote:
> > > Hello all,
> > >
> > > With systemd 216 on Fedora 21 (kernel 3.17.8), I have run into an odd behavior concerning the PrivateTmp directive, and I am looking for help identifying this as:
> > >
> > > - Everything Is Working As Designed, Citizen
> > > - A bug in Docker (some mount flag is being set incorrectly?)
> >
> > This should be fixed by:
> > http://pkgs.fedoraproject.org/cgit/docker-io.git/commit/?id=6c9e373ee06cb1aee07d3cae426c46002663010d
> > i.e. having docker.service use MountFlags=private, so its mounts aren't visible to other processes.
>
> Colin,
>
> Thanks for the pointer. It seems as if using MountFlags=private is going to cause a new set of problems:
>
> Imagine that I am a system administrator using Docker to containerize services. I want to set up a webserver container on my Docker host, so I mount the web content from a remote server:
>
>     mount my-fancy-server:/vol/content /content
>
> And then expose that as a Docker volume:
>
>     docker run -v /content:/content webserver
>
> This will fail mysteriously, because with MountFlags=private, the mount of my-fancy-server:/vol/content on /content won't be visible to Docker containers. I will spend fruitless hours trying to figure out why such a seemingly simple operation is failing.
>
> I think we actually want MountFlags=slave, which will permit mounts from the global namespace to propagate into the service namespace without permitting propagation in the other direction. It seems like this would be the Least Surprising behavior.

Copying dwalsh

--
Lokesh
Freenode, OFTC: lsm5
GPG: 0xC7C3A0DD
[systemd-devel] Docker vs PrivateTmp
Hello all,

With systemd 216 on Fedora 21 (kernel 3.17.8), I have run into an odd
behavior concerning the PrivateTmp directive, and I am looking for help
identifying this as:

- Everything Is Working As Designed, Citizen
- A bug in Docker (some mount flag is being set incorrectly?)
- A bug in systemd's PrivateTmp behavior
- Something Completely Different

The TL;DR is that restarting a service with PrivateTmp=true appears to
preserve references to any mounts in the parent mount namespace that
were active at the time the service was started. If these mounts are
later unmounted in the parent namespace, the reference persists in the
child mount namespace, which means among other things that the
mountpoint cannot be deleted (Device or resource busy).

This seems to be approximately the same issue described in
https://bugzilla.redhat.com/show_bug.cgi?id=851970, but that bug is two
years old and closed.

Here's how I encountered the problem:

Assuming that your Docker is configured to use the `devicemapper`
storage driver, start a Docker container. Any container will do, e.g.:

    # cid=$(docker run -d larsks/thttpd)

See the `devicemapper` mountpoint created by Docker for the container:

    # grep devicemapper/mnt /proc/mounts
    /dev/mapper/docker-253:6-98310-e68df3f45d6151259ce84a0e467a3117840084e99ef3bbc654b33f08d2d6dd62 /var/lib/docker/devicemapper/mnt/e68df3f45d6151259ce84a0e467a3117840084e99ef3bbc654b33f08d2d6dd62 ext4 rw,context=system_u:object_r:svirt_sandbox_file_t:s0:c261,c1018,relatime,discard,stripe=16,data=ordered 0 0

Now restart a service -- any service! -- that has PrivateTmp=true:

    # systemctl restart systemd-machined

Get the PID for that service:

    # systemctl status systemd-machined | grep PID
    Main PID: 18698 (systemd-machine

And see that the Docker devicemapper mount is visible inside the mount
namespace for this process:

    # grep devicemapper/mnt /proc/18698/mounts
    /dev/mapper/docker-253:6-98310-e68df3f45d6151259ce84a0e467a3117840084e99ef3bbc654b33f08d2d6dd62 /var/lib/docker/devicemapper/mnt/e68df3f45d6151259ce84a0e467a3117840084e99ef3bbc654b33f08d2d6dd62 ext4 rw,context=system_u:object_r:svirt_sandbox_file_t:s0:c261,c1018,relatime,discard,stripe=16,data=ordered 0 0

Attempt to destroy the container:

    # docker rm -f $cid

Watch Docker fail to destroy the container because it is unable to
remove the mountpoint directory:

    Jan 17 22:43:03 pk115wp-lkellogg docker-1.4.1-dev[18239]: time=2015-01-17T22:43:03-05:00 level=error msg=Handler for DELETE /containers/{name:.*} returned error: Cannot destroy container e68df3f45d61: Driver devicemapper failed to remove root filesystem e68df3f45d6151259ce84a0e467a3117840084e99ef3bbc654b33f08d2d6dd62: Device is Busy

Because while that mount is gone from the global namespace:

    # grep devicemapper/mnt /proc/mounts

It still exists inside the mount namespace for the service we
restarted:

    # grep devicemapper/mnt /proc/18698/mounts
    /dev/mapper/docker-253:6-98310-e68df3f45d6151259ce84a0e467a3117840084e99ef3bbc654b33f08d2d6dd62 /var/lib/docker/devicemapper/mnt/e68df3f45d6151259ce84a0e467a3117840084e99ef3bbc654b33f08d2d6dd62 ext4 rw,context=system_u:object_r:svirt_sandbox_file_t:s0:c261,c1018,relatime,discard,stripe=16,data=ordered 0 0

The only solution is to restart the service holding these references:

    # systemctl restart systemd-machined

Now the mountpoint can be deleted.
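The per-PID grep above can be generalized to find every process whose
mount namespace still holds such a reference. This is only a sketch:
the function name pids_holding_mount is hypothetical, and the optional
second argument (an alternative to /proc) exists purely so the function
can be tried against a copy of the files.

```shell
#!/bin/sh
# Sketch: list PIDs whose mount table still contains a mount matching
# the given pattern (a grep basic regular expression).
pids_holding_mount() {
    pattern=$1
    proc_root=${2:-/proc}
    for f in "$proc_root"/[0-9]*/mounts; do
        # Skip entries that vanished or that we are not allowed to read.
        [ -r "$f" ] || continue
        if grep -q -- "$pattern" "$f" 2>/dev/null; then
            pid=${f%/mounts}            # strip the trailing /mounts
            printf '%s\n' "${pid##*/}"  # keep only the PID component
        fi
    done
}
```

For the case above, `pids_holding_mount devicemapper/mnt` run as root
after the container has been removed should print the PIDs of the
services that would need a restart.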

Thanks,

--
Lars Kellogg-Stedman l...@redhat.com | larsks @ {freenode,twitter,github}
Cloud Engineering / OpenStack | http://blog.oddbit.com/