Re: [systemd-devel] [ANNOUNCE] Git development moved to github
On Tue, Jun 9, 2015 at 11:37 PM, Lennart Poettering lenn...@poettering.net wrote: On Tue, 09.06.15 13:04, Filipe Brandenburger (filbran...@google.com) wrote: On Tue, Jun 9, 2015 at 12:59 PM, Lennart Poettering lenn...@poettering.net wrote: [...] so we comment and ask for a new PR, and close the old one. See my previous comment, I think this cure is worse than the disease :-) Why would you say this? Why are multiple sequencial PR, where the old obsoleted ones are closed and locked that bad? Instead, just reuse the same PR and use `git push -f` to ship new versions of the commits to the same branch... Yes it's awful but unfortunately that's how GitHub works... Yeah, it is awful, and loses all the comments, as well is incompatible with having multiple people making patch suggestions for the same issue. FWIW it only loses the comments if people comment on individual commits instead of commenting on the Files changed tab of a PR. I usually comment in this way on purpose instead of commenting on commits, so that the history of comments are kept in the PR, even after rebase (it might be folded if the chunk of the patch is not there anymore, but the comment is still in the PR). If you really want to comment on an individual commit (but I don't recommend it), you can include the reference of the PR in your comment (#42), then github will keep your comment attached to the PR. I think it is fine as it is as long as people comment in the Files changed tab. To work around the problem of line comments being lost, just ask *reviewers* to make most of the relevant comments in the PR thread and keep line comments to simple comments that are probably not going to be relevant when they're obliterated... Well, *the* major reason we switched to github is actually taking benefit of the much more powerful inline review tool that's a pleasure to work with. Reviews by mail are just awful, and reviews out-of-line are even worse. Lennart -- Lennart Poettering, Red Hat ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel
Re: [systemd-devel] systemd-218 - Requisite implies TriggeredByRestartOf
On Tue, May 19, 2015 at 1:26 AM, Lennart Poettering lenn...@poettering.net wrote: On Tue, 19.05.15 00:55, Lennart Poettering (lenn...@poettering.net) wrote: On Thu, 14.05.15 21:23, Evert (evert.gen...@planet.nl) wrote: Hi, According to the systemd documentation, Requisite disallows starting a unit unless the specified unit has been started. This seems to work fine, however, if the specified unit has been restarted, this unit will be started too! This is not what should happen and it doesn't happen with a stop and start of the specified unit, so clearly, restart behaves different than stop followed by start. Hmm, I figure this is a bug in systemd. Whenever you have a unit A declare a Requisite= on a unit B then this will actually create two dependencies: the actual Requisite= one plus one in the opposite direction, of type RequiredBy=. Dependencies of type RequiredBy= are mostly internal to systemd, you cannot configure them directly (though you can query them using systemctl show). They are used to track down what to stop when we get systemctl stop by the user. Now, as a special shortcut we currently map Requisite= and Required= to a reverse dependency of RequiredBy=, which is problematichere, since we'll hence make no distinction between the two when processing systemctl stop. I figure we need to split up the reverse dep here, and introduce a seperate RequisiteBy= dependency so that we can avoid this problem... Fixed in git. Please verify. The commit be7d9ff730cb88d7c6a869dd5c47754c78ceaef2 (core: introduce seperate reverse dependencies for Requires= and Requisite=) introduced a regression in rkt: the container does not stop anymore when the main service terminates. In my test, grep says we have one unit using a Requisite* option (systemd-update-utmp-runlevel.service). The main unit file uses one Requires option: == sha512-0543bc3759cd76d21d8605cd55e19bea.service == [Unit] Description=coreos.com/rkt-inspect DefaultDependencies=false OnFailureJobMode=isolate OnFailure=reaper.service Wants=exit-watcher.service Requires=prepare-app@-opt-stage2-sha512\x2d0543bc3759cd76d21d8605cd55e19bea-rootfs.service After=prepare-app@-opt-stage2-sha512\x2d0543bc3759cd76d21d8605cd55e19bea-rootfs.service [Service] Restart=no ExecStart=/diagexec /opt/stage2/sha512-0543bc3759cd76d21d8605cd55e19bea/rootfs / /rkt/env/sha512-0543bc3759cd76d21d8605cd55e19bea 0 0 /inspect -print-msg=Hellooo -sleep=3 User=0 Group=0 I am trying to understand the regression... it's possible that it is rkt using systemd incorrectly. Cheers, Alban ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel
Re: [systemd-devel] systemd-218 - Requisite implies TriggeredByRestartOf
On Tue, May 19, 2015 at 3:44 PM, Lennart Poettering lenn...@poettering.net wrote: On Tue, 19.05.15 13:08, Alban Crequy (al...@endocode.com) wrote: The commit be7d9ff730cb88d7c6a869dd5c47754c78ceaef2 (core: introduce seperate reverse dependencies for Requires= and Requisite=) introduced a regression in rkt: the container does not stop anymore when the main service terminates. In my test, grep says we have one unit using a Requisite* option (systemd-update-utmp-runlevel.service). The main unit file uses one Requires option: I have the suspicion that f3b85044c845de03de05135c1dd5f3298bf3e5a2 might fix your issue, can you check? Yes, it works now. Thanks! (only if you have units that use StopWhenUnneeded=yes, otherwise it's a different issue) Yes, there was a unit with StopWhenUnneeded=yes. Cheers, Alban ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel
Re: [systemd-devel] [PATCH] nspawn: cloexec extraneous fds
On Mon, May 18, 2015 at 2:00 PM, Lennart Poettering lenn...@poettering.net wrote: On Mon, 18.05.15 10:34, Alban Crequy (al...@endocode.com) wrote: On Wed, May 13, 2015 at 6:14 PM, Lennart Poettering lenn...@poettering.net wrote: On Mon, 11.05.15 16:41, Alban Crequy (alban.cre...@gmail.com) wrote: src/nspawn/nspawn.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/src/nspawn/nspawn.c b/src/nspawn/nspawn.c index 71a6239..2e45c3b 100644 --- a/src/nspawn/nspawn.c +++ b/src/nspawn/nspawn.c @@ -3739,6 +3739,9 @@ int main(int argc, char *argv[]) { bool root_device_rw = true, home_device_rw = true, srv_device_rw = true; _cleanup_close_ int master = -1, image_fd = -1; _cleanup_fdset_free_ FDSet *fds = NULL; +_cleanup_fdset_free_ FDSet *misc_fds = NULL; +int fd; +Iterator i; int r, n_fd_passed, loop_nr = -1; char veth_name[IFNAMSIZ]; bool secondary = false, remove_subvol = false; @@ -3775,7 +3778,11 @@ int main(int argc, char *argv[]) { goto finish; } } -fdset_close_others(fds); +fdset_new_fill(misc_fds); +FDSET_FOREACH(fd, fds, i) { +fdset_remove(misc_fds, fd); +} +fdset_cloexec(misc_fds, true); log_open(); Do we really need an extra FDSet object for this? Why not just remove the fdset_close_others() from the nspawn parent and adding it to the child process instead, without depending on O_CLOEXEC? Appears much simpler to as it avoids keeping two fdsets around... fdset_close_others() iterates on the open file descriptor found in /proc/self/fd/* at the time of the call, so I would need to make sure it does not harvest the newly opened file descriptors such as the ones used by the barrier, one end of the kmsg socket pair, one end of the rtnl socket pair, and possibly others. I would need to use fdset_put() on each of those fds. Although the fds used by the barrier are exported in struct Barrier, it is kind of internal structure. Well, we could just invoke it after the barrier stuff has done its deed, right before we invoke exec(), no? Right. Sorry for the noise, I'll send another patch shortly, after I finish testing it. ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel
[systemd-devel] [PATCH] nspawn: close extra fds before execing init
From: Alban Crequy al...@endocode.com When systemd-nspawn gets exec*()ed, it inherits the followings file descriptors: - 0, 1, 2: stdin, stdout, stderr - SD_LISTEN_FDS_START, ... SD_LISTEN_FDS_START+LISTEN_FDS: file descriptors passed by the system manager (useful for socket activation). They are passed to the child process (process leader). - extra lock fd: rkt passes a locked directory as an extra fd, so the directory remains locked as long as the container is alive. systemd-nspawn used to close all open fds except 0, 1, 2 and the SD_LISTEN_FDS_START..SD_LISTEN_FDS_START+LISTEN_FDS. This patch delays the close just before the exec so the nspawn process (parent) keeps the extra fds open. This patch supersedes the previous attempt (cloexec extraneous fds): http://lists.freedesktop.org/archives/systemd-devel/2015-May/031608.html --- src/nspawn/nspawn.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/src/nspawn/nspawn.c b/src/nspawn/nspawn.c index 8aa7b45..85a7bad 100644 --- a/src/nspawn/nspawn.c +++ b/src/nspawn/nspawn.c @@ -3998,7 +3998,6 @@ int main(int argc, char *argv[]) { goto finish; } } -fdset_close_others(fds); log_open(); if (arg_directory) { @@ -4509,6 +4508,8 @@ int main(int argc, char *argv[]) { * setup, too... */ (void) barrier_place_and_sync(barrier); /* #5 */ +(void) fdset_close_others(fds); + if (arg_boot) { char **a; size_t l; -- 2.1.4 ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel
Re: [systemd-devel] [PATCH] nspawn: cloexec extraneous fds
On Wed, May 13, 2015 at 6:14 PM, Lennart Poettering lenn...@poettering.net wrote: On Mon, 11.05.15 16:41, Alban Crequy (alban.cre...@gmail.com) wrote: src/nspawn/nspawn.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/src/nspawn/nspawn.c b/src/nspawn/nspawn.c index 71a6239..2e45c3b 100644 --- a/src/nspawn/nspawn.c +++ b/src/nspawn/nspawn.c @@ -3739,6 +3739,9 @@ int main(int argc, char *argv[]) { bool root_device_rw = true, home_device_rw = true, srv_device_rw = true; _cleanup_close_ int master = -1, image_fd = -1; _cleanup_fdset_free_ FDSet *fds = NULL; +_cleanup_fdset_free_ FDSet *misc_fds = NULL; +int fd; +Iterator i; int r, n_fd_passed, loop_nr = -1; char veth_name[IFNAMSIZ]; bool secondary = false, remove_subvol = false; @@ -3775,7 +3778,11 @@ int main(int argc, char *argv[]) { goto finish; } } -fdset_close_others(fds); +fdset_new_fill(misc_fds); +FDSET_FOREACH(fd, fds, i) { +fdset_remove(misc_fds, fd); +} +fdset_cloexec(misc_fds, true); log_open(); Do we really need an extra FDSet object for this? Why not just remove the fdset_close_others() from the nspawn parent and adding it to the child process instead, without depending on O_CLOEXEC? Appears much simpler to as it avoids keeping two fdsets around... fdset_close_others() iterates on the open file descriptor found in /proc/self/fd/* at the time of the call, so I would need to make sure it does not harvest the newly opened file descriptors such as the ones used by the barrier, one end of the kmsg socket pair, one end of the rtnl socket pair, and possibly others. I would need to use fdset_put() on each of those fds. Although the fds used by the barrier are exported in struct Barrier, it is kind of internal structure. By using fdset_cloexec() instead, I don't need to keep in sync the fds to add in the FDSet whenever a new fd is added in systemd-nspawn. ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel
[systemd-devel] [PATCH] [PATCH v4] core: Private*/Protect* options with RootDirectory
From: Alban Crequy al...@endocode.com When a service is chrooted with the option RootDirectory=/opt/..., then the options PrivateDevices, PrivateTmp, ProtectHome, ProtectSystem must mount the directories under $RootDirectory/{dev,tmp,home,usr,boot}. The test-ns tool can test setup_namespace() with and without chroot: $ sudo TEST_NS_PROJECTS=/home/lennart/projects ./test-ns $ sudo TEST_NS_CHROOT=/home/alban/debian-tree TEST_NS_PROJECTS=/home/alban/debian-tree/home/alban/Documents ./test-ns v4: - rebase on origin/master - coding style - use exec_needs_mount_namespace() - rename variable chroot - root_directory - rename variable new_mnt_namespace - needs_mount_namespace - prefer prefix_roota() to strjoina() - update ./test-ns to print something v3: - add missing include mkdir.h - fixes after the review http://lists.freedesktop.org/archives/systemd-devel/2015-April/031202.html - move logic to setup_namespace() - either call setup_namespace() or chroot() - update ./test-ns so that I can run it without having /home/lennart v2: http://lists.freedesktop.org/archives/systemd-devel/2015-February/028636.html - create the $RootDirectory/dev directory if missing. This is consistent with mount unit creating the mount points if missing. - avoid arbitrary limitation on 512 characters v1: http://lists.freedesktop.org/archives/systemd-devel/2015-February/028609.html --- src/core/execute.c | 8 -- src/core/namespace.c | 80 +--- src/core/namespace.h | 3 +- src/test/test-ns.c | 24 ++-- 4 files changed, 100 insertions(+), 15 deletions(-) diff --git a/src/core/execute.c b/src/core/execute.c index 0cca481..97498b2 100644 --- a/src/core/execute.c +++ b/src/core/execute.c @@ -1307,6 +1307,7 @@ static int exec_child( uid_t uid = UID_INVALID; gid_t gid = GID_INVALID; int i, r; +bool needs_mount_namespace; assert(unit); assert(command); @@ -1585,7 +1586,9 @@ static int exec_child( } } -if (exec_needs_mount_namespace(context, params, runtime)) { +needs_mount_namespace = exec_needs_mount_namespace(context, params, runtime); + +if (needs_mount_namespace) { char *tmp = NULL, *var = NULL; /* The runtime struct only contains the parent @@ -1602,6 +1605,7 @@ static int exec_child( } r = setup_namespace( +params-apply_chroot ? context-root_directory : NULL, context-read_write_dirs, context-read_only_dirs, context-inaccessible_dirs, @@ -1627,7 +1631,7 @@ static int exec_child( } if (params-apply_chroot) { -if (context-root_directory) +if (!needs_mount_namespace context-root_directory) if (chroot(context-root_directory) 0) { *exit_status = EXIT_CHROOT; return -errno; diff --git a/src/core/namespace.c b/src/core/namespace.c index bee2839..4f9b95e 100644 --- a/src/core/namespace.c +++ b/src/core/namespace.c @@ -36,6 +36,7 @@ #include dev-setup.h #include selinux-util.h #include namespace.h +#include mkdir.h typedef enum MountMode { /* This is ordered by priority! */ @@ -124,6 +125,22 @@ static void drop_duplicates(BindMount *m, unsigned *n) { *n = t - m; } +static int mount_move_root(const char *path) { +if (chdir(path) 0) +return -errno; + +if (mount(path, /, NULL, MS_MOVE, NULL) 0) +return -errno; + +if (chroot(.) 0) +return -errno; + +if (chdir(/) 0) +return -errno; + +return 0; +} + static int mount_dev(BindMount *m) { static const char devnodes[] = /dev/null\0 @@ -225,7 +242,13 @@ static int mount_dev(BindMount *m) { dev_setup(temporary_mount); -if (mount(dev, /dev/, NULL, MS_MOVE, NULL) 0) { +/* Create the /dev directory if missing. It is more likely to be + * missing when the service is started with RootDirectory. This is + * consistent with mount units creating the mount points when missing. + */ +(void) mkdir_p_label(m-path, 0755); + +if (mount(dev, m-path, NULL, MS_MOVE, NULL) 0) { r = -errno; goto fail; } @@ -404,6 +427,7 @@ static int make_read_only(BindMount *m) { } int setup_namespace( +const char* root_directory, char** read_write_dirs, char** read_only_dirs, char** inaccessible_dirs, @@ -449,37 +473,56 @@ int setup_namespace( return r; if (tmp_dir) { -m-path = /tmp
[systemd-devel] [PATCH] [PATCH v3] core: Private*/Protect* options with RootDirectory
From: Alban Crequy al...@endocode.com When a service is chrooted with the option RootDirectory=/opt/..., then the options PrivateDevices, PrivateTmp, ProtectHome, ProtectSystem must mount the directories under $RootDirectory/{dev,tmp,home,usr,boot}. This can be tested with test-ns as root: # export TEST_NS_PROJECTS=/home/alban/debian-tree/dev2/ # export TEST_NS_CHROOT=/home/alban/debian-tree # ./test-ns Questions: - I'm not quite sure about whether we can rollback in setup_namespace() v3: - add missing include mkdir.h - fixes after the review http://lists.freedesktop.org/archives/systemd-devel/2015-April/031202.html - move logic to setup_namespace() - either call setup_namespace() or chroot() - update ./test-ns so that I can run it without having /home/lennart v2: http://lists.freedesktop.org/archives/systemd-devel/2015-February/028636.html - create the $RootDirectory/dev directory if missing. This is consistent with mount unit creating the mount points if missing. - avoid arbitrary limitation on 512 characters v1: http://lists.freedesktop.org/archives/systemd-devel/2015-February/028609.html --- src/core/execute.c | 9 -- src/core/namespace.c | 77 +--- src/core/namespace.h | 3 +- src/test/test-ns.c | 10 +-- 4 files changed, 83 insertions(+), 16 deletions(-) diff --git a/src/core/execute.c b/src/core/execute.c index 1a297ba..d4ccac6 100644 --- a/src/core/execute.c +++ b/src/core/execute.c @@ -1277,6 +1277,7 @@ static int exec_child( uid_t uid = UID_INVALID; gid_t gid = GID_INVALID; int i, r; +bool new_mnt_namespace; assert(unit); assert(command); @@ -1555,7 +1556,7 @@ static int exec_child( } } -if (!strv_isempty(context-read_write_dirs) || +new_mnt_namespace = !strv_isempty(context-read_write_dirs) || !strv_isempty(context-read_only_dirs) || !strv_isempty(context-inaccessible_dirs) || context-mount_flags != 0 || @@ -1563,8 +1564,9 @@ static int exec_child( params-bus_endpoint_path || context-private_devices || context-protect_system != PROTECT_SYSTEM_NO || -context-protect_home != PROTECT_HOME_NO) { +context-protect_home != PROTECT_HOME_NO; +if (new_mnt_namespace) { char *tmp = NULL, *var = NULL; /* The runtime struct only contains the parent @@ -1581,6 +1583,7 @@ static int exec_child( } r = setup_namespace( +params-apply_chroot ? context-root_directory : NULL, context-read_write_dirs, context-read_only_dirs, context-inaccessible_dirs, @@ -1606,7 +1609,7 @@ static int exec_child( } if (params-apply_chroot) { -if (context-root_directory) +if (!new_mnt_namespace context-root_directory) if (chroot(context-root_directory) 0) { *exit_status = EXIT_CHROOT; return -errno; diff --git a/src/core/namespace.c b/src/core/namespace.c index 718da23..9f5ba92 100644 --- a/src/core/namespace.c +++ b/src/core/namespace.c @@ -36,6 +36,7 @@ #include dev-setup.h #include selinux-util.h #include namespace.h +#include mkdir.h typedef enum MountMode { /* This is ordered by priority! */ @@ -124,6 +125,26 @@ static void drop_duplicates(BindMount *m, unsigned *n) { *n = t - m; } +static int mount_move_root(const char *path) { +if (chdir(path) 0) { +return -errno; +} + +if (mount(path, /, NULL, MS_MOVE, NULL) 0) { +return -errno; +} + +if (chroot(.) 0) { +return -errno; +} + +if (chdir(/) 0) { +return -errno; +} + +return 0; +} + static int mount_dev(BindMount *m) { static const char devnodes[] = /dev/null\0 @@ -225,7 +246,13 @@ static int mount_dev(BindMount *m) { dev_setup(temporary_mount); -if (mount(dev, /dev/, NULL, MS_MOVE, NULL) 0) { +/* Create the /dev directory if missing. It is more likely to be + * missing when the service is started with RootDirectory. This is + * consistent with mount units creating the mount points when missing. + */ +mkdir_p_label (m-path, 0755); + +if (mount(dev, m-path, NULL, MS_MOVE, NULL) 0) { r = -errno; goto fail; } @@ -404,6 +431,7 @@ static int make_read_only(BindMount *m) { } int setup_namespace( +const char* chroot, char** read_write_dirs, char** read_only_dirs, char
[systemd-devel] [PATCH] [PATCH v2] nspawn: check the pid in SIGCHLD handler before terminating the container
From: Alban Crequy al...@endocode.com When a process starts systemd-nspawn with exec*() without fork(), systemd-nspawn can be the parent process of children processes unknown to systemd-nspawn. It can then receive the signal SIGCHLD for both the container leader process and the previously started processes. So it should distinguish them. v2: - correctly check if a child was in a waitable state. --- src/nspawn/nspawn.c | 55 +++-- 1 file changed, 49 insertions(+), 6 deletions(-) diff --git a/src/nspawn/nspawn.c b/src/nspawn/nspawn.c index 05d2c71..71a6239 100644 --- a/src/nspawn/nspawn.c +++ b/src/nspawn/nspawn.c @@ -23,6 +23,7 @@ #include sched.h #include unistd.h #include sys/types.h +#include sys/wait.h #include sys/mount.h #include stdlib.h #include string.h @@ -3511,6 +3512,12 @@ static int change_uid_gid(char **_home) { return 0; } +typedef struct SigChildData { +pid_t leader_pid; +siginfo_t leader_status; +bool terminated; +} SigChildData; + /* * Return values: * 0 : wait_for_terminate() failed to get the state of the @@ -3528,13 +3535,17 @@ static int change_uid_gid(char **_home) { * That is, success is indicated by a return value of zero, and an * error is indicated by a non-zero value. */ -static int wait_for_container(pid_t pid, ContainerStatus *container) { +static int wait_for_container(SigChildData *sigchld_ctx, pid_t pid, ContainerStatus *container) { siginfo_t status; int r; -r = wait_for_terminate(pid, status); -if (r 0) -return log_warning_errno(r, Failed to wait for container: %m); +if (sigchld_ctx-terminated) { +status = sigchld_ctx-leader_status; +} else { +r = wait_for_terminate(pid, status); +if (r 0) +return log_warning_errno(r, Failed to wait for container: %m); +} switch (status.si_code) { @@ -3594,6 +3605,35 @@ static int on_orderly_shutdown(sd_event_source *s, const struct signalfd_siginfo return 0; } +static int on_sigchld(sd_event_source *s, const struct signalfd_siginfo *si, void *userdata) { +SigChildData *ctx = userdata; +pid_t leader_pid = ctx-leader_pid; + +/* several terminated children could be merged in a single SIGCHLD, so + * don't rely on si-ssi_pid. */ + +while (1) { +int r; +siginfo_t status; + +zero(status); +r = waitid(P_ALL, -1, status, WEXITED | WNOHANG); +if (r 0) +break; + +if (status.si_pid = 0) +break; + +if (status.si_pid == leader_pid) { +ctx-leader_status = status; +ctx-terminated = true; +return sd_event_exit(sd_event_source_get_event(s), 0); +} +} + +return 0; +} + static int determine_names(void) { int r; @@ -3917,6 +3957,7 @@ int main(int argc, char *argv[]) { .sa_handler = nop_handler, .sa_flags = SA_NOCLDSTOP, }; +SigChildData sigchld_ctx = {0,}; r = barrier_create(barrier); if (r 0) { @@ -4386,7 +4427,9 @@ int main(int argc, char *argv[]) { } /* simply exit on sigchld */ -sd_event_add_signal(event, NULL, SIGCHLD, NULL, NULL); +sigchld_ctx.leader_pid = pid; +sigchld_ctx.terminated = false; +sd_event_add_signal(event, NULL, SIGCHLD, on_sigchld , sigchld_ctx); if (arg_expose_ports) { r = watch_rtnl(event, rtnl_socket_pair[0], exposed, rtnl); @@ -4425,7 +4468,7 @@ int main(int argc, char *argv[]) { /* Normally redundant, but better safe than sorry */ kill(pid, SIGKILL); -r = wait_for_container(pid, container_status); +r = wait_for_container(sigchld_ctx, pid, container_status); pid = 0; if (r 0) -- 2.1.4 ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel
[systemd-devel] [PATCH] nspawn: cloexec extraneous fds
From: Alban Crequy al...@endocode.com When systemd-nspawn gets exec*()ed, it inherits the followings file descriptors: - 0, 1, 2: stdin, stdout, stderr - SD_LISTEN_FDS_START, ... SD_LISTEN_FDS_START+LISTEN_FDS: file descriptors passed by the system manager (useful for socket activation). They are passed to the child process (process leader). - extra lock fd: rkt passes a locked directory as an extra fd, so the directory remains locked as long as the container is alive. systemd-nspawn used to close all open fds except 0, 1, 2 and the SD_LISTEN_FDS_START..SD_LISTEN_FDS_START+LISTEN_FDS. This patch just cloexecs them instead so they stay open in the systemd-nspawn process but they are not passed to the process leader. --- src/nspawn/nspawn.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/src/nspawn/nspawn.c b/src/nspawn/nspawn.c index 71a6239..2e45c3b 100644 --- a/src/nspawn/nspawn.c +++ b/src/nspawn/nspawn.c @@ -3739,6 +3739,9 @@ int main(int argc, char *argv[]) { bool root_device_rw = true, home_device_rw = true, srv_device_rw = true; _cleanup_close_ int master = -1, image_fd = -1; _cleanup_fdset_free_ FDSet *fds = NULL; +_cleanup_fdset_free_ FDSet *misc_fds = NULL; +int fd; +Iterator i; int r, n_fd_passed, loop_nr = -1; char veth_name[IFNAMSIZ]; bool secondary = false, remove_subvol = false; @@ -3775,7 +3778,11 @@ int main(int argc, char *argv[]) { goto finish; } } -fdset_close_others(fds); +fdset_new_fill(misc_fds); +FDSET_FOREACH(fd, fds, i) { +fdset_remove(misc_fds, fd); +} +fdset_cloexec(misc_fds, true); log_open(); if (arg_directory) { -- 2.1.4 ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel
[systemd-devel] [PATCH] nspawn: check the pid in SIGCHLD handler before terminating the container
From: Alban Crequy al...@endocode.com When a process starts systemd-nspawn with exec*() without fork(), systemd-nspawn can be the parent process of children processes unknown to systemd-nspawn. It can then receive the signal SIGCHLD for both the container leader process and the previously started processes. So it should distinguish them. --- src/nspawn/nspawn.c | 51 +-- 1 file changed, 45 insertions(+), 6 deletions(-) diff --git a/src/nspawn/nspawn.c b/src/nspawn/nspawn.c index 05d2c71..49f6650 100644 --- a/src/nspawn/nspawn.c +++ b/src/nspawn/nspawn.c @@ -23,6 +23,7 @@ #include sched.h #include unistd.h #include sys/types.h +#include sys/wait.h #include sys/mount.h #include stdlib.h #include string.h @@ -3511,6 +3512,12 @@ static int change_uid_gid(char **_home) { return 0; } +typedef struct SigChildData { +pid_t leader_pid; +siginfo_t leader_status; +bool terminated; +} SigChildData; + /* * Return values: * 0 : wait_for_terminate() failed to get the state of the @@ -3528,13 +3535,17 @@ static int change_uid_gid(char **_home) { * That is, success is indicated by a return value of zero, and an * error is indicated by a non-zero value. */ -static int wait_for_container(pid_t pid, ContainerStatus *container) { +static int wait_for_container(SigChildData *sigchld_ctx, pid_t pid, ContainerStatus *container) { siginfo_t status; int r; -r = wait_for_terminate(pid, status); -if (r 0) -return log_warning_errno(r, Failed to wait for container: %m); +if (sigchld_ctx-terminated) { +status = sigchld_ctx-leader_status; +} else { +r = wait_for_terminate(pid, status); +if (r 0) +return log_warning_errno(r, Failed to wait for container: %m); +} switch (status.si_code) { @@ -3594,6 +3605,31 @@ static int on_orderly_shutdown(sd_event_source *s, const struct signalfd_siginfo return 0; } +static int on_sigchld(sd_event_source *s, const struct signalfd_siginfo *si, void *userdata) { +SigChildData *ctx = userdata; +pid_t leader_pid = ctx-leader_pid; + +/* several terminated children could be merged in a single SIGCHLD, so + * don't rely on si-ssi_pid. */ + +while (1) { +int r; +siginfo_t status; + +zero(status); +r = waitid(P_ALL, -1, status, WEXITED | WNOHANG); +if (r 0) +break; +if (status.si_pid == leader_pid) { +ctx-leader_status = status; +ctx-terminated = true; +return sd_event_exit(sd_event_source_get_event(s), 0); +} +} + +return 0; +} + static int determine_names(void) { int r; @@ -3917,6 +3953,7 @@ int main(int argc, char *argv[]) { .sa_handler = nop_handler, .sa_flags = SA_NOCLDSTOP, }; +SigChildData sigchld_ctx = {0,}; r = barrier_create(barrier); if (r 0) { @@ -4386,7 +4423,9 @@ int main(int argc, char *argv[]) { } /* simply exit on sigchld */ -sd_event_add_signal(event, NULL, SIGCHLD, NULL, NULL); +sigchld_ctx.leader_pid = pid; +sigchld_ctx.terminated = false; +sd_event_add_signal(event, NULL, SIGCHLD, on_sigchld , sigchld_ctx); if (arg_expose_ports) { r = watch_rtnl(event, rtnl_socket_pair[0], exposed, rtnl); @@ -4425,7 +4464,7 @@ int main(int argc, char *argv[]) { /* Normally redundant, but better safe than sorry */ kill(pid, SIGKILL); -r = wait_for_container(pid, container_status); +r = wait_for_container(sigchld_ctx, pid, container_status); pid = 0; if (r 0) -- 2.1.4 ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel
Re: [systemd-devel] [Q] About supporting nested systemd daemon
On Wed, Feb 25, 2015 at 6:48 PM, Lennart Poettering lenn...@poettering.net wrote: On Wed, 25.02.15 00:05, Cyrill Gorcunov (gorcu...@gmail.com) wrote: Hi all! I would really appreciate if someone enlighten me if there is some simple solution for the problem we met in OpenVZ: modern containers are mostly systemd based so that once it is started up the systemd daemon mounts own instance of the systemd cgroup (if previously has not been pre-mounted by container startup tools or whatever). To make a strict isolation of nested systemd cgroup (by nested I mean systemd cgroup instance mounted inside container) we've patched the kernel so that container's systemd obtains own instance of cgroup non-intersected anyhow with one present on a host system. And we would really love to get rid of this kind of kernel's hack but be able to isolate nested systemd with own cgroup instance using solely userspace tools. Is there some way to reach this? Not really. cgroupfs doesn't really allow that. First of all the root cgroup has a different set of attributes than child cgroups, hence you cannot mount an arbitrary child to the root cgroup and assume it works. But even worse, /proc/$PID/cgroup actually contains the full cgroup path, and hence mounting only a subtree would break the refernces from that file. systemd-nspawn nowadays mounts all hierarchies into the container, but mounts all controller hierarchies read-only, and of the name=systemd hierarchy mounts everything read-only, except the subtree the container is allowed to manage. That way only the cgroup tree the container needs access to is writable to it. That solution however does not hide the cgroup tree. A process running inside the container can still go an explore the tree and its attributes. However, all other groups will appear empty to it, since processes not in the container PID namespaces will be suppressed when reading the member process list. To sum up what systemd-nspawn is currently mounting in the container: - /sys/fs/cgroup/systemd/ -- mounted RO - /sys/fs/cgroup/systemd/machine.slice/machine-xxx.scope/ -- mounted RW - /sys/fs/cgroup/cpu,cpuacct/ -- mounted RO - etc. for other cgroup hierarchies -- mounted RO In order to let systemd in the container restrict cpu, memory, etc. on some of its services (see manpage systemd.resource-control(5)), rkt would like systemd-nspawn to mount a subtree of some hierarchy (cpu,cpuacct, memory) in read-write mode. Is there any issues with changing the systemd-nspawn mounts in the following way: - /sys/fs/cgroup/systemd/ -- mounted RO - /sys/fs/cgroup/systemd/machine.slice/machine-xxx.scope/ -- mounted RW - /sys/fs/cgroup/cpu,cpuacct/ -- mounted RO - /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-xxx.scope/ -- mounted RW - etc. for other cgroup hierarchies. Iago wrote two experimental patches on systemd-nspawn to try that and it worked. Delegate=yes was enabled in systemd-nspawn in order to test this: https://github.com/endocode/systemd/commits/iaguis/delegate But I would like to know what is missing to make this safe (or if it is already safe to do). There have been proposals on LKML to add cgroup namespacings, but no idea where that went. LXC created a FUSE emulation of /proc and /sys, called lxcfs to solve this problem. Quite honestly I find this a pretty crazy idea however. If I understand correctly we can provide separate slice to container's systemd leaving the rest of host cgroup in ro mode, right? Yes. If so maybe there a way to hide host cgroup completely from container so it would see only own cgroup in sysfs? I don't see how this could work. I mean, you could overmount all other cgroup siblings with empty directories in the containers, but not realy scalable nor compatible with cgroups being added or removed later on... Lennart -- Lennart Poettering, Red Hat ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel
Re: [systemd-devel] [PATCH] unit: When stopping due to BindsTo=, log which unit caused it
On Fri, Apr 24, 2015 at 12:45 PM, Lennart Poettering lenn...@poettering.net wrote: On Wed, 22.04.15 16:55, Alban Crequy (al...@endocode.com) wrote: Thanks for the commits. They don't seem related to containers. I can reproduce my issue on git-master: sudo ~/git/systemd/systemd-nspawn --register=false --bind $HOME/tmp/vol -D debian-tree -b Then, in the container, make sure /bin/umount does NOT exist. Then halt the container with kill -37 1 (SIGRTMIN+3) We require /bin/mount and /bin/umount to exist. We do not support systems where you remove those. We also don't support systems without glibc either, ... ;-) Fair enough about the dependency on umount/mount :) I added /bin/mount and /bin/umount in the container for my test and now systemd in the container says: Unit opt-stage2-sha512(...)-rootfs-dir1.mount is bound to inactive unit dev-disk-by\x2duuid-25ea81c8\x2d20d8\x2d4ab1\x2d862c\x2d882a04478837.device. Stopping, too. The directory /opt/stage2/sha512xxx/rootfs/dir1 is the bind mount specified on the systemd-nspawn --bind command line. How can I tell systemd in the nspawn container *not* to umount the volumes prepared by nspawn? Note that systemd is also trying to umount other bind-mounted directories but it fails because the processes in the container are using it: umount: /opt/stage2/sha512-ba93cedc478ed21c03d690b5f026205f/rootfs: target is busy And systemd keeps trying to umount them in a busy loop. How does systemd detect that dev-disk-by...device is inactive? Cheers, Alban ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel
Re: [systemd-devel] [PATCH] unit: When stopping due to BindsTo=, log which unit caused it
On Fri, Apr 24, 2015 at 5:34 PM, Lennart Poettering lenn...@poettering.net wrote: On Fri, 24.04.15 17:10, Alban Crequy (al...@endocode.com) wrote: On Fri, Apr 24, 2015 at 12:45 PM, Lennart Poettering lenn...@poettering.net wrote: On Wed, 22.04.15 16:55, Alban Crequy (al...@endocode.com) wrote: Thanks for the commits. They don't seem related to containers. I can reproduce my issue on git-master: sudo ~/git/systemd/systemd-nspawn --register=false --bind $HOME/tmp/vol -D debian-tree -b Then, in the container, make sure /bin/umount does NOT exist. Then halt the container with kill -37 1 (SIGRTMIN+3) We require /bin/mount and /bin/umount to exist. We do not support systems where you remove those. We also don't support systems without glibc either, ... ;-) Fair enough about the dependency on umount/mount :) I added /bin/mount and /bin/umount in the container for my test and now systemd in the container says: Unit opt-stage2-sha512(...)-rootfs-dir1.mount is bound to inactive unit dev-disk-by\x2duuid-25ea81c8\x2d20d8\x2d4ab1\x2d862c\x2d882a04478837.device. Stopping, too. I figure we shouldn't bother with adding bindsto dependencies for .device units in containers, given that .device units are not supported there anyway. Fix: http://cgit.freedesktop.org/systemd/systemd/commit/?id=47bc12e1ba35d38edda737dae232088d6d3ae688 Please verify, Thanks for the fix, it works for me! I tested the fix both from git master and cherry-picked on v219. Cheers, Alban ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel
Re: [systemd-devel] [PATCH] unit: When stopping due to BindsTo=, log which unit caused it
On Tue, Apr 21, 2015 at 10:35 PM, Zbigniew Jędrzejewski-Szmek zbys...@in.waw.pl wrote: On Tue, Apr 21, 2015 at 03:54:35PM +0200, Alban Crequy wrote: On Sat, Feb 28, 2015 at 5:40 PM, Lennart Poettering lenn...@poettering.net wrote: On Fri, 27.02.15 17:13, Lennart Poettering (lenn...@poettering.net) wrote: On Thu, 26.02.15 16:50, Martin Pitt (martin.p...@ubuntu.com) wrote: IMHO it would be prudent to skip adding the BindsTo= if at the time of creating the .mount unit the backing .device unit doesn't actually exist. In that case it's a mount which isn't managed by systemd, and we shouldn't touch it. We mostly want this BindsTo= for mounts where the .device units *do* exist, so that when they go away we can clean up the mount (mostly for hotpluggable devices and removable media). I'll have a deeper look ASAP. I ran into this myself the other day, and Kay, Daniel and I spent a lot of time to come up with a scheme how to deal with the race... And I think we have a nice scheme now and I started implementing it. The idea is that .device units will gain a third state: currently they are either dead or plugged, and the new state will be tentative. It is entered when a device is referenced in /proc/self/mountinfo or /proc/swap, even though it has not yet shown up via udev. This new state has will not result in BindsTo= getting active. This is implemented now. Please check if this fixes this issue for you. Which commit implements this? I have an issue on nspawn container shutdown with v219 that I didn't have with v215. Systemd is trying to unmount the volumes bind mounted by nspawn and it fails because /bin/umount does not exist in my container. Systemd keeps trying to umount it in a busy loop. Reverting 06e9783e2cc12eb6514e80c7f0014295f59b (core/mount: add dependencies to dynamically mounted mounts too) fixes my issue. There was a bunch of fixes on top. See 5bd4b173605142, 628c89cc68ab96fce2d, 496068a8288084ab3. Thanks for the commits. They don't seem related to containers. I can reproduce my issue on git-master: sudo ~/git/systemd/systemd-nspawn --register=false --bind $HOME/tmp/vol -D debian-tree -b Then, in the container, make sure /bin/umount does NOT exist. Then halt the container with kill -37 1 (SIGRTMIN+3) It will loop forever on: [ OK ] Failed unmounting /home/alban/tmp/vol. Unmounting /home/alban/tmp/vol... [ OK ] Failed unmounting /home/alban/tmp/vol. Unmounting /home/alban/tmp/vol... I am not sure why it worked fine for me with the older version v215. Cheers, Alban ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel
Re: [systemd-devel] [PATCH] unit: When stopping due to BindsTo=, log which unit caused it
On Sat, Feb 28, 2015 at 5:40 PM, Lennart Poettering lenn...@poettering.net wrote: On Fri, 27.02.15 17:13, Lennart Poettering (lenn...@poettering.net) wrote: On Thu, 26.02.15 16:50, Martin Pitt (martin.p...@ubuntu.com) wrote: IMHO it would be prudent to skip adding the BindsTo= if at the time of creating the .mount unit the backing .device unit doesn't actually exist. In that case it's a mount which isn't managed by systemd, and we shouldn't touch it. We mostly want this BindsTo= for mounts where the .device units *do* exist, so that when they go away we can clean up the mount (mostly for hotpluggable devices and removable media). I'll have a deeper look ASAP. I ran into this myself the other day, and Kay, Daniel and I spent a lot of time to come up with a scheme how to deal with the race... And I think we have a nice scheme now and I started implementing it. The idea is that .device units will gain a third state: currently they are either dead or plugged, and the new state will be tentative. It is entered when a device is referenced in /proc/self/mountinfo or /proc/swap, even though it has not yet shown up via udev. This new state has will not result in BindsTo= getting active. This is implemented now. Please check if this fixes this issue for you. Which commit implements this? I have an issue on nspawn container shutdown with v219 that I didn't have with v215. Systemd is trying to unmount the volumes bind mounted by nspawn and it fails because /bin/umount does not exist in my container. Systemd keeps trying to umount it in a busy loop. Reverting 06e9783e2cc12eb6514e80c7f0014295f59b (core/mount: add dependencies to dynamically mounted mounts too) fixes my issue. Cheers, Alban ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel
[systemd-devel] [PATCH] [RFC] umount: reduce verbosity
From: Alban Crequy al...@endocode.com When a systemd-nspawn container terminates, systemd umounts all bind mounts that were mounted in the container and generates a log for each umount. This additional log_info was added by bce93b7ac7642426039863493694d8c12812e2a7 for debugging shutdown. But surely log_debug is enough. I don't want to see them when when systemd is started with --log-target=null. There are other log_info that I would like to mute on --log-target=null. Is changing log_info to log_debug the correct approach? This patch is useful for: https://github.com/coreos/rkt/issues/690 When rkt is started with --debug, the systemd logs are printed. When rkt is started without --debug, systemd is started with --log-target=null in order to mute the logs. --- src/core/umount.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/core/umount.c b/src/core/umount.c index bee267a..b5a466c 100644 --- a/src/core/umount.c +++ b/src/core/umount.c @@ -401,7 +401,7 @@ static int mount_points_list_umount(MountPoint **head, bool *changed, bool log_e /* Trying to umount. We don't force here since we rely * on busy NFS and FUSE file systems to return EBUSY * until we closed everything on top of them. */ -log_info(Unmounting %s., m-path); +log_debug(Unmounting %s., m-path); if (umount2(m-path, 0) == 0) { if (changed) *changed = true; -- 2.1.4 ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel
[systemd-devel] [PATCH] [PATCH v2] nspawn: fallback on bind mount when mknod fails
From: Alban Crequy al...@endocode.com Some systems abusively restrict mknod, even when the device node already exists in /dev. This is unfortunate because it prevents systemd-nspawn from creating the basic devices in /dev in the container. This patch implements a workaround: when mknod fails, fallback on bind mounts. Additionally, /dev/console was created with a mknod with the same major/minor as /dev/null before bind mounting a pts on it. This patch removes the mknod and creates an empty regular file instead. In order to test this patch, I used the following configuration, which I think should replicate the system with the abusive restriction on mknod: # grep devices /proc/self/cgroup 4:devices:/user.slice/restrict # cat /sys/fs/cgroup/devices/user.slice/restrict/devices.list c 1:9 r c 5:2 rw c 136:* rw # systemd-nspawn --register=false -D . --- src/nspawn/nspawn.c | 27 ++- 1 file changed, 14 insertions(+), 13 deletions(-) diff --git a/src/nspawn/nspawn.c b/src/nspawn/nspawn.c index b830141..e85d298 100644 --- a/src/nspawn/nspawn.c +++ b/src/nspawn/nspawn.c @@ -1449,8 +1449,17 @@ static int copy_devnodes(const char *dest) { return -r; } -if (mknod(to, st.st_mode, st.st_rdev) 0) -return log_error_errno(errno, mknod(%s) failed: %m, to); +if (mknod(to, st.st_mode, st.st_rdev) 0) { +if (errno != EPERM) +return log_error_errno(errno, mknod(%s) failed: %m, to); + +/* Some systems abusively restrict mknod but + * allow bind mounts. */ +if (touch(to) 0) +return log_error_errno(errno, touch (%s) failed: %m, to); +if (mount(from, to, NULL, MS_BIND, NULL) 0) +return log_error_errno(errno, both mknod and bind mount (%s) failed: %m, to); +} if (arg_userns arg_uid_shift != UID_INVALID) if (lchown(to, arg_uid_shift, arg_uid_shift) 0) @@ -1481,7 +1490,6 @@ static int setup_ptmx(const char *dest) { static int setup_dev_console(const char *dest, const char *console) { _cleanup_umask_ mode_t u; const char *to; -struct stat st; int r; assert(dest); @@ -1489,24 +1497,17 @@ static int setup_dev_console(const char *dest, const char *console) { u = umask(); -if (stat(/dev/null, st) 0) -return log_error_errno(errno, Failed to stat /dev/null: %m); - r = chmod_and_chown(console, 0600, 0, 0); if (r 0) return log_error_errno(r, Failed to correct access mode for TTY: %m); /* We need to bind mount the right tty to /dev/console since * ptys can only exist on pts file systems. To have something - * to bind mount things on we create a device node first, and - * use /dev/null for that since we the cgroups device policy - * allows us to create that freely, while we cannot create - * /dev/console. (Note that the major minor doesn't actually - * matter here, since we mount it over anyway). */ + * to bind mount things on we create a empty regular file. */ to = strjoina(dest, /dev/console); -if (mknod(to, (st.st_mode ~0) | 0600, st.st_rdev) 0) -return log_error_errno(errno, mknod() for /dev/console failed: %m); +if (touch(to) 0) +return log_error_errno(errno, touch() for /dev/console failed: %m); if (mount(console, to, NULL, MS_BIND, NULL) 0) return log_error_errno(errno, Bind mount for /dev/console failed: %m); -- 2.1.4 ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel
Re: [systemd-devel] [PATCH] [PATCH v3] nspawn: fallback on bind mount when mknod fails
On Tue, Mar 31, 2015 at 5:35 PM, Dave Reisner d...@falconindy.com wrote: On Tue, Mar 31, 2015 at 05:14:48PM +0200, Alban Crequy wrote: From: Alban Crequy al...@endocode.com Some systems abusively restrict mknod, even when the device node already exists in /dev. This is unfortunate because it prevents systemd-nspawn from creating the basic devices in /dev in the container. This patch implements a workaround: when mknod fails, fallback on bind mounts. Additionally, /dev/console was created with a mknod with the same major/minor as /dev/null before bind mounting a pts on it. This patch removes the mknod and creates an empty regular file instead. In order to test this patch, I used the following configuration, which I think should replicate the system with the abusive restriction on mknod: # grep devices /proc/self/cgroup 4:devices:/user.slice/restrict # cat /sys/fs/cgroup/devices/user.slice/restrict/devices.list c 1:9 r c 5:2 rw c 136:* rw # systemd-nspawn --register=false -D . v2: - remove bind, it is not needed since there is already MS_BIND v3: - fix error management when calling touch() - fix lowercase in error message --- src/nspawn/nspawn.c | 29 - 1 file changed, 16 insertions(+), 13 deletions(-) diff --git a/src/nspawn/nspawn.c b/src/nspawn/nspawn.c index b830141..7e56cf2 100644 --- a/src/nspawn/nspawn.c +++ b/src/nspawn/nspawn.c @@ -1449,8 +1449,18 @@ static int copy_devnodes(const char *dest) { return -r; } -if (mknod(to, st.st_mode, st.st_rdev) 0) -return log_error_errno(errno, mknod(%s) failed: %m, to); +if (mknod(to, st.st_mode, st.st_rdev) 0) { +if (errno != EPERM) +return log_error_errno(errno, mknod(%s) failed: %m, to); + +/* Some systems abusively restrict mknod but + * allow bind mounts. */ +r = touch(to); +if (r 0) +return log_error_errno(r, touch (%s) failed: %m, to); Is this really what you wanted? It's not obvious that errno will have the correct value when %m is evaluated. I would have used strerror(-r) here. log_error_errno() is a macro that ends up calling log_internalv() and it will assign errno to the parameter r in order to get %m working: src/shared/log.c: log_internalv(): /* Make sure that %m maps to the specified error */ if (error != 0) errno = error; +if (mount(from, to, NULL, MS_BIND, NULL) 0) +return log_error_errno(errno, Both mknod and bind mount (%s) failed: %m, to); +} if (arg_userns arg_uid_shift != UID_INVALID) if (lchown(to, arg_uid_shift, arg_uid_shift) 0) @@ -1481,7 +1491,6 @@ static int setup_ptmx(const char *dest) { static int setup_dev_console(const char *dest, const char *console) { _cleanup_umask_ mode_t u; const char *to; -struct stat st; int r; assert(dest); @@ -1489,24 +1498,18 @@ static int setup_dev_console(const char *dest, const char *console) { u = umask(); -if (stat(/dev/null, st) 0) -return log_error_errno(errno, Failed to stat /dev/null: %m); - r = chmod_and_chown(console, 0600, 0, 0); if (r 0) return log_error_errno(r, Failed to correct access mode for TTY: %m); /* We need to bind mount the right tty to /dev/console since * ptys can only exist on pts file systems. To have something - * to bind mount things on we create a device node first, and - * use /dev/null for that since we the cgroups device policy - * allows us to create that freely, while we cannot create - * /dev/console. (Note that the major minor doesn't actually - * matter here, since we mount it over anyway). */ + * to bind mount things on we create a empty regular file. */ to = strjoina(dest, /dev/console); -if (mknod(to, (st.st_mode ~0) | 0600, st.st_rdev) 0) -return log_error_errno(errno, mknod() for /dev/console failed: %m); +r = touch(to); +if (r 0) +return log_error_errno(r, touch() for /dev/console failed: %m); if (mount(console, to, NULL, MS_BIND, NULL) 0) return log_error_errno(errno, Bind mount for /dev/console failed: %m); -- 2.1.4 ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http
[systemd-devel] [PATCH] nspawn: fallback on bind mount when mknod fails
From: Alban Crequy al...@endocode.com Some systems abusively restrict mknod, even when the device node already exists in /dev. This is unfortunate because it prevents systemd-nspawn from creating the basic devices in /dev in the container. This patch implements a workaround: when mknod fails, fallback on bind mounts. Additionally, /dev/console was created with a mknod with the same major/minor as /dev/null before bind mounting a pts on it. This patch removes the mknod and creates an empty regular file instead. In order to test this patch, I used the following configuration, which I think should replicate the system with the abusive restriction on mknod: # grep devices /proc/self/cgroup 4:devices:/user.slice/restrict # cat /sys/fs/cgroup/devices/user.slice/restrict/devices.list c 1:9 r c 5:2 rw c 136:* rw # systemd-nspawn --register=false -D . --- src/nspawn/nspawn.c | 27 ++- 1 file changed, 14 insertions(+), 13 deletions(-) diff --git a/src/nspawn/nspawn.c b/src/nspawn/nspawn.c index 300b6df..09fff38 100644 --- a/src/nspawn/nspawn.c +++ b/src/nspawn/nspawn.c @@ -1449,8 +1449,17 @@ static int copy_devnodes(const char *dest) { return -r; } -if (mknod(to, st.st_mode, st.st_rdev) 0) -return log_error_errno(errno, mknod(%s) failed: %m, to); +if (mknod(to, st.st_mode, st.st_rdev) 0) { +if (errno != EPERM) +return log_error_errno(errno, mknod(%s) failed: %m, to); + +/* Some systems abusively restrict mknod but + * allow bind mounts. */ +if (touch(to) 0) +return log_error_errno(errno, touch (%s) failed: %m, to); +if (mount(from, to, bind, MS_BIND, NULL) 0) +return log_error_errno(errno, both mknod and bind mount (%s) failed: %m, to); +} if (arg_userns arg_uid_shift != UID_INVALID) if (lchown(to, arg_uid_shift, arg_uid_shift) 0) @@ -1481,7 +1490,6 @@ static int setup_ptmx(const char *dest) { static int setup_dev_console(const char *dest, const char *console) { _cleanup_umask_ mode_t u; const char *to; -struct stat st; int r; assert(dest); @@ -1489,24 +1497,17 @@ static int setup_dev_console(const char *dest, const char *console) { u = umask(); -if (stat(/dev/null, st) 0) -return log_error_errno(errno, Failed to stat /dev/null: %m); - r = chmod_and_chown(console, 0600, 0, 0); if (r 0) return log_error_errno(r, Failed to correct access mode for TTY: %m); /* We need to bind mount the right tty to /dev/console since * ptys can only exist on pts file systems. To have something - * to bind mount things on we create a device node first, and - * use /dev/null for that since we the cgroups device policy - * allows us to create that freely, while we cannot create - * /dev/console. (Note that the major minor doesn't actually - * matter here, since we mount it over anyway). */ + * to bind mount things on we create a empty regular file. */ to = strjoina(dest, /dev/console); -if (mknod(to, (st.st_mode ~0) | 0600, st.st_rdev) 0) -return log_error_errno(errno, mknod() for /dev/console failed: %m); +if (touch(to) 0) +return log_error_errno(errno, touch() for /dev/console failed: %m); if (mount(console, to, bind, MS_BIND, NULL) 0) return log_error_errno(errno, Bind mount for /dev/console failed: %m); -- 2.1.4 ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel
Re: [systemd-devel] [PATCH] nspawn: fallback on bind mount when mknod fails
On Sun, Mar 29, 2015 at 5:24 PM, Tom Gundersen t...@jklm.no wrote: On Mar 29, 2015 5:18 PM, Alban Crequy alban.cre...@gmail.com wrote: From: Alban Crequy al...@endocode.com Some systems abusively restrict mknod, even when the device node already exists in /dev. This is unfortunate because it prevents systemd-nspawn from creating the basic devices in /dev in the container. This patch implements a workaround: when mknod fails, fallback on bind mounts. Could we just always use bind mounts and avoid the two code paths? It's possible but I avoided it in order not to add to many entries in /proc/self/mounts and /proc/self/mountinfo in the normal case when mknod is not restricted. I don't have a strong opinion about this. If you think my concern about the mount entries is less important than avoiding two code paths, I can prepare another patch. Alban Tom Additionally, /dev/console was created with a mknod with the same major/minor as /dev/null before bind mounting a pts on it. This patch removes the mknod and creates an empty regular file instead. In order to test this patch, I used the following configuration, which I think should replicate the system with the abusive restriction on mknod: # grep devices /proc/self/cgroup 4:devices:/user.slice/restrict # cat /sys/fs/cgroup/devices/user.slice/restrict/devices.list c 1:9 r c 5:2 rw c 136:* rw # systemd-nspawn --register=false -D . --- src/nspawn/nspawn.c | 27 ++- 1 file changed, 14 insertions(+), 13 deletions(-) diff --git a/src/nspawn/nspawn.c b/src/nspawn/nspawn.c index 300b6df..09fff38 100644 --- a/src/nspawn/nspawn.c +++ b/src/nspawn/nspawn.c @@ -1449,8 +1449,17 @@ static int copy_devnodes(const char *dest) { return -r; } -if (mknod(to, st.st_mode, st.st_rdev) 0) -return log_error_errno(errno, mknod(%s) failed: %m, to); +if (mknod(to, st.st_mode, st.st_rdev) 0) { +if (errno != EPERM) +return log_error_errno(errno, mknod(%s) failed: %m, to); + +/* Some systems abusively restrict mknod but + * allow bind mounts. */ +if (touch(to) 0) +return log_error_errno(errno, touch (%s) failed: %m, to); +if (mount(from, to, bind, MS_BIND, NULL) 0) +return log_error_errno(errno, both mknod and bind mount (%s) failed: %m, to); +} if (arg_userns arg_uid_shift != UID_INVALID) if (lchown(to, arg_uid_shift, arg_uid_shift) 0) @@ -1481,7 +1490,6 @@ static int setup_ptmx(const char *dest) { static int setup_dev_console(const char *dest, const char *console) { _cleanup_umask_ mode_t u; const char *to; -struct stat st; int r; assert(dest); @@ -1489,24 +1497,17 @@ static int setup_dev_console(const char *dest, const char *console) { u = umask(); -if (stat(/dev/null, st) 0) -return log_error_errno(errno, Failed to stat /dev/null: %m); - r = chmod_and_chown(console, 0600, 0, 0); if (r 0) return log_error_errno(r, Failed to correct access mode for TTY: %m); /* We need to bind mount the right tty to /dev/console since * ptys can only exist on pts file systems. To have something - * to bind mount things on we create a device node first, and - * use /dev/null for that since we the cgroups device policy - * allows us to create that freely, while we cannot create - * /dev/console. (Note that the major minor doesn't actually - * matter here, since we mount it over anyway). */ + * to bind mount things on we create a empty regular file. */ to = strjoina(dest, /dev/console); -if (mknod(to, (st.st_mode ~0) | 0600, st.st_rdev) 0) -return log_error_errno(errno, mknod() for /dev/console failed: %m); +if (touch(to) 0) +return log_error_errno(errno, touch() for /dev/console failed: %m); if (mount(console, to, bind, MS_BIND, NULL) 0) return log_error_errno(errno, Bind mount for /dev/console failed: %m); -- 2.1.4 ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel
Re: [systemd-devel] Appc support in systemd-importd
On Tue, Mar 17, 2015 at 6:12 PM, Vincent Batts vba...@redhat.com wrote: On 11/03/15 15:24 +0100, Iago L?pez Galeiras wrote: Hi, We're looking into adding appc[1] support in systemd-importd. An appc image (ACI) is just a tar with a rootfs directory and a json manifest. We would have to implement image discovery and simply extract the rootfs. Some questions: * I see that the dkr implementation ignores the information in the docker manifest (resource restrictions, binary to exec...). Is that by design or just not implemented yet? Should we do the same with appc (*only* extract the rootfs)? * Alban liked the idea of saving the manifest so we can extract the whole ACI and modify nspawn to detect a container is an ACI and set /var/lib/machines/appc-image.aci/rootfs as the container's root. The benefit of keeping the manifest would be knowing which binary to start. Is that acceptable? [1]: https://github.com/appc/spec/blob/master/SPEC.md Is there a place that you are working on this logic? I'm will participate. My simple stub is https://github.com/vbatts/systemd/commit/6067a9f484af4cfeb9c4b01965edcd50ff0cffc3 rebased on master. Iago and I have a couple of branches on: https://github.com/endocode/systemd/branches - iaguis/nspawn-aci for the code in systemd-nspawn - alban/aci for the rest of the code It's not ready for submitting upstream (there are hardcoded URLs to etcd...) but it could download the aci, and nspawn could run it. I am albanc on IRC Freenode #systemd. Working together on this would be great :) Cheers, Alban ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel
[systemd-devel] [PATCH] [PATCH v2] util: add rename_noreplace
From: Alban Crequy al...@endocode.com renameat2() exists since Linux 3.15 but btrfs support for the flag RENAME_NOREPLACE was added later. This patch implements a fallback when renameat2() returns EINVAL. EINVAL is the error returned when the filesystem does not support one of the flags. --- src/import/import-raw.c| 2 +- src/import/import-tar.c| 2 +- src/import/pull-raw.c | 2 +- src/import/pull-tar.c | 2 +- src/shared/copy.c | 7 ++- src/shared/machine-image.c | 2 +- src/shared/machine-pool.c | 2 +- src/shared/util.c | 32 src/shared/util.h | 2 ++ 9 files changed, 46 insertions(+), 7 deletions(-) diff --git a/src/import/import-raw.c b/src/import/import-raw.c index 15e5eb2..e354023 100644 --- a/src/import/import-raw.c +++ b/src/import/import-raw.c @@ -245,7 +245,7 @@ static int raw_import_finish(RawImport *i) { (void) rm_rf_dangerous(i-final_path, false, true, false); } -if (renameat2(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path, RENAME_NOREPLACE) 0) +if (rename_noreplace(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path) 0) return log_error_errno(errno, Failed to move image into place: %m); free(i-temp_path); diff --git a/src/import/import-tar.c b/src/import/import-tar.c index d5b6dad..dcbe721 100644 --- a/src/import/import-tar.c +++ b/src/import/import-tar.c @@ -201,7 +201,7 @@ static int tar_import_finish(TarImport *i) { (void) rm_rf_dangerous(i-final_path, false, true, false); } -if (renameat2(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path, RENAME_NOREPLACE) 0) +if (rename_noreplace(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path) 0) return log_error_errno(errno, Failed to move image into place: %m); free(i-temp_path); diff --git a/src/import/pull-raw.c b/src/import/pull-raw.c index d1d77d5..d4b17e7 100644 --- a/src/import/pull-raw.c +++ b/src/import/pull-raw.c @@ -384,7 +384,7 @@ static void raw_pull_job_on_finished(PullJob *j) { if (r 0) goto finish; -r = renameat2(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path, RENAME_NOREPLACE); +r = rename_noreplace(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path); if (r 0) { r = log_error_errno(errno, Failed to move RAW file into place: %m); goto finish; diff --git a/src/import/pull-tar.c b/src/import/pull-tar.c index 504642f..69df7af 100644 --- a/src/import/pull-tar.c +++ b/src/import/pull-tar.c @@ -281,7 +281,7 @@ static void tar_pull_job_on_finished(PullJob *j) { if (r 0) goto finish; -if (renameat2(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path, RENAME_NOREPLACE) 0) { +if (rename_noreplace(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path) 0) { r = log_error_errno(errno, Failed to rename to final image name: %m); goto finish; } diff --git a/src/shared/copy.c b/src/shared/copy.c index 0239a58..e578d03 100644 --- a/src/shared/copy.c +++ b/src/shared/copy.c @@ -404,7 +404,12 @@ int copy_file_atomic(const char *from, const char *to, mode_t mode, bool replace if (r 0) return r; -if (renameat2(AT_FDCWD, t, AT_FDCWD, to, replace ? 0 : RENAME_NOREPLACE) 0) { +if (replace) { +r = renameat(AT_FDCWD, t, AT_FDCWD, to); +} else { +r = rename_noreplace(AT_FDCWD, t, AT_FDCWD, to); +} +if (r 0) { unlink_noerrno(t); return -errno; } diff --git a/src/shared/machine-image.c b/src/shared/machine-image.c index c6d2850..ad5bff4 100644 --- a/src/shared/machine-image.c +++ b/src/shared/machine-image.c @@ -440,7 +440,7 @@ int image_rename(Image *i, const char *new_name) { if (!nn) return -ENOMEM; -if (renameat2(AT_FDCWD, i-path, AT_FDCWD, new_path, RENAME_NOREPLACE) 0) +if (rename_noreplace(AT_FDCWD, i-path, AT_FDCWD, new_path) 0) return -errno; /* Restore the immutable bit, if it was set before */ diff --git a/src/shared/machine-pool.c b/src/shared/machine-pool.c index e7671a3..7e902a3 100644 --- a/src/shared/machine-pool.c +++ b/src/shared/machine-pool.c @@ -140,7 +140,7 @@ static int setup_machine_raw(uint64_t size, sd_bus_error *error) { goto fail; } -if (renameat2(AT_FDCWD, tmp, AT_FDCWD, /var/lib/machines.raw, RENAME_NOREPLACE) 0) { +if (rename_noreplace(AT_FDCWD, tmp, AT_FDCWD, /var/lib/machines.raw) 0) { r = sd_bus_error_set_errnof(error, errno, Failed to move /var/lib/machines.raw into place: %m); goto fail; } diff --git a/src/shared
[systemd-devel] [PATCH] [PATCH v3] util: add rename_noreplace
From: Alban Crequy al...@endocode.com renameat2() exists since Linux 3.15 but btrfs support for the flag RENAME_NOREPLACE was added later. This patch implements a fallback when renameat2() returns EINVAL. EINVAL is the error returned when the filesystem does not support one of the flags. --- src/import/import-raw.c| 5 +++-- src/import/import-tar.c| 5 +++-- src/import/pull-raw.c | 4 ++-- src/import/pull-tar.c | 5 +++-- src/shared/copy.c | 13 ++--- src/shared/machine-image.c | 5 +++-- src/shared/machine-pool.c | 5 +++-- src/shared/util.c | 41 + src/shared/util.h | 2 ++ 9 files changed, 70 insertions(+), 15 deletions(-) diff --git a/src/import/import-raw.c b/src/import/import-raw.c index 15e5eb2..c53b0e4 100644 --- a/src/import/import-raw.c +++ b/src/import/import-raw.c @@ -245,8 +245,9 @@ static int raw_import_finish(RawImport *i) { (void) rm_rf_dangerous(i-final_path, false, true, false); } -if (renameat2(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path, RENAME_NOREPLACE) 0) -return log_error_errno(errno, Failed to move image into place: %m); +r = rename_noreplace(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path); +if (r 0) +return log_error_errno(r, Failed to move image into place: %m); free(i-temp_path); i-temp_path = NULL; diff --git a/src/import/import-tar.c b/src/import/import-tar.c index d5b6dad..65262d8 100644 --- a/src/import/import-tar.c +++ b/src/import/import-tar.c @@ -201,8 +201,9 @@ static int tar_import_finish(TarImport *i) { (void) rm_rf_dangerous(i-final_path, false, true, false); } -if (renameat2(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path, RENAME_NOREPLACE) 0) -return log_error_errno(errno, Failed to move image into place: %m); +r = rename_noreplace(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path); +if (r 0) +return log_error_errno(r, Failed to move image into place: %m); free(i-temp_path); i-temp_path = NULL; diff --git a/src/import/pull-raw.c b/src/import/pull-raw.c index d1d77d5..988fc59 100644 --- a/src/import/pull-raw.c +++ b/src/import/pull-raw.c @@ -384,9 +384,9 @@ static void raw_pull_job_on_finished(PullJob *j) { if (r 0) goto finish; -r = renameat2(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path, RENAME_NOREPLACE); +r = rename_noreplace(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path); if (r 0) { -r = log_error_errno(errno, Failed to move RAW file into place: %m); +r = log_error_errno(r, Failed to move RAW file into place: %m); goto finish; } diff --git a/src/import/pull-tar.c b/src/import/pull-tar.c index 504642f..84ad571 100644 --- a/src/import/pull-tar.c +++ b/src/import/pull-tar.c @@ -281,8 +281,9 @@ static void tar_pull_job_on_finished(PullJob *j) { if (r 0) goto finish; -if (renameat2(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path, RENAME_NOREPLACE) 0) { -r = log_error_errno(errno, Failed to rename to final image name: %m); +r = rename_noreplace(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path); +if (r 0) { +r = log_error_errno(r, Failed to rename to final image name: %m); goto finish; } diff --git a/src/shared/copy.c b/src/shared/copy.c index 0239a58..edbbb3c 100644 --- a/src/shared/copy.c +++ b/src/shared/copy.c @@ -404,9 +404,16 @@ int copy_file_atomic(const char *from, const char *to, mode_t mode, bool replace if (r 0) return r; -if (renameat2(AT_FDCWD, t, AT_FDCWD, to, replace ? 0 : RENAME_NOREPLACE) 0) { -unlink_noerrno(t); -return -errno; +if (replace) { +r = renameat(AT_FDCWD, t, AT_FDCWD, to); +if (r 0) +r = -errno; +} else { +r = rename_noreplace(AT_FDCWD, t, AT_FDCWD, to); +} +if (r 0) { +(void) unlink_noerrno(t); +return r; } return 0; diff --git a/src/shared/machine-image.c b/src/shared/machine-image.c index c6d2850..440a569 100644 --- a/src/shared/machine-image.c +++ b/src/shared/machine-image.c @@ -440,8 +440,9 @@ int image_rename(Image *i, const char *new_name) { if (!nn) return -ENOMEM; -if (renameat2(AT_FDCWD, i-path, AT_FDCWD, new_path, RENAME_NOREPLACE) 0) -return -errno; +r = rename_noreplace(AT_FDCWD, i-path, AT_FDCWD, new_path); +if (r 0) +return r
[systemd-devel] [PATCH] util: add rename_noreplace
From: Alban Crequy al...@endocode.com renameat2() exists since Linux 3.15 but btrfs support for the flag RENAME_NOREPLACE was added later. This patch implements a fallback when renameat2() returns EINVAL. EINVAL is the error returned when the filesystem does not support one of the flags. --- src/import/import-raw.c| 2 +- src/import/import-tar.c| 2 +- src/import/pull-raw.c | 2 +- src/import/pull-tar.c | 2 +- src/shared/copy.c | 13 ++--- src/shared/machine-image.c | 2 +- src/shared/machine-pool.c | 2 +- src/shared/util.c | 34 ++ src/shared/util.h | 2 ++ 9 files changed, 52 insertions(+), 9 deletions(-) diff --git a/src/import/import-raw.c b/src/import/import-raw.c index 15e5eb2..e354023 100644 --- a/src/import/import-raw.c +++ b/src/import/import-raw.c @@ -245,7 +245,7 @@ static int raw_import_finish(RawImport *i) { (void) rm_rf_dangerous(i-final_path, false, true, false); } -if (renameat2(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path, RENAME_NOREPLACE) 0) +if (rename_noreplace(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path) 0) return log_error_errno(errno, Failed to move image into place: %m); free(i-temp_path); diff --git a/src/import/import-tar.c b/src/import/import-tar.c index d5b6dad..dcbe721 100644 --- a/src/import/import-tar.c +++ b/src/import/import-tar.c @@ -201,7 +201,7 @@ static int tar_import_finish(TarImport *i) { (void) rm_rf_dangerous(i-final_path, false, true, false); } -if (renameat2(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path, RENAME_NOREPLACE) 0) +if (rename_noreplace(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path) 0) return log_error_errno(errno, Failed to move image into place: %m); free(i-temp_path); diff --git a/src/import/pull-raw.c b/src/import/pull-raw.c index d1d77d5..d4b17e7 100644 --- a/src/import/pull-raw.c +++ b/src/import/pull-raw.c @@ -384,7 +384,7 @@ static void raw_pull_job_on_finished(PullJob *j) { if (r 0) goto finish; -r = renameat2(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path, RENAME_NOREPLACE); +r = rename_noreplace(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path); if (r 0) { r = log_error_errno(errno, Failed to move RAW file into place: %m); goto finish; diff --git a/src/import/pull-tar.c b/src/import/pull-tar.c index 504642f..69df7af 100644 --- a/src/import/pull-tar.c +++ b/src/import/pull-tar.c @@ -281,7 +281,7 @@ static void tar_pull_job_on_finished(PullJob *j) { if (r 0) goto finish; -if (renameat2(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path, RENAME_NOREPLACE) 0) { +if (rename_noreplace(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path) 0) { r = log_error_errno(errno, Failed to rename to final image name: %m); goto finish; } diff --git a/src/shared/copy.c b/src/shared/copy.c index 0239a58..458a654 100644 --- a/src/shared/copy.c +++ b/src/shared/copy.c @@ -404,9 +404,16 @@ int copy_file_atomic(const char *from, const char *to, mode_t mode, bool replace if (r 0) return r; -if (renameat2(AT_FDCWD, t, AT_FDCWD, to, replace ? 0 : RENAME_NOREPLACE) 0) { -unlink_noerrno(t); -return -errno; +if (replace) { +if (renameat(AT_FDCWD, t, AT_FDCWD, to) 0) { +unlink_noerrno(t); +return -errno; +} +} else { +if (rename_noreplace(AT_FDCWD, t, AT_FDCWD, to) 0) { +unlink_noerrno(t); +return -errno; +} } return 0; diff --git a/src/shared/machine-image.c b/src/shared/machine-image.c index c6d2850..ad5bff4 100644 --- a/src/shared/machine-image.c +++ b/src/shared/machine-image.c @@ -440,7 +440,7 @@ int image_rename(Image *i, const char *new_name) { if (!nn) return -ENOMEM; -if (renameat2(AT_FDCWD, i-path, AT_FDCWD, new_path, RENAME_NOREPLACE) 0) +if (rename_noreplace(AT_FDCWD, i-path, AT_FDCWD, new_path) 0) return -errno; /* Restore the immutable bit, if it was set before */ diff --git a/src/shared/machine-pool.c b/src/shared/machine-pool.c index e7671a3..7e902a3 100644 --- a/src/shared/machine-pool.c +++ b/src/shared/machine-pool.c @@ -140,7 +140,7 @@ static int setup_machine_raw(uint64_t size, sd_bus_error *error) { goto fail; } -if (renameat2(AT_FDCWD, tmp, AT_FDCWD, /var/lib/machines.raw, RENAME_NOREPLACE) 0) { +if (rename_noreplace(AT_FDCWD, tmp
Re: [systemd-devel] Notification socket and chroot vs PrivateNetwork conflict (abstract vs file-system)
On 9 December 2014 at 17:28, Lennart Poettering lenn...@poettering.net wrote: On Tue, 09.12.14 16:24, Krzysztof Kotlenga (k.kotle...@sims.pl) wrote: Hi. Currently notify socket is unavailable in chrooted services (again) unless you bind mount it there. Is there perhaps another, less cumbersome way? So far notify socket was: 1. abstract socket commit 8c47c7325fa1ab72febf807f8831ff24c75fbf45 notify: add minimal readiness/status protocol for spawned daemons 2. file-system socket commit 91b22f21f3824c1766d34f622c5bbb70cbe881a8 core: move abstract namespace sockets to /dev/.run Now that we have /dev/.run there's no need to use abstract namespace sockets. So, let's move things to /dev/.run, to make things more easily discoverable and improve compat with chroot() and fs namespacing. 3. abstract socket again commit 29252e9e5bad3b0bcfc45d9bc761aee4b0ece1da manager: turn notify socket into abstract namespace socket again sd_notify() should work for daemons that chroot() as part of their initilization, hence it's a good idea to use an abstract namespace socket which is not affected by chroot. 4. file-system socket again commit 7181dbdb2e3112858d62bdaea4f0ad2ed685ccba core: move notify sockets to /run and $XDG_RUNTIME_DIR A service with PrivateNetwork= cannot access abstract namespace sockets of the host anymore, hence let's better not use abstract namespace sockets for this, since we want to make sure that PrivateNetwork= is useful and doesn't break sd_notify(). So... would it be acceptable to have two notify sockets, one abstract and one normal, the latter only set for services with PrivateNetwork or - better maybe - explicitly selectable? Any other ideas? Hmm, but what would you do for a service that has both PrivateNetwork and chroot enabled? I am all open for shifting things around again, but I inda would prefer a solution that works universally in the end... Ideas? I figure we could open a new mount namespace and mount the file system socket into the chroot, but not sure I like the idea... Maybe that's the way to do it... but where would you bind mount the socket file? in $CHROOT/tmp which should be writeable when PrivateTmp=true? Of course it will not work if the daemon is doing the chroot itself instead of relying on systemd's RootDirectory. The same problem exists even without using PrivateNetwork/RootDirectory when the service starts a container with nspawn --private-network and the program inside the container wants to notify when it's ready. This has the same root cause: the service runs in a new mount/chroot and a new network namespace. There is also the additional problem that the program inside the container runs in a different cgroup (/system.slice/docker-... for docker containers, or /machine.slice... for nspawn containers). There is the tool sdnotify-proxy to proxy the notify socket from systemd to a socket file which can be bind mounted in the container. sdnotify-proxy works, but I would like to know if someone finds a better way for containers :) Alban ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel
Re: [systemd-devel] Notification socket and chroot vs PrivateNetwork conflict (abstract vs file-system)
On 9 December 2014 at 17:28, Lennart Poettering lenn...@poettering.net wrote: On Tue, 09.12.14 16:24, Krzysztof Kotlenga (k.kotle...@sims.pl) wrote: Hi. Currently notify socket is unavailable in chrooted services (again) unless you bind mount it there. Is there perhaps another, less cumbersome way? So far notify socket was: 1. abstract socket commit 8c47c7325fa1ab72febf807f8831ff24c75fbf45 notify: add minimal readiness/status protocol for spawned daemons 2. file-system socket commit 91b22f21f3824c1766d34f622c5bbb70cbe881a8 core: move abstract namespace sockets to /dev/.run Now that we have /dev/.run there's no need to use abstract namespace sockets. So, let's move things to /dev/.run, to make things more easily discoverable and improve compat with chroot() and fs namespacing. 3. abstract socket again commit 29252e9e5bad3b0bcfc45d9bc761aee4b0ece1da manager: turn notify socket into abstract namespace socket again sd_notify() should work for daemons that chroot() as part of their initilization, hence it's a good idea to use an abstract namespace socket which is not affected by chroot. 4. file-system socket again commit 7181dbdb2e3112858d62bdaea4f0ad2ed685ccba core: move notify sockets to /run and $XDG_RUNTIME_DIR A service with PrivateNetwork= cannot access abstract namespace sockets of the host anymore, hence let's better not use abstract namespace sockets for this, since we want to make sure that PrivateNetwork= is useful and doesn't break sd_notify(). So... would it be acceptable to have two notify sockets, one abstract and one normal, the latter only set for services with PrivateNetwork or - better maybe - explicitly selectable? Any other ideas? Hmm, but what would you do for a service that has both PrivateNetwork and chroot enabled? I am all open for shifting things around again, but I inda would prefer a solution that works universally in the end... Ideas? I figure we could open a new mount namespace and mount the file system socket into the chroot, but not sure I like the idea... I don't know what is the best solution but using a socket file seems better than using an abstract unix socket: processes in a systemd-nspawn container could discover the notify socket of the host in /proc/net/unix (if it does not use a new network namespace) and send garbage file descriptors with SCM_RIGHTS from the container to the host. Systemd on the host does the right thing: it closes the fds when the datagram was not sent by a managed unit. Does manager_get_unit_by_pid() matches the exact cgroup path only or does it also matches a prefix path? I wonder about nspawn containers started by a service unit on the host. ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel
[systemd-devel] [PATCH] mount: create mount point correctly in case of bind mount
From: Alban Crequy al...@endocode.com Manpage systemd.mount(5) says: If the mount point does not exist at the time of mounting, it is created. However, it was not working for bind mounts of non-directory files (regular, device, socket, etc). This patch checks the type of the resource to bind mount and does a touch instead of a mkdir when necessary. --- src/core/mount.c | 32 1 file changed, 28 insertions(+), 4 deletions(-) diff --git a/src/core/mount.c b/src/core/mount.c index f3977e6..5a373ba 100644 --- a/src/core/mount.c +++ b/src/core/mount.c @@ -904,19 +904,43 @@ fail: static void mount_enter_mounting(Mount *m) { int r; MountParameters *p; +bool is_bind; +bool is_file_bind; assert(m); m-control_command_id = MOUNT_EXEC_MOUNT; m-control_command = m-exec_command + MOUNT_EXEC_MOUNT; -mkdir_p_label(m-where, m-directory_mode); +p = get_mount_parameters_fragment(m); + +/* Bind-mounts: A directory can be bind-mounted only on directories. A + * non-directory file (regular, device, socket etc.) can be bind + * mounted on any non-directory file (even of a different kind). + * + * Non-bind mounts: they are only mounted on directories. + */ + +is_bind = p mount_is_bind(p); +is_file_bind = false; +if (is_bind) { +struct stat st; + +/* both stat(2) and mount(8) follow symlinks, so it's fine */ +r = stat(p-what, st); +if (r == 0 !S_ISDIR(st.st_mode)) +is_file_bind = true; +} -warn_if_dir_nonempty(m-meta.id, m-where); +if (is_file_bind) { +touch_file(m-where, true, USEC_INFINITY, 0, 0, m-directory_mode); +} else { +mkdir_p_label(m-where, m-directory_mode); +warn_if_dir_nonempty(m-meta.id, m-where); +} /* Create the source directory for bind-mounts if needed */ -p = get_mount_parameters_fragment(m); -if (p mount_is_bind(p)) +if (is_bind) mkdir_p_label(p-what, m-directory_mode); r = fail_if_symlink(m-meta.id, m-where); -- 2.1.4 ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel
[systemd-devel] [PATCH] PrivateDevices: fix /dev mount when a service is chrooted
When a service uses both RootDirectory=/opt/... and PrivateDevices=true, the private /dev must not be mounted in /dev but in /opt/.../dev. --- src/core/execute.c | 6 +- src/core/namespace.c | 10 +- src/core/namespace.h | 2 +- src/test/test-ns.c | 2 +- 4 files changed, 12 insertions(+), 8 deletions(-) diff --git a/src/core/execute.c b/src/core/execute.c index 1815e3d..69b7990 100644 --- a/src/core/execute.c +++ b/src/core/execute.c @@ -1571,6 +1571,7 @@ static int exec_child( context-protect_home != PROTECT_HOME_NO) { char *tmp = NULL, *var = NULL; +char private_dev_dir[512]; /* The runtime struct only contains the parent * of the private /tmp, which is @@ -1585,6 +1586,9 @@ static int exec_child( var = strjoina(runtime-var_tmp_dir, /tmp); } +if (params-apply_chroot context-root_directory) +snprintf(private_dev_dir, sizeof(private_dev_dir) - 1, %s/dev, context-root_directory); + r = setup_namespace( context-read_write_dirs, context-read_only_dirs, @@ -1592,7 +1596,7 @@ static int exec_child( tmp, var, params-bus_endpoint_path, -context-private_devices, +params-apply_chroot context-private_devices ? private_dev_dir : NULL, context-protect_home, context-protect_system, context-mount_flags); diff --git a/src/core/namespace.c b/src/core/namespace.c index 4fecd32..7a6561a 100644 --- a/src/core/namespace.c +++ b/src/core/namespace.c @@ -234,7 +234,7 @@ static int mount_dev(BindMount *m) { dev_setup(temporary_mount); -if (mount(dev, /dev/, NULL, MS_MOVE, NULL) 0) { +if (mount(dev, m-path, NULL, MS_MOVE, NULL) 0) { r = -errno; goto fail; } @@ -419,7 +419,7 @@ int setup_namespace( const char* tmp_dir, const char* var_tmp_dir, const char* bus_endpoint_path, -bool private_dev, +const char* private_dev_dir, ProtectHome protect_home, ProtectSystem protect_system, unsigned long mount_flags) { @@ -438,7 +438,7 @@ int setup_namespace( strv_length(read_write_dirs) + strv_length(read_only_dirs) + strv_length(inaccessible_dirs) + -private_dev + +!!private_dev_dir + (protect_home != PROTECT_HOME_NO ? 3 : 0) + (protect_system != PROTECT_SYSTEM_NO ? 2 : 0) + (protect_system == PROTECT_SYSTEM_FULL ? 1 : 0); @@ -469,8 +469,8 @@ int setup_namespace( m++; } -if (private_dev) { -m-path = /dev; +if (private_dev_dir) { +m-path = private_dev_dir; m-mode = PRIVATE_DEV; m++; } diff --git a/src/core/namespace.h b/src/core/namespace.h index 42b92e7..4b06802 100644 --- a/src/core/namespace.h +++ b/src/core/namespace.h @@ -47,7 +47,7 @@ int setup_namespace(char **read_write_dirs, const char *tmp_dir, const char *var_tmp_dir, const char *endpoint_path, -bool private_dev, +const char *private_dev_dir, ProtectHome protect_home, ProtectSystem protect_system, unsigned long mount_flags); diff --git a/src/test/test-ns.c b/src/test/test-ns.c index 7cd7b77..d973d0d 100644 --- a/src/test/test-ns.c +++ b/src/test/test-ns.c @@ -60,7 +60,7 @@ int main(int argc, char *argv[]) { tmp_dir, var_tmp_dir, NULL, -true, +/dev, PROTECT_HOME_NO, PROTECT_SYSTEM_NO, 0); -- 2.1.4 ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel
Re: [systemd-devel] systemd-nspawn create container under unprivileged user
On 5 February 2015 at 12:48, Vasiliy Tolstov v.tols...@selfip.ru wrote: 2015-02-05 12:44 GMT+03:00 Alban Crequy alban.cre...@gmail.com: Manual page namespaces(7): Creation of new namespaces using clone(2) and unshare(2) in most cases requires the CAP_SYS_ADMIN capability. User namespaces are the exception: since Linux 3.8, no privilege is required to create a user namespace. So as i understand i can't create full featured container with network under non root user (and not have cap_sys_admin) caps like CAP_SYS_ADMIN don't have an global meaning anymore but refers to operations a process can do *in its current namespace*. An unprivileged process (uid!=0, without cap_sys_admin) can join a user namespace and get uid=0 cap_sys_admin for operations inside the user namespace, but it will still have uid!=0 !cap_sys_admin for operations in the parent user namespace. user_namespaces(7) contains userns_child_exec.c and it creates a fully featured container with network without being root. (I attached a patched version I was testing) # # Because I'm using the kernel patched by my distribution # echo 1 /proc/sys/kernel/unprivileged_userns_clone $ gcc -lcap -o userns_child_exec userns_child_exec.c Here it seems to work: alban@alban:~$ ls -l /tmp/userns_child_exec -rwxr-xr-x 1 alban alban 14488 Feb 5 23:24 /tmp/userns_child_exec alban@alban:~$ id -u 1000 alban@alban:~$ ip link # --- will show lo, eth0, wlan0... alban@alban:~$ /tmp/userns_child_exec -p -m -U -M '0 1000 1' -G '0 1000 1' -n bash About to exec bash root@alban:~# id uid=0(root) gid=0(root) groups=0(root),65534(nogroup) root@alban:~# ip link # --- only lo visible in this namespace Cheers, Alban --- userns_child_exec.orig.c 2015-02-05 23:20:19.208741366 +0100 +++ userns_child_exec.c 2015-01-30 17:01:56.948493001 +0100 @@ -108,6 +108,30 @@ close(fd); } +static void +write_file(char *content, char *path) +{ +int fd; +size_t content_len; + +content_len = strlen(content); + +fd = open(path, O_RDWR); +if (fd == -1) { +fprintf(stderr, ERROR: open %s: %s\n, path, +strerror(errno)); +exit(EXIT_FAILURE); +} + +if (write(fd, content, content_len) != content_len) { +fprintf(stderr, ERROR: write %s: %s\n, content, +strerror(errno)); +exit(EXIT_FAILURE); +} + +close(fd); +} + static int /* Start function for cloned child */ childFunc(void *arg) { @@ -149,6 +173,7 @@ const int MAP_BUF_SIZE = 100; char map_buf[MAP_BUF_SIZE]; char map_path[PATH_MAX]; +char groups_path[PATH_MAX]; /* Parse command-line options. The initial '+' character in the final getopt() argument prevents GNU-style permutation @@ -225,6 +250,11 @@ update_map(uid_map, map_path); } if (gid_map != NULL || map_zero) { +snprintf(groups_path, PATH_MAX, /proc/%ld/setgroups, +(long) child_pid); +write_file(deny\n, groups_path); +} +if (gid_map != NULL || map_zero) { snprintf(map_path, PATH_MAX, /proc/%ld/gid_map, (long) child_pid); if (map_zero) { /* userns_child_exec.c Licensed under GNU General Public License v2 or later Create a child process that executes a shell command in new namespace(s); allow UID and GID mappings to be specified when creating a user namespace. */ #define _GNU_SOURCE #include sched.h #include unistd.h #include stdlib.h #include sys/wait.h #include signal.h #include fcntl.h #include stdio.h #include string.h #include limits.h #include errno.h /* A simple error-handling function: print an error message based on the value in 'errno' and terminate the calling process */ #define errExit(msg)do { perror(msg); exit(EXIT_FAILURE); \ } while (0) struct child_args { char **argv;/* Command to be executed by child, with args */ intpipe_fd[2]; /* Pipe used to synchronize parent and child */ }; static int verbose; static void usage(char *pname) { fprintf(stderr, Usage: %s [options] cmd [arg...]\n\n, pname); fprintf(stderr, Create a child process that executes a shell command in a new user namespace,\n and possibly also other new namespace(s).\n\n); fprintf(stderr, Options can be:\n\n); #define fpe(str) fprintf(stderr, %s, str); fpe(-i New IPC namespace\n); fpe(-m New mount namespace\n); fpe(-n New network namespace\n); fpe(-p New PID namespace\n); fpe(-u New UTS namespace\n); fpe(-U New user namespace\n); fpe(-M uid_map Specify UID map for user namespace\n); fpe(-G gid_map Specify GID map for user namespace\n); fpe(-z Map user's UID and GID to 0 in user namespace\n); fpe((equivalent to: -M '0 uid 1' -G '0 gid 1')\n); fpe(-v Display verbose messages\n); fpe(\n
Re: [systemd-devel] systemd-nspawn create container under unprivileged user
[reposting - sorry I forgot to Cc the mailing list] On 4 February 2015 at 23:03, Vasiliy Tolstov v.tols...@selfip.ru wrote: Hello! Does it possible to create container as regular user? Oh what capabilities i need to add to create container not using root? Hello, Manual page namespaces(7): Creation of new namespaces using clone(2) and unshare(2) in most cases requires the CAP_SYS_ADMIN capability. User namespaces are the exception: since Linux 3.8, no privilege is required to create a user namespace. systemd-nspawn uses: src/nspawn/nspawn.c: pid = raw_clone(SIGCHLD|CLONE_NEWNS| (arg_share_system ? 0 : CLONE_NEWIPC|CLONE_NEWPID|CLONE_NEWUTS)| (arg_private_network ? CLONE_NEWNET : 0), NULL); So you need to have CAP_SYS_ADMIN to use systemd-nspawn. If you want to try user namespaces, it is something that is still moving... Manual page user_namespaces(7): Starting in Linux 3.8, unprivileged processes can create user namespaces, and mount, PID, IPC, network, and UTS namespaces can be created with just the CAP_SYS_ADMIN capability in the caller's user namespace. But it is not true in most Linux distributions as they disable unprivileged user namespaces and require CAP_SYS_ADMIN anyway. See for example: http://anonscm.debian.org/viewvc/kernel/dists/trunk/linux/debian/patches/debian/add-sysctl-to-disallow-unprivileged-CLONE_NEWUSER-by-default.patch?revision=20773view=markup and: echo 1 /proc/sys/kernel/unprivileged_userns_clone Additionally, the program userns_child_exec.c included in manual page namespaces(7) does not work as is yet because since the changes introduced by CVE-2014-8989, it needs to adjust /proc/pid/setgroups. See: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=66d2f338ee4c449396b6f99f5e75cd18eb6df272 Cheers, Alban ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel
[systemd-devel] [PATCH] nspawn: allow bind-mounting char and block files
From: Alban Crequy al...@endocode.com --- src/nspawn/nspawn.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/src/nspawn/nspawn.c b/src/nspawn/nspawn.c index 3fce3ad..db57b24 100644 --- a/src/nspawn/nspawn.c +++ b/src/nspawn/nspawn.c @@ -911,8 +911,7 @@ static int mount_binds(const char *dest, char **l, bool ro) { return -errno; } -/* Create the mount point, but be conservative -- refuse to create block - * and char devices. */ +/* Create the mount point */ if (S_ISDIR(source_st.st_mode)) { r = mkdir_label(where, 0755); if (r 0 errno != EEXIST) @@ -929,6 +928,10 @@ static int mount_binds(const char *dest, char **l, bool ro) { r = touch(where); if (r 0) return log_error_errno(r, Failed to create mount point %s: %m, where); +} else if (S_ISCHR(source_st.st_mode) || !S_ISBLK(source_st.st_mode)) { +r = mknod(where, source_st.st_mode, source_st.st_rdev) 0; +if (r 0 errno != EEXIST) +return log_error_errno(errno, Failed to create mount point %s: %m, where); } else { log_error(Refusing to create mountpoint for file: %s, *x); return -ENOTSUP; -- 2.1.4 ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel
[systemd-devel] [PATCH v2] nspawn: allow bind-mounting char and block files
From: Alban Crequy al...@endocode.com v2: - simplify the patch: any non-directory file can be mounted on any non-directory file. - allow bind mount of files of different types --- src/nspawn/nspawn.c | 27 +++ 1 file changed, 11 insertions(+), 16 deletions(-) diff --git a/src/nspawn/nspawn.c b/src/nspawn/nspawn.c index 3fce3ad..2736c0a 100644 --- a/src/nspawn/nspawn.c +++ b/src/nspawn/nspawn.c @@ -898,8 +898,12 @@ static int mount_binds(const char *dest, char **l, bool ro) { r = stat(where, dest_st); if (r == 0) { -if ((source_st.st_mode S_IFMT) != (dest_st.st_mode S_IFMT)) { -log_error(The file types of %s and %s do not match. Refusing bind mount, *x, where); +if (S_ISDIR(source_st.st_mode) !S_ISDIR(dest_st.st_mode)) { +log_error(Cannot bind mount directory %s on file %s, *x, where); +return -EINVAL; +} +if (!S_ISDIR(source_st.st_mode) S_ISDIR(dest_st.st_mode)) { +log_error(Cannot bind mount file %s on directory %s, *x, where); return -EINVAL; } } else if (errno == ENOENT) { @@ -911,27 +915,18 @@ static int mount_binds(const char *dest, char **l, bool ro) { return -errno; } -/* Create the mount point, but be conservative -- refuse to create block - * and char devices. */ +/* Create the mount point. Any non-directory file can be + * mounted on any non-directory file (regular, fifo, socket, + * char, block). + */ if (S_ISDIR(source_st.st_mode)) { r = mkdir_label(where, 0755); if (r 0 errno != EEXIST) return log_error_errno(r, Failed to create mount point %s: %m, where); -} else if (S_ISFIFO(source_st.st_mode)) { -r = mkfifo(where, 0644); -if (r 0 errno != EEXIST) -return log_error_errno(errno, Failed to create mount point %s: %m, where); -} else if (S_ISSOCK(source_st.st_mode)) { -r = mknod(where, 0644 | S_IFSOCK, 0); -if (r 0 errno != EEXIST) -return log_error_errno(errno, Failed to create mount point %s: %m, where); -} else if (S_ISREG(source_st.st_mode)) { +} else { r = touch(where); if (r 0) return log_error_errno(r, Failed to create mount point %s: %m, where); -} else { -log_error(Refusing to create mountpoint for file: %s, *x); -return -ENOTSUP; } if (mount(*x, where, bind, MS_BIND, NULL) 0) -- 2.1.4 ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel
Re: [systemd-devel] [PATCH] nspawn: allow bind-mounting char and block files
On 22 January 2015 at 13:51, Lennart Poettering lenn...@poettering.net wrote: On Thu, 22.01.15 13:25, Alban Crequy (mua...@gmail.com) wrote: From: Alban Crequy al...@endocode.com Hmm, I wonder if we can actually simplify this. IIRC the rules for over-mounting are simpler than I thought initially: a) dirs can only over-mount dirs b) everything else can over-mount everything else With that in mind I think we can collapse this code to only have two branches: one branch for the S_ISDIR() case, and another one that uses touch() for everything else. Anychance you can simplify the patch like this? The benefit would be that we can do without CAP_SYS_MKNOD for all of this. Also, your patch woud then shorten the code, while adding a feature, not make it longer! The patch will be a bit longer because the file type checks in mount_binds() need to be updated. Otherwise, the second attempt of running nspawn would fail. I will send the patch v2 shortly. --- src/nspawn/nspawn.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/src/nspawn/nspawn.c b/src/nspawn/nspawn.c index 3fce3ad..db57b24 100644 --- a/src/nspawn/nspawn.c +++ b/src/nspawn/nspawn.c @@ -911,8 +911,7 @@ static int mount_binds(const char *dest, char **l, bool ro) { return -errno; } -/* Create the mount point, but be conservative -- refuse to create block - * and char devices. */ +/* Create the mount point */ if (S_ISDIR(source_st.st_mode)) { r = mkdir_label(where, 0755); if (r 0 errno != EEXIST) @@ -929,6 +928,10 @@ static int mount_binds(const char *dest, char **l, bool ro) { r = touch(where); if (r 0) return log_error_errno(r, Failed to create mount point %s: %m, where); +} else if (S_ISCHR(source_st.st_mode) || !S_ISBLK(source_st.st_mode)) { +r = mknod(where, source_st.st_mode, source_st.st_rdev) 0; +if (r 0 errno != EEXIST) +return log_error_errno(errno, Failed to create mount point %s: %m, where); } else { log_error(Refusing to create mountpoint for file: %s, *x); return -ENOTSUP; -- 2.1.4 ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel Lennart -- Lennart Poettering, Red Hat ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel
[systemd-devel] [PATCH] bus.h: add missing include to linux/kref.h
Symptoms: kdbus/bus.h:56:14: error: field ‘kref’ has incomplete type struct kref kref; Signed-off-by: Alban Crequy alban.cre...@collabora.co.uk --- bus.h | 1 + 1 file changed, 1 insertion(+) diff --git a/bus.h b/bus.h index a5832b8..c7ce2fa 100644 --- a/bus.h +++ b/bus.h @@ -16,6 +16,7 @@ #include linux/hashtable.h #include linux/spinlock.h #include linux/idr.h +#include linux/kref.h #include util.h -- 1.8.5.3 ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel
[systemd-devel] [PATCH] [RFC] [WIP] [kdbus] Attempt to recursively pass fd
Before Linux commit 25888e (from 2.6.37-rc4, Nov 2010), fd-passing on Unix sockets could recursively be stacked, allowing a process to exhaust the open files limit (/proc/sys/fs/file-max) on the system without restriction from ulimit -n. This DoS on Unix sockets was fixed by commit: commit 25888e30319f8896fc656fc68643e6a078263060 Author: Eric Dumazet eric.duma...@gmail.com Date: Thu Nov 25 04:11:39 2010 + af_unix: limit recursion level But that commit introduced a bug in dbus: https://bugs.freedesktop.org/show_bug.cgi?id=80163 kdbus does not use fd-passing on Unix sockets so it is not affected by this. However, it allows fd-passing similarly. This patch shows it is possible to recursively pass file descriptors in kdbus and stack them without keeping them attached to the initial process. I could stack passed fds 256 times, probably because of the limit KDBUS_USER_MAX_CONN: defaults.h:#define KDBUS_USER_MAX_CONN 256 But this limit could probably be overrided by using fds from different endpoints. I am also afraid that fds from Unix sockets and fds from kdbus could be stacked together, defeating the limit implemented in unix_attach_fds() by commit 25888e because the check uses unix_get_socket(), making the assumption that only Unix sockets could carry file descriptors: http://lxr.free-electrons.com/source/net/unix/af_unix.c#L1380 http://lxr.free-electrons.com/source/net/unix/garbage.c#L99 --- test/test-kdbus.c | 72 +++ 1 file changed, 72 insertions(+) diff --git a/test/test-kdbus.c b/test/test-kdbus.c index f0bf705..3c79877 100644 --- a/test/test-kdbus.c +++ b/test/test-kdbus.c @@ -12,6 +12,8 @@ #include limits.h #include sys/mman.h #include sys/ioctl.h +#include sys/types.h +#include sys/socket.h #include getopt.h #include stdbool.h @@ -1004,6 +1006,75 @@ static int check_msg_basic(struct kdbus_check_env *env) return CHECK_OK; } +static int send_fds(struct conn *conn, uint64_t dst_id, int fds[2]) +{ + struct kdbus_msg *msg; + struct kdbus_item *item; + uint64_t size; + int ret; + + size = sizeof(struct kdbus_msg); + size += KDBUS_ITEM_SIZE(sizeof(int[2])); + + msg = malloc(size); + ASSERT_RETURN (msg != NULL); + + memset(msg, 0, size); + msg-size = size; + msg-src_id = conn-id; + msg-dst_id = dst_id; + msg-payload_type = KDBUS_PAYLOAD_DBUS; + + item = msg-items; + + item-type = KDBUS_ITEM_FDS; + item-size = KDBUS_ITEM_HEADER_SIZE + sizeof(int[2]); + item-fds[0] = fds[0]; + item-fds[1] = fds[1]; + item = KDBUS_ITEM_NEXT(item); + + ret = ioctl(conn-fd, KDBUS_CMD_MSG_SEND, msg); + if (ret) { + fprintf(stderr, error sending message: %d err %d (%m)\n, ret, errno); + return EXIT_FAILURE; + } + + free(msg); + + return 0; +} + +static int check_fds_passing(struct kdbus_check_env *env) +{ + struct conn *conn_src, *conn_dst; + int fds[2]; + int ret; + int i; + + /* create two connections */ + conn_src = kdbus_hello(env-buspath, 0, NULL, 0); + ASSERT_RETURN(conn_src != NULL); + + for (i = 0; i = 0; i++) { + conn_dst = kdbus_hello(env-buspath, 0, NULL, 0); + ASSERT_RETURN(conn_dst != NULL); + + fds[0] = conn_src-fd; + fds[1] = conn_dst-fd; + + ret = send_fds (conn_src, conn_dst-id, fds); + printf(check_fds_passing: iter: %d fds %d-%d ret %d err %d (%m)\n, i, fds[0], fds[1], ret, errno); + ASSERT_RETURN(ret == 0); + + close(conn_src-fd); + free(conn_src); + + conn_src = conn_dst; + } + + return CHECK_OK; +} + static int check_msg_free(struct kdbus_check_env *env) { int ret; @@ -,6 +1182,7 @@ static const struct kdbus_check checks[] = { { name queue, check_name_queue, CHECK_CREATE_BUS | CHECK_CREATE_CONN}, { message basic, check_msg_basic, CHECK_CREATE_BUS | CHECK_CREATE_CONN}, { message free, check_msg_free, CHECK_CREATE_BUS | CHECK_CREATE_CONN}, + { fds passing,check_fds_passing, CHECK_CREATE_BUS}, { connection info,check_conn_info, CHECK_CREATE_BUS | CHECK_CREATE_CONN}, { match id add, check_match_id_add, CHECK_CREATE_BUS | CHECK_CREATE_CONN}, { match id remove,check_match_id_remove, CHECK_CREATE_BUS | CHECK_CREATE_CONN}, -- 1.8.5.3 ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel
Re: [systemd-devel] User sessions: limit the ability to migrate cgroups
On Wed, 13 Aug 2014 16:37:17 +0200 Lennart Poettering lenn...@poettering.net wrote: On Thu, 07.08.14 15:19, Alban Crequy (alban.cre...@collabora.co.uk) wrote: Hi, Should unprivileged processes be allowed to change cgroup? Well, they shouldn#t do it. But I think it's OK as long as this is only done within the specific user's hierarchies. As I understand it, it is not possible to block processes to leave a cgroup, but only to block processes to enter a cgroup. Correct. In the following example, session-c4.scope/tasks belongs to root:root with -rw-r--r-- and user@1000.service/tasks belongs to user:user with -rw-r--r--. Yes, this is because systemd --user needs to be able to manage its own cgroup subtree, so we have to open this up for the user@1000.service service, but keep it restricted otherwise... It makes sense. So processes can freely move from session-c4.scope to user@1000.service. But not in the other direction. Correct. $ systemd-cgls Working Directory /sys/fs/cgroup/systemd/user.slice/user-1000.slice: ├─session-c4.scope │ ├─713 sshd: user [priv] │ ├─722 sshd: user@pts/2 │ ├─723 -bash │ ├─732 systemd-cgls │ └─733 pager ├─user@1000.service │ ├─406 /lib/systemd/systemd --user With user sessions managed by systemd, will it be possible to restrict unprivileged users from migrating to other cgroups? Unlikely. Access control on Unix is generally bound to user IDs, not processes, and we really shouldn't start here with departing from that... I tested SELinux and AppArmor to restrict /sys/fs/cgroup/. SELinux didn't help because the cgroup file system does not support extended attributes such as security.selinux, but AppArmor was able to block an application from changing cgroup. Alban ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel
[systemd-devel] User sessions: limit the ability to migrate cgroups
Hi, Should unprivileged processes be allowed to change cgroup? As I understand it, it is not possible to block processes to leave a cgroup, but only to block processes to enter a cgroup. In the following example, session-c4.scope/tasks belongs to root:root with -rw-r--r-- and user@1000.service/tasks belongs to user:user with -rw-r--r--. So processes can freely move from session-c4.scope to user@1000.service. But not in the other direction. $ systemd-cgls Working Directory /sys/fs/cgroup/systemd/user.slice/user-1000.slice: ├─session-c4.scope │ ├─713 sshd: user [priv] │ ├─722 sshd: user@pts/2 │ ├─723 -bash │ ├─732 systemd-cgls │ └─733 pager ├─user@1000.service │ ├─406 /lib/systemd/systemd --user With user sessions managed by systemd, will it be possible to restrict unprivileged users from migrating to other cgroups? Best regards, Alban ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel