Re: [systemd-devel] [ANNOUNCE] Git development moved to github

2015-06-10 Thread Alban Crequy
On Tue, Jun 9, 2015 at 11:37 PM, Lennart Poettering
lenn...@poettering.net wrote:
 On Tue, 09.06.15 13:04, Filipe Brandenburger (filbran...@google.com) wrote:

 On Tue, Jun 9, 2015 at 12:59 PM, Lennart Poettering
 lenn...@poettering.net wrote:
  [...] so we comment and ask for a new PR, and close the old one.

 See my previous comment, I think this cure is worse than the
 disease :-)

 Why would you say this? Why are multiple sequencial PR, where the old
 obsoleted ones are closed and locked that bad?

 Instead, just reuse the same PR and use `git push -f` to ship new
 versions of the commits to the same branch... Yes it's awful but
 unfortunately that's how GitHub works...

 Yeah, it is awful, and loses all the comments, as well is incompatible
 with having multiple people making patch suggestions for the same
 issue.

FWIW it only loses the comments if people comment on individual
commits instead of commenting on the Files changed tab of a PR. I
usually comment in this way on purpose instead of commenting on
commits, so that the history of comments are kept in the PR, even
after rebase (it might be folded if the chunk of the patch is not
there anymore, but the comment is still in the PR). If you really want
to comment on an individual commit (but I don't recommend it), you can
include the reference of the PR in your comment (#42), then github
will keep your comment attached to the PR.

I think it is fine as it is as long as people comment in the Files
changed tab.

 To work around the problem of line comments being lost, just ask
 *reviewers* to make most of the relevant comments in the PR thread and
 keep line comments to simple comments that are probably not going to
 be relevant when they're obliterated...

 Well, *the* major reason we switched to github is actually taking
 benefit of the much more powerful inline review tool that's a pleasure
 to work with. Reviews by mail are just awful, and reviews out-of-line
 are even worse.

 Lennart

 --
 Lennart Poettering, Red Hat
 ___
 systemd-devel mailing list
 systemd-devel@lists.freedesktop.org
 http://lists.freedesktop.org/mailman/listinfo/systemd-devel
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] systemd-218 - Requisite implies TriggeredByRestartOf

2015-05-19 Thread Alban Crequy
On Tue, May 19, 2015 at 1:26 AM, Lennart Poettering
lenn...@poettering.net wrote:
 On Tue, 19.05.15 00:55, Lennart Poettering (lenn...@poettering.net) wrote:

 On Thu, 14.05.15 21:23, Evert (evert.gen...@planet.nl) wrote:

  Hi,
 
  According to the systemd documentation, Requisite disallows starting a
  unit unless the specified unit has been started. This seems to work
  fine, however, if the specified unit has been restarted, this unit will
  be started too!
  This is not what should happen and it doesn't happen with a stop and
  start of the specified unit, so clearly, restart behaves different than
  stop followed by start.

 Hmm, I figure this is a bug in systemd.

 Whenever you have a unit A declare a Requisite= on a unit B then this
 will actually create two dependencies: the actual Requisite= one plus
 one in the opposite direction, of type RequiredBy=. Dependencies of
 type RequiredBy= are mostly internal to systemd, you cannot
 configure them directly (though you can query them using systemctl
 show). They are used to track down what to stop when we get
 systemctl stop by the user. Now, as a special shortcut we currently map
 Requisite= and Required= to a reverse dependency of
 RequiredBy=, which is problematichere, since we'll hence make no
 distinction between the two when processing systemctl stop.

 I figure we need to split up the reverse dep here, and introduce a
 seperate RequisiteBy= dependency so that we can avoid this problem...

 Fixed in git. Please verify.

The commit be7d9ff730cb88d7c6a869dd5c47754c78ceaef2 (core: introduce
seperate reverse dependencies for Requires= and Requisite=)
introduced a regression in rkt: the container does not stop anymore
when the main service terminates. In my test, grep says we have one
unit using a Requisite* option (systemd-update-utmp-runlevel.service).
The main unit file uses one Requires option:

== sha512-0543bc3759cd76d21d8605cd55e19bea.service ==

[Unit]
Description=coreos.com/rkt-inspect
DefaultDependencies=false
OnFailureJobMode=isolate
OnFailure=reaper.service
Wants=exit-watcher.service
Requires=prepare-app@-opt-stage2-sha512\x2d0543bc3759cd76d21d8605cd55e19bea-rootfs.service
After=prepare-app@-opt-stage2-sha512\x2d0543bc3759cd76d21d8605cd55e19bea-rootfs.service

[Service]
Restart=no
ExecStart=/diagexec
/opt/stage2/sha512-0543bc3759cd76d21d8605cd55e19bea/rootfs /
/rkt/env/sha512-0543bc3759cd76d21d8605cd55e19bea 0 0 /inspect
-print-msg=Hellooo -sleep=3
User=0
Group=0


I am trying to understand the regression... it's possible that it is
rkt using systemd incorrectly.

Cheers,
Alban
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] systemd-218 - Requisite implies TriggeredByRestartOf

2015-05-19 Thread Alban Crequy
On Tue, May 19, 2015 at 3:44 PM, Lennart Poettering
lenn...@poettering.net wrote:
 On Tue, 19.05.15 13:08, Alban Crequy (al...@endocode.com) wrote:

 The commit be7d9ff730cb88d7c6a869dd5c47754c78ceaef2 (core: introduce
 seperate reverse dependencies for Requires= and Requisite=)
 introduced a regression in rkt: the container does not stop anymore
 when the main service terminates. In my test, grep says we have one
 unit using a Requisite* option (systemd-update-utmp-runlevel.service).
 The main unit file uses one Requires option:

 I have the suspicion that f3b85044c845de03de05135c1dd5f3298bf3e5a2
 might fix your issue, can you check?

Yes, it works now. Thanks!

 (only if you have units that use StopWhenUnneeded=yes, otherwise it's
 a different issue)

Yes, there was a unit with StopWhenUnneeded=yes.

Cheers,
Alban
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] [PATCH] nspawn: cloexec extraneous fds

2015-05-18 Thread Alban Crequy
On Mon, May 18, 2015 at 2:00 PM, Lennart Poettering
lenn...@poettering.net wrote:
 On Mon, 18.05.15 10:34, Alban Crequy (al...@endocode.com) wrote:

 On Wed, May 13, 2015 at 6:14 PM, Lennart Poettering
 lenn...@poettering.net wrote:
  On Mon, 11.05.15 16:41, Alban Crequy (alban.cre...@gmail.com) wrote:
 
   src/nspawn/nspawn.c | 9 -
   1 file changed, 8 insertions(+), 1 deletion(-)
 
  diff --git a/src/nspawn/nspawn.c b/src/nspawn/nspawn.c
  index 71a6239..2e45c3b 100644
  --- a/src/nspawn/nspawn.c
  +++ b/src/nspawn/nspawn.c
  @@ -3739,6 +3739,9 @@ int main(int argc, char *argv[]) {
   bool root_device_rw = true, home_device_rw = true, srv_device_rw 
  = true;
   _cleanup_close_ int master = -1, image_fd = -1;
   _cleanup_fdset_free_ FDSet *fds = NULL;
  +_cleanup_fdset_free_ FDSet *misc_fds = NULL;
  +int fd;
  +Iterator i;
   int r, n_fd_passed, loop_nr = -1;
   char veth_name[IFNAMSIZ];
   bool secondary = false, remove_subvol = false;
  @@ -3775,7 +3778,11 @@ int main(int argc, char *argv[]) {
   goto finish;
   }
   }
  -fdset_close_others(fds);
  +fdset_new_fill(misc_fds);
  +FDSET_FOREACH(fd, fds, i) {
  +fdset_remove(misc_fds, fd);
  +}
  +fdset_cloexec(misc_fds, true);
   log_open();
 
  Do we really need an extra FDSet object for this? Why not just remove
  the fdset_close_others() from the nspawn parent and adding it to the
  child process instead, without depending on O_CLOEXEC? Appears much
  simpler to as it avoids keeping two fdsets around...

 fdset_close_others() iterates on the open file descriptor found in
 /proc/self/fd/* at the time of the call, so I would need to make sure
 it does not harvest the newly opened file descriptors such as the ones
 used by the barrier, one end of the kmsg socket pair, one end of the
 rtnl socket pair, and possibly others. I would need to use fdset_put()
 on each of those fds. Although the fds used by the barrier are
 exported in struct Barrier, it is kind of internal structure.

 Well, we could just invoke it after the barrier stuff has done its
 deed, right before we invoke exec(), no?

Right. Sorry for the noise, I'll send another patch shortly, after I
finish testing it.
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


[systemd-devel] [PATCH] nspawn: close extra fds before execing init

2015-05-18 Thread Alban Crequy
From: Alban Crequy al...@endocode.com

When systemd-nspawn gets exec*()ed, it inherits the followings file
descriptors:
- 0, 1, 2: stdin, stdout, stderr
- SD_LISTEN_FDS_START, ... SD_LISTEN_FDS_START+LISTEN_FDS: file
  descriptors passed by the system manager (useful for socket
  activation). They are passed to the child process (process leader).
- extra lock fd: rkt passes a locked directory as an extra fd, so the
  directory remains locked as long as the container is alive.

systemd-nspawn used to close all open fds except 0, 1, 2 and the
SD_LISTEN_FDS_START..SD_LISTEN_FDS_START+LISTEN_FDS. This patch delays
the close just before the exec so the nspawn process (parent) keeps the
extra fds open.

This patch supersedes the previous attempt (cloexec extraneous fds):
http://lists.freedesktop.org/archives/systemd-devel/2015-May/031608.html
---
 src/nspawn/nspawn.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/nspawn/nspawn.c b/src/nspawn/nspawn.c
index 8aa7b45..85a7bad 100644
--- a/src/nspawn/nspawn.c
+++ b/src/nspawn/nspawn.c
@@ -3998,7 +3998,6 @@ int main(int argc, char *argv[]) {
 goto finish;
 }
 }
-fdset_close_others(fds);
 log_open();
 
 if (arg_directory) {
@@ -4509,6 +4508,8 @@ int main(int argc, char *argv[]) {
  * setup, too... */
 (void) barrier_place_and_sync(barrier); /* #5 */
 
+(void) fdset_close_others(fds);
+
 if (arg_boot) {
 char **a;
 size_t l;
-- 
2.1.4

___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] [PATCH] nspawn: cloexec extraneous fds

2015-05-18 Thread Alban Crequy
On Wed, May 13, 2015 at 6:14 PM, Lennart Poettering
lenn...@poettering.net wrote:
 On Mon, 11.05.15 16:41, Alban Crequy (alban.cre...@gmail.com) wrote:

  src/nspawn/nspawn.c | 9 -
  1 file changed, 8 insertions(+), 1 deletion(-)

 diff --git a/src/nspawn/nspawn.c b/src/nspawn/nspawn.c
 index 71a6239..2e45c3b 100644
 --- a/src/nspawn/nspawn.c
 +++ b/src/nspawn/nspawn.c
 @@ -3739,6 +3739,9 @@ int main(int argc, char *argv[]) {
  bool root_device_rw = true, home_device_rw = true, srv_device_rw = 
 true;
  _cleanup_close_ int master = -1, image_fd = -1;
  _cleanup_fdset_free_ FDSet *fds = NULL;
 +_cleanup_fdset_free_ FDSet *misc_fds = NULL;
 +int fd;
 +Iterator i;
  int r, n_fd_passed, loop_nr = -1;
  char veth_name[IFNAMSIZ];
  bool secondary = false, remove_subvol = false;
 @@ -3775,7 +3778,11 @@ int main(int argc, char *argv[]) {
  goto finish;
  }
  }
 -fdset_close_others(fds);
 +fdset_new_fill(misc_fds);
 +FDSET_FOREACH(fd, fds, i) {
 +fdset_remove(misc_fds, fd);
 +}
 +fdset_cloexec(misc_fds, true);
  log_open();

 Do we really need an extra FDSet object for this? Why not just remove
 the fdset_close_others() from the nspawn parent and adding it to the
 child process instead, without depending on O_CLOEXEC? Appears much
 simpler to as it avoids keeping two fdsets around...

fdset_close_others() iterates on the open file descriptor found in
/proc/self/fd/* at the time of the call, so I would need to make sure
it does not harvest the newly opened file descriptors such as the ones
used by the barrier, one end of the kmsg socket pair, one end of the
rtnl socket pair, and possibly others. I would need to use fdset_put()
on each of those fds. Although the fds used by the barrier are
exported in struct Barrier, it is kind of internal structure.

By using fdset_cloexec() instead, I don't need to keep in sync the fds
to add in the FDSet whenever a new fd is added in systemd-nspawn.
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


[systemd-devel] [PATCH] [PATCH v4] core: Private*/Protect* options with RootDirectory

2015-05-18 Thread Alban Crequy
From: Alban Crequy al...@endocode.com

When a service is chrooted with the option RootDirectory=/opt/..., then
the options PrivateDevices, PrivateTmp, ProtectHome, ProtectSystem must
mount the directories under $RootDirectory/{dev,tmp,home,usr,boot}.

The test-ns tool can test setup_namespace() with and without chroot:
 $ sudo TEST_NS_PROJECTS=/home/lennart/projects ./test-ns
 $ sudo TEST_NS_CHROOT=/home/alban/debian-tree 
TEST_NS_PROJECTS=/home/alban/debian-tree/home/alban/Documents ./test-ns

v4:
 - rebase on origin/master
 - coding style
 - use exec_needs_mount_namespace()
 - rename variable chroot - root_directory
 - rename variable new_mnt_namespace - needs_mount_namespace
 - prefer prefix_roota() to strjoina()
 - update ./test-ns to print something

v3:
 - add missing include mkdir.h
 - fixes after the review
   http://lists.freedesktop.org/archives/systemd-devel/2015-April/031202.html
   - move logic to setup_namespace()
   - either call setup_namespace() or chroot()
 - update ./test-ns so that I can run it without having /home/lennart

v2: 
http://lists.freedesktop.org/archives/systemd-devel/2015-February/028636.html
 - create the $RootDirectory/dev directory if missing. This is
   consistent with mount unit creating the mount points if missing.
 - avoid arbitrary limitation on 512 characters

v1: 
http://lists.freedesktop.org/archives/systemd-devel/2015-February/028609.html
---
 src/core/execute.c   |  8 --
 src/core/namespace.c | 80 +---
 src/core/namespace.h |  3 +-
 src/test/test-ns.c   | 24 ++--
 4 files changed, 100 insertions(+), 15 deletions(-)

diff --git a/src/core/execute.c b/src/core/execute.c
index 0cca481..97498b2 100644
--- a/src/core/execute.c
+++ b/src/core/execute.c
@@ -1307,6 +1307,7 @@ static int exec_child(
 uid_t uid = UID_INVALID;
 gid_t gid = GID_INVALID;
 int i, r;
+bool needs_mount_namespace;
 
 assert(unit);
 assert(command);
@@ -1585,7 +1586,9 @@ static int exec_child(
 }
 }
 
-if (exec_needs_mount_namespace(context, params, runtime)) {
+needs_mount_namespace = exec_needs_mount_namespace(context, params, 
runtime);
+
+if (needs_mount_namespace) {
 char *tmp = NULL, *var = NULL;
 
 /* The runtime struct only contains the parent
@@ -1602,6 +1605,7 @@ static int exec_child(
 }
 
 r = setup_namespace(
+params-apply_chroot ? context-root_directory 
: NULL,
 context-read_write_dirs,
 context-read_only_dirs,
 context-inaccessible_dirs,
@@ -1627,7 +1631,7 @@ static int exec_child(
 }
 
 if (params-apply_chroot) {
-if (context-root_directory)
+if (!needs_mount_namespace  context-root_directory)
 if (chroot(context-root_directory)  0) {
 *exit_status = EXIT_CHROOT;
 return -errno;
diff --git a/src/core/namespace.c b/src/core/namespace.c
index bee2839..4f9b95e 100644
--- a/src/core/namespace.c
+++ b/src/core/namespace.c
@@ -36,6 +36,7 @@
 #include dev-setup.h
 #include selinux-util.h
 #include namespace.h
+#include mkdir.h
 
 typedef enum MountMode {
 /* This is ordered by priority! */
@@ -124,6 +125,22 @@ static void drop_duplicates(BindMount *m, unsigned *n) {
 *n = t - m;
 }
 
+static int mount_move_root(const char *path) {
+if (chdir(path)  0)
+return -errno;
+
+if (mount(path, /, NULL, MS_MOVE, NULL)  0)
+return -errno;
+
+if (chroot(.)  0)
+return -errno;
+
+if (chdir(/)  0)
+return -errno;
+
+return 0;
+}
+
 static int mount_dev(BindMount *m) {
 static const char devnodes[] =
 /dev/null\0
@@ -225,7 +242,13 @@ static int mount_dev(BindMount *m) {
 
 dev_setup(temporary_mount);
 
-if (mount(dev, /dev/, NULL, MS_MOVE, NULL)  0) {
+/* Create the /dev directory if missing. It is more likely to be
+ * missing when the service is started with RootDirectory. This is
+ * consistent with mount units creating the mount points when missing.
+ */
+(void) mkdir_p_label(m-path, 0755);
+
+if (mount(dev, m-path, NULL, MS_MOVE, NULL)  0) {
 r = -errno;
 goto fail;
 }
@@ -404,6 +427,7 @@ static int make_read_only(BindMount *m) {
 }
 
 int setup_namespace(
+const char* root_directory,
 char** read_write_dirs,
 char** read_only_dirs,
 char** inaccessible_dirs,
@@ -449,37 +473,56 @@ int setup_namespace(
 return r;
 
 if (tmp_dir) {
-m-path = /tmp

[systemd-devel] [PATCH] [PATCH v3] core: Private*/Protect* options with RootDirectory

2015-05-12 Thread Alban Crequy
From: Alban Crequy al...@endocode.com

When a service is chrooted with the option RootDirectory=/opt/..., then
the options PrivateDevices, PrivateTmp, ProtectHome, ProtectSystem must
mount the directories under $RootDirectory/{dev,tmp,home,usr,boot}.

This can be tested with test-ns as root:
 # export TEST_NS_PROJECTS=/home/alban/debian-tree/dev2/
 # export TEST_NS_CHROOT=/home/alban/debian-tree
 # ./test-ns

Questions:
 - I'm not quite sure about whether we can rollback in setup_namespace()

v3:
 - add missing include mkdir.h
 - fixes after the review
   http://lists.freedesktop.org/archives/systemd-devel/2015-April/031202.html
   - move logic to setup_namespace()
   - either call setup_namespace() or chroot()
 - update ./test-ns so that I can run it without having /home/lennart

v2: 
http://lists.freedesktop.org/archives/systemd-devel/2015-February/028636.html
 - create the $RootDirectory/dev directory if missing. This is
   consistent with mount unit creating the mount points if missing.
 - avoid arbitrary limitation on 512 characters

v1: 
http://lists.freedesktop.org/archives/systemd-devel/2015-February/028609.html
---
 src/core/execute.c   |  9 --
 src/core/namespace.c | 77 +---
 src/core/namespace.h |  3 +-
 src/test/test-ns.c   | 10 +--
 4 files changed, 83 insertions(+), 16 deletions(-)

diff --git a/src/core/execute.c b/src/core/execute.c
index 1a297ba..d4ccac6 100644
--- a/src/core/execute.c
+++ b/src/core/execute.c
@@ -1277,6 +1277,7 @@ static int exec_child(
 uid_t uid = UID_INVALID;
 gid_t gid = GID_INVALID;
 int i, r;
+bool new_mnt_namespace;
 
 assert(unit);
 assert(command);
@@ -1555,7 +1556,7 @@ static int exec_child(
 }
 }
 
-if (!strv_isempty(context-read_write_dirs) ||
+new_mnt_namespace = !strv_isempty(context-read_write_dirs) ||
 !strv_isempty(context-read_only_dirs) ||
 !strv_isempty(context-inaccessible_dirs) ||
 context-mount_flags != 0 ||
@@ -1563,8 +1564,9 @@ static int exec_child(
 params-bus_endpoint_path ||
 context-private_devices ||
 context-protect_system != PROTECT_SYSTEM_NO ||
-context-protect_home != PROTECT_HOME_NO) {
+context-protect_home != PROTECT_HOME_NO;
 
+if (new_mnt_namespace) {
 char *tmp = NULL, *var = NULL;
 
 /* The runtime struct only contains the parent
@@ -1581,6 +1583,7 @@ static int exec_child(
 }
 
 r = setup_namespace(
+params-apply_chroot ? context-root_directory 
: NULL,
 context-read_write_dirs,
 context-read_only_dirs,
 context-inaccessible_dirs,
@@ -1606,7 +1609,7 @@ static int exec_child(
 }
 
 if (params-apply_chroot) {
-if (context-root_directory)
+if (!new_mnt_namespace  context-root_directory)
 if (chroot(context-root_directory)  0) {
 *exit_status = EXIT_CHROOT;
 return -errno;
diff --git a/src/core/namespace.c b/src/core/namespace.c
index 718da23..9f5ba92 100644
--- a/src/core/namespace.c
+++ b/src/core/namespace.c
@@ -36,6 +36,7 @@
 #include dev-setup.h
 #include selinux-util.h
 #include namespace.h
+#include mkdir.h
 
 typedef enum MountMode {
 /* This is ordered by priority! */
@@ -124,6 +125,26 @@ static void drop_duplicates(BindMount *m, unsigned *n) {
 *n = t - m;
 }
 
+static int mount_move_root(const char *path) {
+if (chdir(path)  0) {
+return -errno;
+}
+
+if (mount(path, /, NULL, MS_MOVE, NULL)  0) {
+return -errno;
+}
+
+if (chroot(.)  0) {
+return -errno;
+}
+
+if (chdir(/)  0) {
+return -errno;
+}
+
+return 0;
+}
+
 static int mount_dev(BindMount *m) {
 static const char devnodes[] =
 /dev/null\0
@@ -225,7 +246,13 @@ static int mount_dev(BindMount *m) {
 
 dev_setup(temporary_mount);
 
-if (mount(dev, /dev/, NULL, MS_MOVE, NULL)  0) {
+/* Create the /dev directory if missing. It is more likely to be
+ * missing when the service is started with RootDirectory. This is
+ * consistent with mount units creating the mount points when missing.
+ */
+mkdir_p_label (m-path, 0755);
+
+if (mount(dev, m-path, NULL, MS_MOVE, NULL)  0) {
 r = -errno;
 goto fail;
 }
@@ -404,6 +431,7 @@ static int make_read_only(BindMount *m) {
 }
 
 int setup_namespace(
+const char* chroot,
 char** read_write_dirs,
 char** read_only_dirs,
 char

[systemd-devel] [PATCH] [PATCH v2] nspawn: check the pid in SIGCHLD handler before terminating the container

2015-05-11 Thread Alban Crequy
From: Alban Crequy al...@endocode.com

When a process starts systemd-nspawn with exec*() without fork(),
systemd-nspawn can be the parent process of children processes unknown
to systemd-nspawn. It can then receive the signal SIGCHLD for both the
container leader process and the previously started processes. So it
should distinguish them.

v2:
 - correctly check if a child was in a waitable state.
---
 src/nspawn/nspawn.c | 55 +++--
 1 file changed, 49 insertions(+), 6 deletions(-)

diff --git a/src/nspawn/nspawn.c b/src/nspawn/nspawn.c
index 05d2c71..71a6239 100644
--- a/src/nspawn/nspawn.c
+++ b/src/nspawn/nspawn.c
@@ -23,6 +23,7 @@
 #include sched.h
 #include unistd.h
 #include sys/types.h
+#include sys/wait.h
 #include sys/mount.h
 #include stdlib.h
 #include string.h
@@ -3511,6 +3512,12 @@ static int change_uid_gid(char **_home) {
 return 0;
 }
 
+typedef struct SigChildData {
+pid_t leader_pid;
+siginfo_t leader_status;
+bool terminated;
+} SigChildData;
+
 /*
  * Return values:
  *  0 : wait_for_terminate() failed to get the state of the
@@ -3528,13 +3535,17 @@ static int change_uid_gid(char **_home) {
  * That is, success is indicated by a return value of zero, and an
  * error is indicated by a non-zero value.
  */
-static int wait_for_container(pid_t pid, ContainerStatus *container) {
+static int wait_for_container(SigChildData *sigchld_ctx, pid_t pid, 
ContainerStatus *container) {
 siginfo_t status;
 int r;
 
-r = wait_for_terminate(pid, status);
-if (r  0)
-return log_warning_errno(r, Failed to wait for container: 
%m);
+if (sigchld_ctx-terminated) {
+status = sigchld_ctx-leader_status;
+} else {
+r = wait_for_terminate(pid, status);
+if (r  0)
+return log_warning_errno(r, Failed to wait for 
container: %m);
+}
 
 switch (status.si_code) {
 
@@ -3594,6 +3605,35 @@ static int on_orderly_shutdown(sd_event_source *s, const 
struct signalfd_siginfo
 return 0;
 }
 
+static int on_sigchld(sd_event_source *s, const struct signalfd_siginfo *si, 
void *userdata) {
+SigChildData *ctx = userdata;
+pid_t leader_pid = ctx-leader_pid;
+
+/* several terminated children could be merged in a single SIGCHLD, so
+ * don't rely on si-ssi_pid. */
+
+while (1) {
+int r;
+siginfo_t status;
+
+zero(status);
+r = waitid(P_ALL, -1, status, WEXITED | WNOHANG);
+if (r  0)
+break;
+
+if (status.si_pid = 0)
+break;
+
+if (status.si_pid == leader_pid) {
+ctx-leader_status = status;
+ctx-terminated = true;
+return sd_event_exit(sd_event_source_get_event(s), 0);
+}
+}
+
+return 0;
+}
+
 static int determine_names(void) {
 int r;
 
@@ -3917,6 +3957,7 @@ int main(int argc, char *argv[]) {
 .sa_handler = nop_handler,
 .sa_flags = SA_NOCLDSTOP,
 };
+SigChildData sigchld_ctx = {0,};
 
 r = barrier_create(barrier);
 if (r  0) {
@@ -4386,7 +4427,9 @@ int main(int argc, char *argv[]) {
 }
 
 /* simply exit on sigchld */
-sd_event_add_signal(event, NULL, SIGCHLD, 
NULL, NULL);
+sigchld_ctx.leader_pid = pid;
+sigchld_ctx.terminated = false;
+sd_event_add_signal(event, NULL, SIGCHLD,  
on_sigchld , sigchld_ctx);
 
 if (arg_expose_ports) {
 r = watch_rtnl(event, 
rtnl_socket_pair[0], exposed, rtnl);
@@ -4425,7 +4468,7 @@ int main(int argc, char *argv[]) {
 /* Normally redundant, but better safe than sorry */
 kill(pid, SIGKILL);
 
-r = wait_for_container(pid, container_status);
+r = wait_for_container(sigchld_ctx, pid, container_status);
 pid = 0;
 
 if (r  0)
-- 
2.1.4

___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


[systemd-devel] [PATCH] nspawn: cloexec extraneous fds

2015-05-11 Thread Alban Crequy
From: Alban Crequy al...@endocode.com

When systemd-nspawn gets exec*()ed, it inherits the followings file
descriptors:
- 0, 1, 2: stdin, stdout, stderr
- SD_LISTEN_FDS_START, ... SD_LISTEN_FDS_START+LISTEN_FDS: file
  descriptors passed by the system manager (useful for socket
  activation). They are passed to the child process (process leader).
- extra lock fd: rkt passes a locked directory as an extra fd, so the
  directory remains locked as long as the container is alive.

systemd-nspawn used to close all open fds except 0, 1, 2 and the
SD_LISTEN_FDS_START..SD_LISTEN_FDS_START+LISTEN_FDS. This patch just
cloexecs them instead so they stay open in the systemd-nspawn process
but they are not passed to the process leader.
---
 src/nspawn/nspawn.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/src/nspawn/nspawn.c b/src/nspawn/nspawn.c
index 71a6239..2e45c3b 100644
--- a/src/nspawn/nspawn.c
+++ b/src/nspawn/nspawn.c
@@ -3739,6 +3739,9 @@ int main(int argc, char *argv[]) {
 bool root_device_rw = true, home_device_rw = true, srv_device_rw = 
true;
 _cleanup_close_ int master = -1, image_fd = -1;
 _cleanup_fdset_free_ FDSet *fds = NULL;
+_cleanup_fdset_free_ FDSet *misc_fds = NULL;
+int fd;
+Iterator i;
 int r, n_fd_passed, loop_nr = -1;
 char veth_name[IFNAMSIZ];
 bool secondary = false, remove_subvol = false;
@@ -3775,7 +3778,11 @@ int main(int argc, char *argv[]) {
 goto finish;
 }
 }
-fdset_close_others(fds);
+fdset_new_fill(misc_fds);
+FDSET_FOREACH(fd, fds, i) {
+fdset_remove(misc_fds, fd);
+}
+fdset_cloexec(misc_fds, true);
 log_open();
 
 if (arg_directory) {
-- 
2.1.4

___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


[systemd-devel] [PATCH] nspawn: check the pid in SIGCHLD handler before terminating the container

2015-05-10 Thread Alban Crequy
From: Alban Crequy al...@endocode.com

When a process starts systemd-nspawn with exec*() without fork(),
systemd-nspawn can be the parent process of children processes unknown
to systemd-nspawn. It can then receive the signal SIGCHLD for both the
container leader process and the previously started processes. So it
should distinguish them.
---
 src/nspawn/nspawn.c | 51 +--
 1 file changed, 45 insertions(+), 6 deletions(-)

diff --git a/src/nspawn/nspawn.c b/src/nspawn/nspawn.c
index 05d2c71..49f6650 100644
--- a/src/nspawn/nspawn.c
+++ b/src/nspawn/nspawn.c
@@ -23,6 +23,7 @@
 #include sched.h
 #include unistd.h
 #include sys/types.h
+#include sys/wait.h
 #include sys/mount.h
 #include stdlib.h
 #include string.h
@@ -3511,6 +3512,12 @@ static int change_uid_gid(char **_home) {
 return 0;
 }
 
+typedef struct SigChildData {
+pid_t leader_pid;
+siginfo_t leader_status;
+bool terminated;
+} SigChildData;
+
 /*
  * Return values:
  *  0 : wait_for_terminate() failed to get the state of the
@@ -3528,13 +3535,17 @@ static int change_uid_gid(char **_home) {
  * That is, success is indicated by a return value of zero, and an
  * error is indicated by a non-zero value.
  */
-static int wait_for_container(pid_t pid, ContainerStatus *container) {
+static int wait_for_container(SigChildData *sigchld_ctx, pid_t pid, 
ContainerStatus *container) {
 siginfo_t status;
 int r;
 
-r = wait_for_terminate(pid, status);
-if (r  0)
-return log_warning_errno(r, Failed to wait for container: 
%m);
+if (sigchld_ctx-terminated) {
+status = sigchld_ctx-leader_status;
+} else {
+r = wait_for_terminate(pid, status);
+if (r  0)
+return log_warning_errno(r, Failed to wait for 
container: %m);
+}
 
 switch (status.si_code) {
 
@@ -3594,6 +3605,31 @@ static int on_orderly_shutdown(sd_event_source *s, const 
struct signalfd_siginfo
 return 0;
 }
 
+static int on_sigchld(sd_event_source *s, const struct signalfd_siginfo *si, 
void *userdata) {
+SigChildData *ctx = userdata;
+pid_t leader_pid = ctx-leader_pid;
+
+/* several terminated children could be merged in a single SIGCHLD, so
+ * don't rely on si-ssi_pid. */
+
+while (1) {
+int r;
+siginfo_t status;
+
+zero(status);
+r = waitid(P_ALL, -1, status, WEXITED | WNOHANG);
+if (r  0)
+break;
+if (status.si_pid == leader_pid) {
+ctx-leader_status = status;
+ctx-terminated = true;
+return sd_event_exit(sd_event_source_get_event(s), 0);
+}
+}
+
+return 0;
+}
+
 static int determine_names(void) {
 int r;
 
@@ -3917,6 +3953,7 @@ int main(int argc, char *argv[]) {
 .sa_handler = nop_handler,
 .sa_flags = SA_NOCLDSTOP,
 };
+SigChildData sigchld_ctx = {0,};
 
 r = barrier_create(barrier);
 if (r  0) {
@@ -4386,7 +4423,9 @@ int main(int argc, char *argv[]) {
 }
 
 /* simply exit on sigchld */
-sd_event_add_signal(event, NULL, SIGCHLD, 
NULL, NULL);
+sigchld_ctx.leader_pid = pid;
+sigchld_ctx.terminated = false;
+sd_event_add_signal(event, NULL, SIGCHLD,  
on_sigchld , sigchld_ctx);
 
 if (arg_expose_ports) {
 r = watch_rtnl(event, 
rtnl_socket_pair[0], exposed, rtnl);
@@ -4425,7 +4464,7 @@ int main(int argc, char *argv[]) {
 /* Normally redundant, but better safe than sorry */
 kill(pid, SIGKILL);
 
-r = wait_for_container(pid, container_status);
+r = wait_for_container(sigchld_ctx, pid, container_status);
 pid = 0;
 
 if (r  0)
-- 
2.1.4

___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] [Q] About supporting nested systemd daemon

2015-04-30 Thread Alban Crequy
On Wed, Feb 25, 2015 at 6:48 PM, Lennart Poettering
lenn...@poettering.net wrote:
 On Wed, 25.02.15 00:05, Cyrill Gorcunov (gorcu...@gmail.com) wrote:

 Hi all! I would really appreciate if someone enlighten me if there is some 
 simple
 solution for the problem we met in OpenVZ: modern containers are mostly 
 systemd
 based so that once it is started up the systemd daemon mounts own instance of
 the systemd cgroup (if previously has not been pre-mounted by container 
 startup
 tools or whatever). To make a strict isolation of nested systemd cgroup (by
 nested I mean systemd cgroup instance mounted inside container) we've 
 patched
 the kernel so that container's systemd obtains own instance of cgroup 
 non-intersected
 anyhow with one present on a host system.

 And we would really love to get rid of this kind of kernel's hack but be able
 to isolate nested systemd with own cgroup instance using solely userspace
 tools. Is there some way to reach this?

 Not really. cgroupfs doesn't really allow that. First of all the root
 cgroup has a different set of attributes than child cgroups, hence you
 cannot mount an arbitrary child to the root cgroup and assume it
 works. But even worse, /proc/$PID/cgroup actually contains the full
 cgroup path, and hence mounting only a subtree would break the
 refernces from that file.

 systemd-nspawn nowadays mounts all hierarchies into the container, but
 mounts all controller hierarchies read-only, and of the name=systemd
 hierarchy mounts everything read-only, except the subtree the
 container is allowed to manage. That way only the cgroup tree the
 container needs access to is writable to it. That solution however
 does not hide the cgroup tree. A process running inside the container
 can still go an explore the tree and its attributes. However, all
 other groups will appear empty to it, since processes not in the
 container PID namespaces will be suppressed when reading the member
 process list.

To sum up what systemd-nspawn is currently mounting in the container:
- /sys/fs/cgroup/systemd/  --  mounted RO
- /sys/fs/cgroup/systemd/machine.slice/machine-xxx.scope/  -- mounted RW
- /sys/fs/cgroup/cpu,cpuacct/  --  mounted RO
- etc. for other cgroup hierarchies  --  mounted RO

In order to let systemd in the container restrict cpu, memory, etc. on
some of its services (see manpage systemd.resource-control(5)), rkt
would like systemd-nspawn to mount a subtree of some hierarchy
(cpu,cpuacct, memory) in read-write mode.

Is there any issues with changing the systemd-nspawn mounts in the
following way:
- /sys/fs/cgroup/systemd/  --  mounted RO
- /sys/fs/cgroup/systemd/machine.slice/machine-xxx.scope/  -- mounted RW
- /sys/fs/cgroup/cpu,cpuacct/  --  mounted RO
- /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-xxx.scope/  -- mounted RW
- etc. for other cgroup hierarchies.

Iago wrote two experimental patches on systemd-nspawn to try that and
it worked. Delegate=yes was enabled in systemd-nspawn in order to test
this:
https://github.com/endocode/systemd/commits/iaguis/delegate

But I would like to know what is missing to make this safe (or if it
is already safe to do).

 There have been proposals on LKML to add cgroup namespacings, but no
 idea where that went.

 LXC created a FUSE emulation of /proc and /sys, called lxcfs to solve
 this problem. Quite honestly I find this a pretty crazy idea however.

 If I understand correctly we can provide separate slice to container's
 systemd leaving the rest of host cgroup in ro mode, right?

 Yes.

 If so maybe there a way to hide host cgroup completely from
 container so it would see only own cgroup in sysfs?

 I don't see how this could work. I mean, you could overmount all other
 cgroup siblings with empty directories in the containers, but not
 realy scalable nor compatible with cgroups being added or removed
 later on...

 Lennart

 --
 Lennart Poettering, Red Hat
 ___
 systemd-devel mailing list
 systemd-devel@lists.freedesktop.org
 http://lists.freedesktop.org/mailman/listinfo/systemd-devel
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] [PATCH] unit: When stopping due to BindsTo=, log which unit caused it

2015-04-24 Thread Alban Crequy
On Fri, Apr 24, 2015 at 12:45 PM, Lennart Poettering
lenn...@poettering.net wrote:
 On Wed, 22.04.15 16:55, Alban Crequy (al...@endocode.com) wrote:

 Thanks for the commits. They don't seem related to containers.

 I can reproduce my issue on git-master:

 sudo ~/git/systemd/systemd-nspawn --register=false --bind
 $HOME/tmp/vol -D debian-tree -b

 Then, in the container, make sure /bin/umount does NOT exist.
 Then halt the container with kill -37 1 (SIGRTMIN+3)

 We require /bin/mount and /bin/umount to exist. We do not support
 systems where you remove those. We also don't support systems without
 glibc either, ... ;-)

Fair enough about the dependency on umount/mount :)

I added /bin/mount and /bin/umount in the container for my test and
now systemd in the container says:

Unit opt-stage2-sha512(...)-rootfs-dir1.mount is bound to inactive
unit 
dev-disk-by\x2duuid-25ea81c8\x2d20d8\x2d4ab1\x2d862c\x2d882a04478837.device.
Stopping, too.

The directory /opt/stage2/sha512xxx/rootfs/dir1 is the bind mount
specified on the systemd-nspawn --bind command line. How can I tell
systemd in the nspawn container *not* to umount the volumes prepared
by nspawn?

Note that systemd is also trying to umount other bind-mounted
directories but it fails because the processes in the container are
using it:
umount: /opt/stage2/sha512-ba93cedc478ed21c03d690b5f026205f/rootfs:
target is busy

And systemd keeps trying to umount them in a busy loop.

How does systemd detect that dev-disk-by...device is inactive?

Cheers,
Alban
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] [PATCH] unit: When stopping due to BindsTo=, log which unit caused it

2015-04-24 Thread Alban Crequy
On Fri, Apr 24, 2015 at 5:34 PM, Lennart Poettering
lenn...@poettering.net wrote:
 On Fri, 24.04.15 17:10, Alban Crequy (al...@endocode.com) wrote:

 On Fri, Apr 24, 2015 at 12:45 PM, Lennart Poettering
 lenn...@poettering.net wrote:
  On Wed, 22.04.15 16:55, Alban Crequy (al...@endocode.com) wrote:
 
  Thanks for the commits. They don't seem related to containers.
 
  I can reproduce my issue on git-master:
 
  sudo ~/git/systemd/systemd-nspawn --register=false --bind
  $HOME/tmp/vol -D debian-tree -b
 
  Then, in the container, make sure /bin/umount does NOT exist.
  Then halt the container with kill -37 1 (SIGRTMIN+3)
 
  We require /bin/mount and /bin/umount to exist. We do not support
  systems where you remove those. We also don't support systems without
  glibc either, ... ;-)

 Fair enough about the dependency on umount/mount :)

 I added /bin/mount and /bin/umount in the container for my test and
 now systemd in the container says:

 Unit opt-stage2-sha512(...)-rootfs-dir1.mount is bound to inactive
 unit 
 dev-disk-by\x2duuid-25ea81c8\x2d20d8\x2d4ab1\x2d862c\x2d882a04478837.device.
 Stopping, too.

 I figure we shouldn't bother with adding bindsto dependencies for
 .device units in containers, given that .device units are not
 supported there anyway.

 Fix:

 http://cgit.freedesktop.org/systemd/systemd/commit/?id=47bc12e1ba35d38edda737dae232088d6d3ae688

 Please verify,

Thanks for the fix, it works for me! I tested the fix both from git
master and cherry-picked on v219.

Cheers,
Alban
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] [PATCH] unit: When stopping due to BindsTo=, log which unit caused it

2015-04-22 Thread Alban Crequy
On Tue, Apr 21, 2015 at 10:35 PM, Zbigniew Jędrzejewski-Szmek
zbys...@in.waw.pl wrote:
 On Tue, Apr 21, 2015 at 03:54:35PM +0200, Alban Crequy wrote:
 On Sat, Feb 28, 2015 at 5:40 PM, Lennart Poettering
 lenn...@poettering.net wrote:
  On Fri, 27.02.15 17:13, Lennart Poettering (lenn...@poettering.net) wrote:
 
  On Thu, 26.02.15 16:50, Martin Pitt (martin.p...@ubuntu.com) wrote:
 
   IMHO it would be prudent to skip adding the BindsTo= if at the time of
   creating the .mount unit the backing .device unit doesn't actually
   exist. In that case it's a mount which isn't managed by systemd, and
   we shouldn't touch it. We mostly want this BindsTo= for mounts where
   the .device units *do* exist, so that when they go away we can clean
   up the mount (mostly for hotpluggable devices and removable media).
   I'll have a deeper look ASAP.
 
  I ran into this myself the other day, and Kay, Daniel and I spent a
  lot of time to come up with a scheme how to deal with the race... And
  I think we have a nice scheme now and I started implementing it.
 
  The idea is that .device units will gain a third state: currently they
  are either dead or plugged, and the new state will be
  tentative. It is entered when a device is referenced in
  /proc/self/mountinfo or /proc/swap, even though it has not yet shown
  up via udev.
 
  This new state has will not result in BindsTo= getting active.
 
  This is implemented now. Please check if this fixes this issue for
  you.

 Which commit implements this?

 I have an issue on nspawn container shutdown with v219 that I didn't
 have with v215. Systemd is trying to unmount the volumes bind mounted
 by nspawn and it fails because /bin/umount does not exist in my
 container. Systemd keeps trying to umount it in a busy loop. Reverting
 06e9783e2cc12eb6514e80c7f0014295f59b (core/mount: add
 dependencies to dynamically mounted mounts too) fixes my issue.
 There was a bunch of fixes on top. See 5bd4b173605142, 628c89cc68ab96fce2d,
 496068a8288084ab3.

Thanks for the commits. They don't seem related to containers.

I can reproduce my issue on git-master:

sudo ~/git/systemd/systemd-nspawn --register=false --bind
$HOME/tmp/vol -D debian-tree -b

Then, in the container, make sure /bin/umount does NOT exist.
Then halt the container with kill -37 1 (SIGRTMIN+3)

It will loop forever on:

[  OK  ] Failed unmounting /home/alban/tmp/vol.
 Unmounting /home/alban/tmp/vol...
[  OK  ] Failed unmounting /home/alban/tmp/vol.
 Unmounting /home/alban/tmp/vol...

I am not sure why it worked fine for me with the older version v215.

Cheers,
Alban
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] [PATCH] unit: When stopping due to BindsTo=, log which unit caused it

2015-04-21 Thread Alban Crequy
On Sat, Feb 28, 2015 at 5:40 PM, Lennart Poettering
lenn...@poettering.net wrote:
 On Fri, 27.02.15 17:13, Lennart Poettering (lenn...@poettering.net) wrote:

 On Thu, 26.02.15 16:50, Martin Pitt (martin.p...@ubuntu.com) wrote:

  IMHO it would be prudent to skip adding the BindsTo= if at the time of
  creating the .mount unit the backing .device unit doesn't actually
  exist. In that case it's a mount which isn't managed by systemd, and
  we shouldn't touch it. We mostly want this BindsTo= for mounts where
  the .device units *do* exist, so that when they go away we can clean
  up the mount (mostly for hotpluggable devices and removable media).
  I'll have a deeper look ASAP.

 I ran into this myself the other day, and Kay, Daniel and I spent a
 lot of time to come up with a scheme how to deal with the race... And
 I think we have a nice scheme now and I started implementing it.

 The idea is that .device units will gain a third state: currently they
 are either dead or plugged, and the new state will be
 tentative. It is entered when a device is referenced in
 /proc/self/mountinfo or /proc/swap, even though it has not yet shown
 up via udev.

 This new state has will not result in BindsTo= getting active.

 This is implemented now. Please check if this fixes this issue for
 you.

Which commit implements this?

I have an issue on nspawn container shutdown with v219 that I didn't
have with v215. Systemd is trying to unmount the volumes bind mounted
by nspawn and it fails because /bin/umount does not exist in my
container. Systemd keeps trying to umount it in a busy loop. Reverting
06e9783e2cc12eb6514e80c7f0014295f59b (core/mount: add
dependencies to dynamically mounted mounts too) fixes my issue.

Cheers,
Alban
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


[systemd-devel] [PATCH] [RFC] umount: reduce verbosity

2015-04-13 Thread Alban Crequy
From: Alban Crequy al...@endocode.com

When a systemd-nspawn container terminates, systemd umounts all bind
mounts that were mounted in the container and generates a log for each
umount.

This additional log_info was added by
bce93b7ac7642426039863493694d8c12812e2a7 for debugging shutdown. But
surely log_debug is enough.

I don't want to see them when when systemd is started with --log-target=null.
There are other log_info that I would like to mute on --log-target=null.
Is changing log_info to log_debug the correct approach?

This patch is useful for:
https://github.com/coreos/rkt/issues/690

When rkt is started with --debug, the systemd logs are printed. When rkt
is started without --debug, systemd is started with --log-target=null in
order to mute the logs.
---
 src/core/umount.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/core/umount.c b/src/core/umount.c
index bee267a..b5a466c 100644
--- a/src/core/umount.c
+++ b/src/core/umount.c
@@ -401,7 +401,7 @@ static int mount_points_list_umount(MountPoint **head, bool 
*changed, bool log_e
 /* Trying to umount. We don't force here since we rely
  * on busy NFS and FUSE file systems to return EBUSY
  * until we closed everything on top of them. */
-log_info(Unmounting %s., m-path);
+log_debug(Unmounting %s., m-path);
 if (umount2(m-path, 0) == 0) {
 if (changed)
 *changed = true;
-- 
2.1.4

___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


[systemd-devel] [PATCH] [PATCH v2] nspawn: fallback on bind mount when mknod fails

2015-03-31 Thread Alban Crequy
From: Alban Crequy al...@endocode.com

Some systems abusively restrict mknod, even when the device node already
exists in /dev. This is unfortunate because it prevents systemd-nspawn
from creating the basic devices in /dev in the container.

This patch implements a workaround: when mknod fails, fallback on bind
mounts.

Additionally, /dev/console was created with a mknod with the same
major/minor as /dev/null before bind mounting a pts on it. This patch
removes the mknod and creates an empty regular file instead.

In order to test this patch, I used the following configuration, which I
think should replicate the system with the abusive restriction on mknod:

  # grep devices /proc/self/cgroup
  4:devices:/user.slice/restrict
  # cat /sys/fs/cgroup/devices/user.slice/restrict/devices.list
  c 1:9 r
  c 5:2 rw
  c 136:* rw
  # systemd-nspawn --register=false -D .
---
 src/nspawn/nspawn.c | 27 ++-
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/src/nspawn/nspawn.c b/src/nspawn/nspawn.c
index b830141..e85d298 100644
--- a/src/nspawn/nspawn.c
+++ b/src/nspawn/nspawn.c
@@ -1449,8 +1449,17 @@ static int copy_devnodes(const char *dest) {
 return -r;
 }
 
-if (mknod(to, st.st_mode, st.st_rdev)  0)
-return log_error_errno(errno, mknod(%s) 
failed: %m, to);
+if (mknod(to, st.st_mode, st.st_rdev)  0) {
+if (errno != EPERM)
+return log_error_errno(errno, 
mknod(%s) failed: %m, to);
+
+/* Some systems abusively restrict mknod but
+ * allow bind mounts. */
+if (touch(to)  0)
+return log_error_errno(errno, touch 
(%s) failed: %m, to);
+if (mount(from, to, NULL, MS_BIND, NULL)  0)
+return log_error_errno(errno, both 
mknod and bind mount (%s) failed: %m, to);
+}
 
 if (arg_userns  arg_uid_shift != UID_INVALID)
 if (lchown(to, arg_uid_shift, arg_uid_shift)  
0)
@@ -1481,7 +1490,6 @@ static int setup_ptmx(const char *dest) {
 static int setup_dev_console(const char *dest, const char *console) {
 _cleanup_umask_ mode_t u;
 const char *to;
-struct stat st;
 int r;
 
 assert(dest);
@@ -1489,24 +1497,17 @@ static int setup_dev_console(const char *dest, const 
char *console) {
 
 u = umask();
 
-if (stat(/dev/null, st)  0)
-return log_error_errno(errno, Failed to stat /dev/null: %m);
-
 r = chmod_and_chown(console, 0600, 0, 0);
 if (r  0)
 return log_error_errno(r, Failed to correct access mode for 
TTY: %m);
 
 /* We need to bind mount the right tty to /dev/console since
  * ptys can only exist on pts file systems. To have something
- * to bind mount things on we create a device node first, and
- * use /dev/null for that since we the cgroups device policy
- * allows us to create that freely, while we cannot create
- * /dev/console. (Note that the major minor doesn't actually
- * matter here, since we mount it over anyway). */
+ * to bind mount things on we create a empty regular file. */
 
 to = strjoina(dest, /dev/console);
-if (mknod(to, (st.st_mode  ~0) | 0600, st.st_rdev)  0)
-return log_error_errno(errno, mknod() for /dev/console 
failed: %m);
+if (touch(to)  0)
+return log_error_errno(errno, touch() for /dev/console 
failed: %m);
 
 if (mount(console, to, NULL, MS_BIND, NULL)  0)
 return log_error_errno(errno, Bind mount for /dev/console 
failed: %m);
-- 
2.1.4

___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] [PATCH] [PATCH v3] nspawn: fallback on bind mount when mknod fails

2015-03-31 Thread Alban Crequy
On Tue, Mar 31, 2015 at 5:35 PM, Dave Reisner d...@falconindy.com wrote:
 On Tue, Mar 31, 2015 at 05:14:48PM +0200, Alban Crequy wrote:
 From: Alban Crequy al...@endocode.com

 Some systems abusively restrict mknod, even when the device node already
 exists in /dev. This is unfortunate because it prevents systemd-nspawn
 from creating the basic devices in /dev in the container.

 This patch implements a workaround: when mknod fails, fallback on bind
 mounts.

 Additionally, /dev/console was created with a mknod with the same
 major/minor as /dev/null before bind mounting a pts on it. This patch
 removes the mknod and creates an empty regular file instead.

 In order to test this patch, I used the following configuration, which I
 think should replicate the system with the abusive restriction on mknod:

   # grep devices /proc/self/cgroup
   4:devices:/user.slice/restrict
   # cat /sys/fs/cgroup/devices/user.slice/restrict/devices.list
   c 1:9 r
   c 5:2 rw
   c 136:* rw
   # systemd-nspawn --register=false -D .

 v2:
  - remove bind, it is not needed since there is already MS_BIND
 v3:
  - fix error management when calling touch()
  - fix lowercase in error message
 ---
  src/nspawn/nspawn.c | 29 -
  1 file changed, 16 insertions(+), 13 deletions(-)

 diff --git a/src/nspawn/nspawn.c b/src/nspawn/nspawn.c
 index b830141..7e56cf2 100644
 --- a/src/nspawn/nspawn.c
 +++ b/src/nspawn/nspawn.c
 @@ -1449,8 +1449,18 @@ static int copy_devnodes(const char *dest) {
  return -r;
  }

 -if (mknod(to, st.st_mode, st.st_rdev)  0)
 -return log_error_errno(errno, mknod(%s) 
 failed: %m, to);
 +if (mknod(to, st.st_mode, st.st_rdev)  0) {
 +if (errno != EPERM)
 +return log_error_errno(errno, 
 mknod(%s) failed: %m, to);
 +
 +/* Some systems abusively restrict mknod but
 + * allow bind mounts. */
 +r = touch(to);
 +if (r  0)
 +return log_error_errno(r, touch 
 (%s) failed: %m, to);

 Is this really what you wanted? It's not obvious that errno will have
 the correct value when %m is evaluated. I would have used strerror(-r)
 here.

log_error_errno() is a macro that ends up calling log_internalv() and
it will assign errno to the parameter r in order to get %m working:

src/shared/log.c: log_internalv():
/* Make sure that %m maps to the specified error */
if (error != 0)
errno = error;


 +if (mount(from, to, NULL, MS_BIND, NULL)  
 0)
 +return log_error_errno(errno, Both 
 mknod and bind mount (%s) failed: %m, to);
 +}

  if (arg_userns  arg_uid_shift != UID_INVALID)
  if (lchown(to, arg_uid_shift, 
 arg_uid_shift)  0)
 @@ -1481,7 +1491,6 @@ static int setup_ptmx(const char *dest) {
  static int setup_dev_console(const char *dest, const char *console) {
  _cleanup_umask_ mode_t u;
  const char *to;
 -struct stat st;
  int r;

  assert(dest);
 @@ -1489,24 +1498,18 @@ static int setup_dev_console(const char *dest, const 
 char *console) {

  u = umask();

 -if (stat(/dev/null, st)  0)
 -return log_error_errno(errno, Failed to stat /dev/null: 
 %m);
 -
  r = chmod_and_chown(console, 0600, 0, 0);
  if (r  0)
  return log_error_errno(r, Failed to correct access mode 
 for TTY: %m);

  /* We need to bind mount the right tty to /dev/console since
   * ptys can only exist on pts file systems. To have something
 - * to bind mount things on we create a device node first, and
 - * use /dev/null for that since we the cgroups device policy
 - * allows us to create that freely, while we cannot create
 - * /dev/console. (Note that the major minor doesn't actually
 - * matter here, since we mount it over anyway). */
 + * to bind mount things on we create a empty regular file. */

  to = strjoina(dest, /dev/console);
 -if (mknod(to, (st.st_mode  ~0) | 0600, st.st_rdev)  0)
 -return log_error_errno(errno, mknod() for /dev/console 
 failed: %m);
 +r = touch(to);
 +if (r  0)
 +return log_error_errno(r, touch() for /dev/console failed: 
 %m);

  if (mount(console, to, NULL, MS_BIND, NULL)  0)
  return log_error_errno(errno, Bind mount for /dev/console 
 failed: %m);
 --
 2.1.4

 ___
 systemd-devel mailing list
 systemd-devel@lists.freedesktop.org
 http

[systemd-devel] [PATCH] nspawn: fallback on bind mount when mknod fails

2015-03-29 Thread Alban Crequy
From: Alban Crequy al...@endocode.com

Some systems abusively restrict mknod, even when the device node already
exists in /dev. This is unfortunate because it prevents systemd-nspawn
from creating the basic devices in /dev in the container.

This patch implements a workaround: when mknod fails, fallback on bind
mounts.

Additionally, /dev/console was created with a mknod with the same
major/minor as /dev/null before bind mounting a pts on it. This patch
removes the mknod and creates an empty regular file instead.

In order to test this patch, I used the following configuration, which I
think should replicate the system with the abusive restriction on mknod:

  # grep devices /proc/self/cgroup
  4:devices:/user.slice/restrict
  # cat /sys/fs/cgroup/devices/user.slice/restrict/devices.list
  c 1:9 r
  c 5:2 rw
  c 136:* rw
  # systemd-nspawn --register=false -D .
---
 src/nspawn/nspawn.c | 27 ++-
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/src/nspawn/nspawn.c b/src/nspawn/nspawn.c
index 300b6df..09fff38 100644
--- a/src/nspawn/nspawn.c
+++ b/src/nspawn/nspawn.c
@@ -1449,8 +1449,17 @@ static int copy_devnodes(const char *dest) {
 return -r;
 }
 
-if (mknod(to, st.st_mode, st.st_rdev)  0)
-return log_error_errno(errno, mknod(%s) 
failed: %m, to);
+if (mknod(to, st.st_mode, st.st_rdev)  0) {
+if (errno != EPERM)
+return log_error_errno(errno, 
mknod(%s) failed: %m, to);
+
+/* Some systems abusively restrict mknod but
+ * allow bind mounts. */
+if (touch(to)  0)
+return log_error_errno(errno, touch 
(%s) failed: %m, to);
+if (mount(from, to, bind, MS_BIND, NULL)  0)
+return log_error_errno(errno, both 
mknod and bind mount (%s) failed: %m, to);
+}
 
 if (arg_userns  arg_uid_shift != UID_INVALID)
 if (lchown(to, arg_uid_shift, arg_uid_shift)  
0)
@@ -1481,7 +1490,6 @@ static int setup_ptmx(const char *dest) {
 static int setup_dev_console(const char *dest, const char *console) {
 _cleanup_umask_ mode_t u;
 const char *to;
-struct stat st;
 int r;
 
 assert(dest);
@@ -1489,24 +1497,17 @@ static int setup_dev_console(const char *dest, const 
char *console) {
 
 u = umask();
 
-if (stat(/dev/null, st)  0)
-return log_error_errno(errno, Failed to stat /dev/null: %m);
-
 r = chmod_and_chown(console, 0600, 0, 0);
 if (r  0)
 return log_error_errno(r, Failed to correct access mode for 
TTY: %m);
 
 /* We need to bind mount the right tty to /dev/console since
  * ptys can only exist on pts file systems. To have something
- * to bind mount things on we create a device node first, and
- * use /dev/null for that since we the cgroups device policy
- * allows us to create that freely, while we cannot create
- * /dev/console. (Note that the major minor doesn't actually
- * matter here, since we mount it over anyway). */
+ * to bind mount things on we create a empty regular file. */
 
 to = strjoina(dest, /dev/console);
-if (mknod(to, (st.st_mode  ~0) | 0600, st.st_rdev)  0)
-return log_error_errno(errno, mknod() for /dev/console 
failed: %m);
+if (touch(to)  0)
+return log_error_errno(errno, touch() for /dev/console 
failed: %m);
 
 if (mount(console, to, bind, MS_BIND, NULL)  0)
 return log_error_errno(errno, Bind mount for /dev/console 
failed: %m);
-- 
2.1.4

___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] [PATCH] nspawn: fallback on bind mount when mknod fails

2015-03-29 Thread Alban Crequy
On Sun, Mar 29, 2015 at 5:24 PM, Tom Gundersen t...@jklm.no wrote:

 On Mar 29, 2015 5:18 PM, Alban Crequy alban.cre...@gmail.com wrote:

 From: Alban Crequy al...@endocode.com

 Some systems abusively restrict mknod, even when the device node already
 exists in /dev. This is unfortunate because it prevents systemd-nspawn
 from creating the basic devices in /dev in the container.

 This patch implements a workaround: when mknod fails, fallback on bind
 mounts.

 Could we just always use bind mounts and avoid the two code paths?

It's possible but I avoided it in order not to add to many entries in
/proc/self/mounts and /proc/self/mountinfo in the normal case when
mknod is not restricted.

I don't have a strong opinion about this. If you think my concern
about the mount entries is less important than avoiding two code
paths, I can prepare another patch.

Alban

 Tom

 Additionally, /dev/console was created with a mknod with the same
 major/minor as /dev/null before bind mounting a pts on it. This patch
 removes the mknod and creates an empty regular file instead.

 In order to test this patch, I used the following configuration, which I
 think should replicate the system with the abusive restriction on mknod:

   # grep devices /proc/self/cgroup
   4:devices:/user.slice/restrict
   # cat /sys/fs/cgroup/devices/user.slice/restrict/devices.list
   c 1:9 r
   c 5:2 rw
   c 136:* rw
   # systemd-nspawn --register=false -D .
 ---
  src/nspawn/nspawn.c | 27 ++-
  1 file changed, 14 insertions(+), 13 deletions(-)

 diff --git a/src/nspawn/nspawn.c b/src/nspawn/nspawn.c
 index 300b6df..09fff38 100644
 --- a/src/nspawn/nspawn.c
 +++ b/src/nspawn/nspawn.c
 @@ -1449,8 +1449,17 @@ static int copy_devnodes(const char *dest) {
  return -r;
  }

 -if (mknod(to, st.st_mode, st.st_rdev)  0)
 -return log_error_errno(errno, mknod(%s)
 failed: %m, to);
 +if (mknod(to, st.st_mode, st.st_rdev)  0) {
 +if (errno != EPERM)
 +return log_error_errno(errno,
 mknod(%s) failed: %m, to);
 +
 +/* Some systems abusively restrict mknod
 but
 + * allow bind mounts. */
 +if (touch(to)  0)
 +return log_error_errno(errno,
 touch (%s) failed: %m, to);
 +if (mount(from, to, bind, MS_BIND,
 NULL)  0)
 +return log_error_errno(errno,
 both mknod and bind mount (%s) failed: %m, to);
 +}

  if (arg_userns  arg_uid_shift != UID_INVALID)
  if (lchown(to, arg_uid_shift,
 arg_uid_shift)  0)
 @@ -1481,7 +1490,6 @@ static int setup_ptmx(const char *dest) {
  static int setup_dev_console(const char *dest, const char *console) {
  _cleanup_umask_ mode_t u;
  const char *to;
 -struct stat st;
  int r;

  assert(dest);
 @@ -1489,24 +1497,17 @@ static int setup_dev_console(const char *dest,
 const char *console) {

  u = umask();

 -if (stat(/dev/null, st)  0)
 -return log_error_errno(errno, Failed to stat /dev/null:
 %m);
 -
  r = chmod_and_chown(console, 0600, 0, 0);
  if (r  0)
  return log_error_errno(r, Failed to correct access mode
 for TTY: %m);

  /* We need to bind mount the right tty to /dev/console since
   * ptys can only exist on pts file systems. To have something
 - * to bind mount things on we create a device node first, and
 - * use /dev/null for that since we the cgroups device policy
 - * allows us to create that freely, while we cannot create
 - * /dev/console. (Note that the major minor doesn't actually
 - * matter here, since we mount it over anyway). */
 + * to bind mount things on we create a empty regular file. */

  to = strjoina(dest, /dev/console);
 -if (mknod(to, (st.st_mode  ~0) | 0600, st.st_rdev)  0)
 -return log_error_errno(errno, mknod() for /dev/console
 failed: %m);
 +if (touch(to)  0)
 +return log_error_errno(errno, touch() for /dev/console
 failed: %m);

  if (mount(console, to, bind, MS_BIND, NULL)  0)
  return log_error_errno(errno, Bind mount for
 /dev/console failed: %m);
 --
 2.1.4

 ___
 systemd-devel mailing list
 systemd-devel@lists.freedesktop.org
 http://lists.freedesktop.org/mailman/listinfo/systemd-devel


 ___
 systemd-devel mailing list
 systemd-devel@lists.freedesktop.org
 http://lists.freedesktop.org/mailman/listinfo/systemd-devel

Re: [systemd-devel] Appc support in systemd-importd

2015-03-17 Thread Alban Crequy
On Tue, Mar 17, 2015 at 6:12 PM, Vincent Batts vba...@redhat.com wrote:
 On 11/03/15 15:24 +0100, Iago L?pez Galeiras wrote:

 Hi,

 We're looking into adding appc[1] support in systemd-importd. An appc
 image (ACI) is just a tar with a rootfs directory and a json manifest.
 We would have to implement image discovery and simply extract the rootfs.
 Some questions:

 * I see that the dkr implementation ignores the information in the docker
 manifest (resource restrictions, binary to exec...). Is that by design or
 just not implemented yet? Should we do the same with appc (*only* extract
 the rootfs)?

 * Alban liked the idea of saving the manifest so we can extract the whole
 ACI and modify nspawn to detect a container is an ACI and set
 /var/lib/machines/appc-image.aci/rootfs as the container's root. The benefit
 of keeping the manifest would be knowing which binary to start. Is that
 acceptable?

 [1]: https://github.com/appc/spec/blob/master/SPEC.md


 Is there a place that you are working on this logic? I'm will
 participate. My simple stub is
 https://github.com/vbatts/systemd/commit/6067a9f484af4cfeb9c4b01965edcd50ff0cffc3
 rebased on master.

Iago and I have a couple of branches on:
https://github.com/endocode/systemd/branches
- iaguis/nspawn-aci for the code in systemd-nspawn
- alban/aci for the rest of the code

It's not ready for submitting upstream (there are hardcoded URLs to
etcd...) but it could download the aci, and nspawn could run it.

I am albanc on IRC Freenode #systemd. Working together on this would
be great :)

Cheers,
Alban
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


[systemd-devel] [PATCH] [PATCH v2] util: add rename_noreplace

2015-03-12 Thread Alban Crequy
From: Alban Crequy al...@endocode.com

renameat2() exists since Linux 3.15 but btrfs support for the flag
RENAME_NOREPLACE was added later.

This patch implements a fallback when renameat2() returns EINVAL.
EINVAL is the error returned when the filesystem does not support one of
the flags.
---
 src/import/import-raw.c|  2 +-
 src/import/import-tar.c|  2 +-
 src/import/pull-raw.c  |  2 +-
 src/import/pull-tar.c  |  2 +-
 src/shared/copy.c  |  7 ++-
 src/shared/machine-image.c |  2 +-
 src/shared/machine-pool.c  |  2 +-
 src/shared/util.c  | 32 
 src/shared/util.h  |  2 ++
 9 files changed, 46 insertions(+), 7 deletions(-)

diff --git a/src/import/import-raw.c b/src/import/import-raw.c
index 15e5eb2..e354023 100644
--- a/src/import/import-raw.c
+++ b/src/import/import-raw.c
@@ -245,7 +245,7 @@ static int raw_import_finish(RawImport *i) {
 (void) rm_rf_dangerous(i-final_path, false, true, false);
 }
 
-if (renameat2(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path, 
RENAME_NOREPLACE)  0)
+if (rename_noreplace(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path) 
 0)
 return log_error_errno(errno, Failed to move image into 
place: %m);
 
 free(i-temp_path);
diff --git a/src/import/import-tar.c b/src/import/import-tar.c
index d5b6dad..dcbe721 100644
--- a/src/import/import-tar.c
+++ b/src/import/import-tar.c
@@ -201,7 +201,7 @@ static int tar_import_finish(TarImport *i) {
 (void) rm_rf_dangerous(i-final_path, false, true, false);
 }
 
-if (renameat2(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path, 
RENAME_NOREPLACE)  0)
+if (rename_noreplace(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path) 
 0)
 return log_error_errno(errno, Failed to move image into 
place: %m);
 
 free(i-temp_path);
diff --git a/src/import/pull-raw.c b/src/import/pull-raw.c
index d1d77d5..d4b17e7 100644
--- a/src/import/pull-raw.c
+++ b/src/import/pull-raw.c
@@ -384,7 +384,7 @@ static void raw_pull_job_on_finished(PullJob *j) {
 if (r  0)
 goto finish;
 
-r = renameat2(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path, 
RENAME_NOREPLACE);
+r = rename_noreplace(AT_FDCWD, i-temp_path, AT_FDCWD, 
i-final_path);
 if (r  0) {
 r = log_error_errno(errno, Failed to move RAW file 
into place: %m);
 goto finish;
diff --git a/src/import/pull-tar.c b/src/import/pull-tar.c
index 504642f..69df7af 100644
--- a/src/import/pull-tar.c
+++ b/src/import/pull-tar.c
@@ -281,7 +281,7 @@ static void tar_pull_job_on_finished(PullJob *j) {
 if (r  0)
 goto finish;
 
-if (renameat2(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path, 
RENAME_NOREPLACE)  0) {
+if (rename_noreplace(AT_FDCWD, i-temp_path, AT_FDCWD, 
i-final_path)  0) {
 r = log_error_errno(errno, Failed to rename to final 
image name: %m);
 goto finish;
 }
diff --git a/src/shared/copy.c b/src/shared/copy.c
index 0239a58..e578d03 100644
--- a/src/shared/copy.c
+++ b/src/shared/copy.c
@@ -404,7 +404,12 @@ int copy_file_atomic(const char *from, const char *to, 
mode_t mode, bool replace
 if (r  0)
 return r;
 
-if (renameat2(AT_FDCWD, t, AT_FDCWD, to, replace ? 0 : 
RENAME_NOREPLACE)  0) {
+if (replace) {
+r = renameat(AT_FDCWD, t, AT_FDCWD, to);
+} else {
+r = rename_noreplace(AT_FDCWD, t, AT_FDCWD, to);
+}
+if (r  0) {
 unlink_noerrno(t);
 return -errno;
 }
diff --git a/src/shared/machine-image.c b/src/shared/machine-image.c
index c6d2850..ad5bff4 100644
--- a/src/shared/machine-image.c
+++ b/src/shared/machine-image.c
@@ -440,7 +440,7 @@ int image_rename(Image *i, const char *new_name) {
 if (!nn)
 return -ENOMEM;
 
-if (renameat2(AT_FDCWD, i-path, AT_FDCWD, new_path, RENAME_NOREPLACE) 
 0)
+if (rename_noreplace(AT_FDCWD, i-path, AT_FDCWD, new_path)  0)
 return -errno;
 
 /* Restore the immutable bit, if it was set before */
diff --git a/src/shared/machine-pool.c b/src/shared/machine-pool.c
index e7671a3..7e902a3 100644
--- a/src/shared/machine-pool.c
+++ b/src/shared/machine-pool.c
@@ -140,7 +140,7 @@ static int setup_machine_raw(uint64_t size, sd_bus_error 
*error) {
 goto fail;
 }
 
-if (renameat2(AT_FDCWD, tmp, AT_FDCWD, /var/lib/machines.raw, 
RENAME_NOREPLACE)  0) {
+if (rename_noreplace(AT_FDCWD, tmp, AT_FDCWD, /var/lib/machines.raw) 
 0) {
 r = sd_bus_error_set_errnof(error, errno, Failed to move 
/var/lib/machines.raw into place: %m);
 goto fail;
 }
diff --git a/src/shared

[systemd-devel] [PATCH] [PATCH v3] util: add rename_noreplace

2015-03-10 Thread Alban Crequy
From: Alban Crequy al...@endocode.com

renameat2() exists since Linux 3.15 but btrfs support for the flag
RENAME_NOREPLACE was added later.

This patch implements a fallback when renameat2() returns EINVAL.
EINVAL is the error returned when the filesystem does not support one of
the flags.
---
 src/import/import-raw.c|  5 +++--
 src/import/import-tar.c|  5 +++--
 src/import/pull-raw.c  |  4 ++--
 src/import/pull-tar.c  |  5 +++--
 src/shared/copy.c  | 13 ++---
 src/shared/machine-image.c |  5 +++--
 src/shared/machine-pool.c  |  5 +++--
 src/shared/util.c  | 41 +
 src/shared/util.h  |  2 ++
 9 files changed, 70 insertions(+), 15 deletions(-)

diff --git a/src/import/import-raw.c b/src/import/import-raw.c
index 15e5eb2..c53b0e4 100644
--- a/src/import/import-raw.c
+++ b/src/import/import-raw.c
@@ -245,8 +245,9 @@ static int raw_import_finish(RawImport *i) {
 (void) rm_rf_dangerous(i-final_path, false, true, false);
 }
 
-if (renameat2(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path, 
RENAME_NOREPLACE)  0)
-return log_error_errno(errno, Failed to move image into 
place: %m);
+r = rename_noreplace(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path);
+if (r  0)
+return log_error_errno(r, Failed to move image into place: 
%m);
 
 free(i-temp_path);
 i-temp_path = NULL;
diff --git a/src/import/import-tar.c b/src/import/import-tar.c
index d5b6dad..65262d8 100644
--- a/src/import/import-tar.c
+++ b/src/import/import-tar.c
@@ -201,8 +201,9 @@ static int tar_import_finish(TarImport *i) {
 (void) rm_rf_dangerous(i-final_path, false, true, false);
 }
 
-if (renameat2(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path, 
RENAME_NOREPLACE)  0)
-return log_error_errno(errno, Failed to move image into 
place: %m);
+r = rename_noreplace(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path);
+if (r  0)
+return log_error_errno(r, Failed to move image into place: 
%m);
 
 free(i-temp_path);
 i-temp_path = NULL;
diff --git a/src/import/pull-raw.c b/src/import/pull-raw.c
index d1d77d5..988fc59 100644
--- a/src/import/pull-raw.c
+++ b/src/import/pull-raw.c
@@ -384,9 +384,9 @@ static void raw_pull_job_on_finished(PullJob *j) {
 if (r  0)
 goto finish;
 
-r = renameat2(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path, 
RENAME_NOREPLACE);
+r = rename_noreplace(AT_FDCWD, i-temp_path, AT_FDCWD, 
i-final_path);
 if (r  0) {
-r = log_error_errno(errno, Failed to move RAW file 
into place: %m);
+r = log_error_errno(r, Failed to move RAW file into 
place: %m);
 goto finish;
 }
 
diff --git a/src/import/pull-tar.c b/src/import/pull-tar.c
index 504642f..84ad571 100644
--- a/src/import/pull-tar.c
+++ b/src/import/pull-tar.c
@@ -281,8 +281,9 @@ static void tar_pull_job_on_finished(PullJob *j) {
 if (r  0)
 goto finish;
 
-if (renameat2(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path, 
RENAME_NOREPLACE)  0) {
-r = log_error_errno(errno, Failed to rename to final 
image name: %m);
+r = rename_noreplace(AT_FDCWD, i-temp_path, AT_FDCWD, 
i-final_path);
+if (r  0) {
+r = log_error_errno(r, Failed to rename to final 
image name: %m);
 goto finish;
 }
 
diff --git a/src/shared/copy.c b/src/shared/copy.c
index 0239a58..edbbb3c 100644
--- a/src/shared/copy.c
+++ b/src/shared/copy.c
@@ -404,9 +404,16 @@ int copy_file_atomic(const char *from, const char *to, 
mode_t mode, bool replace
 if (r  0)
 return r;
 
-if (renameat2(AT_FDCWD, t, AT_FDCWD, to, replace ? 0 : 
RENAME_NOREPLACE)  0) {
-unlink_noerrno(t);
-return -errno;
+if (replace) {
+r = renameat(AT_FDCWD, t, AT_FDCWD, to);
+if (r  0)
+r = -errno;
+} else {
+r = rename_noreplace(AT_FDCWD, t, AT_FDCWD, to);
+}
+if (r  0) {
+(void) unlink_noerrno(t);
+return r;
 }
 
 return 0;
diff --git a/src/shared/machine-image.c b/src/shared/machine-image.c
index c6d2850..440a569 100644
--- a/src/shared/machine-image.c
+++ b/src/shared/machine-image.c
@@ -440,8 +440,9 @@ int image_rename(Image *i, const char *new_name) {
 if (!nn)
 return -ENOMEM;
 
-if (renameat2(AT_FDCWD, i-path, AT_FDCWD, new_path, RENAME_NOREPLACE) 
 0)
-return -errno;
+r = rename_noreplace(AT_FDCWD, i-path, AT_FDCWD, new_path);
+if (r  0)
+return r

[systemd-devel] [PATCH] util: add rename_noreplace

2015-03-10 Thread Alban Crequy
From: Alban Crequy al...@endocode.com

renameat2() exists since Linux 3.15 but btrfs support for the flag
RENAME_NOREPLACE was added later.

This patch implements a fallback when renameat2() returns EINVAL.
EINVAL is the error returned when the filesystem does not support one of
the flags.
---
 src/import/import-raw.c|  2 +-
 src/import/import-tar.c|  2 +-
 src/import/pull-raw.c  |  2 +-
 src/import/pull-tar.c  |  2 +-
 src/shared/copy.c  | 13 ++---
 src/shared/machine-image.c |  2 +-
 src/shared/machine-pool.c  |  2 +-
 src/shared/util.c  | 34 ++
 src/shared/util.h  |  2 ++
 9 files changed, 52 insertions(+), 9 deletions(-)

diff --git a/src/import/import-raw.c b/src/import/import-raw.c
index 15e5eb2..e354023 100644
--- a/src/import/import-raw.c
+++ b/src/import/import-raw.c
@@ -245,7 +245,7 @@ static int raw_import_finish(RawImport *i) {
 (void) rm_rf_dangerous(i-final_path, false, true, false);
 }
 
-if (renameat2(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path, 
RENAME_NOREPLACE)  0)
+if (rename_noreplace(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path) 
 0)
 return log_error_errno(errno, Failed to move image into 
place: %m);
 
 free(i-temp_path);
diff --git a/src/import/import-tar.c b/src/import/import-tar.c
index d5b6dad..dcbe721 100644
--- a/src/import/import-tar.c
+++ b/src/import/import-tar.c
@@ -201,7 +201,7 @@ static int tar_import_finish(TarImport *i) {
 (void) rm_rf_dangerous(i-final_path, false, true, false);
 }
 
-if (renameat2(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path, 
RENAME_NOREPLACE)  0)
+if (rename_noreplace(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path) 
 0)
 return log_error_errno(errno, Failed to move image into 
place: %m);
 
 free(i-temp_path);
diff --git a/src/import/pull-raw.c b/src/import/pull-raw.c
index d1d77d5..d4b17e7 100644
--- a/src/import/pull-raw.c
+++ b/src/import/pull-raw.c
@@ -384,7 +384,7 @@ static void raw_pull_job_on_finished(PullJob *j) {
 if (r  0)
 goto finish;
 
-r = renameat2(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path, 
RENAME_NOREPLACE);
+r = rename_noreplace(AT_FDCWD, i-temp_path, AT_FDCWD, 
i-final_path);
 if (r  0) {
 r = log_error_errno(errno, Failed to move RAW file 
into place: %m);
 goto finish;
diff --git a/src/import/pull-tar.c b/src/import/pull-tar.c
index 504642f..69df7af 100644
--- a/src/import/pull-tar.c
+++ b/src/import/pull-tar.c
@@ -281,7 +281,7 @@ static void tar_pull_job_on_finished(PullJob *j) {
 if (r  0)
 goto finish;
 
-if (renameat2(AT_FDCWD, i-temp_path, AT_FDCWD, i-final_path, 
RENAME_NOREPLACE)  0) {
+if (rename_noreplace(AT_FDCWD, i-temp_path, AT_FDCWD, 
i-final_path)  0) {
 r = log_error_errno(errno, Failed to rename to final 
image name: %m);
 goto finish;
 }
diff --git a/src/shared/copy.c b/src/shared/copy.c
index 0239a58..458a654 100644
--- a/src/shared/copy.c
+++ b/src/shared/copy.c
@@ -404,9 +404,16 @@ int copy_file_atomic(const char *from, const char *to, 
mode_t mode, bool replace
 if (r  0)
 return r;
 
-if (renameat2(AT_FDCWD, t, AT_FDCWD, to, replace ? 0 : 
RENAME_NOREPLACE)  0) {
-unlink_noerrno(t);
-return -errno;
+if (replace) {
+if (renameat(AT_FDCWD, t, AT_FDCWD, to)  0) {
+unlink_noerrno(t);
+return -errno;
+}
+} else {
+if (rename_noreplace(AT_FDCWD, t, AT_FDCWD, to)  0) {
+unlink_noerrno(t);
+return -errno;
+}
 }
 
 return 0;
diff --git a/src/shared/machine-image.c b/src/shared/machine-image.c
index c6d2850..ad5bff4 100644
--- a/src/shared/machine-image.c
+++ b/src/shared/machine-image.c
@@ -440,7 +440,7 @@ int image_rename(Image *i, const char *new_name) {
 if (!nn)
 return -ENOMEM;
 
-if (renameat2(AT_FDCWD, i-path, AT_FDCWD, new_path, RENAME_NOREPLACE) 
 0)
+if (rename_noreplace(AT_FDCWD, i-path, AT_FDCWD, new_path)  0)
 return -errno;
 
 /* Restore the immutable bit, if it was set before */
diff --git a/src/shared/machine-pool.c b/src/shared/machine-pool.c
index e7671a3..7e902a3 100644
--- a/src/shared/machine-pool.c
+++ b/src/shared/machine-pool.c
@@ -140,7 +140,7 @@ static int setup_machine_raw(uint64_t size, sd_bus_error 
*error) {
 goto fail;
 }
 
-if (renameat2(AT_FDCWD, tmp, AT_FDCWD, /var/lib/machines.raw, 
RENAME_NOREPLACE)  0) {
+if (rename_noreplace(AT_FDCWD, tmp

Re: [systemd-devel] Notification socket and chroot vs PrivateNetwork conflict (abstract vs file-system)

2015-03-06 Thread Alban Crequy
On 9 December 2014 at 17:28, Lennart Poettering lenn...@poettering.net wrote:
 On Tue, 09.12.14 16:24, Krzysztof Kotlenga (k.kotle...@sims.pl) wrote:

 Hi.

 Currently notify socket is unavailable in chrooted services (again)
 unless you bind mount it there. Is there perhaps another, less
 cumbersome way?

 So far notify socket was:
 1. abstract socket

commit 8c47c7325fa1ab72febf807f8831ff24c75fbf45
notify: add minimal readiness/status protocol for spawned daemons

 2. file-system socket

commit 91b22f21f3824c1766d34f622c5bbb70cbe881a8
core: move abstract namespace sockets to /dev/.run

Now that we have /dev/.run there's no need to use abstract
namespace sockets. So, let's move things to /dev/.run, to make
things more easily discoverable and improve compat with chroot()
and fs namespacing.

 3. abstract socket again

commit 29252e9e5bad3b0bcfc45d9bc761aee4b0ece1da
manager: turn notify socket into abstract namespace socket again

sd_notify() should work for daemons that chroot() as part of their
initilization, hence it's a good idea to use an abstract namespace
socket which is not affected by chroot.

 4. file-system socket again

commit 7181dbdb2e3112858d62bdaea4f0ad2ed685ccba
core: move notify sockets to /run and $XDG_RUNTIME_DIR

A service with PrivateNetwork= cannot access abstract namespace
sockets of the host anymore, hence let's better not use abstract
namespace sockets for this, since we want to make sure that
PrivateNetwork= is useful and doesn't break sd_notify().


 So... would it be acceptable to have two notify sockets, one abstract
 and one normal, the latter only set for services with PrivateNetwork
 or - better maybe - explicitly selectable? Any other ideas?

 Hmm, but what would you do for a service that has both PrivateNetwork
 and chroot enabled?

 I am all open for shifting things around again, but I inda would
 prefer a solution that works universally in the end...

 Ideas?

 I figure we could open a new mount namespace and mount the file system
 socket into the chroot, but not sure I like the idea...

Maybe that's the way to do it... but where would you bind mount the
socket file? in $CHROOT/tmp which should be writeable when
PrivateTmp=true? Of course it will not work if the daemon is doing the
chroot itself instead of relying on systemd's RootDirectory.

The same problem exists even without using
PrivateNetwork/RootDirectory when the service starts a container with
nspawn --private-network and the program inside the container wants
to notify when it's ready. This has the same root cause: the service
runs in a new mount/chroot and a new network namespace.

There is also the additional problem that the program inside the
container runs in a different cgroup (/system.slice/docker-... for
docker containers, or /machine.slice... for nspawn containers).

There is the tool sdnotify-proxy to proxy the notify socket from
systemd to a socket file which can be bind mounted in the container.
sdnotify-proxy works, but I would like to know if someone finds a
better way for containers :)

Alban
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Notification socket and chroot vs PrivateNetwork conflict (abstract vs file-system)

2015-03-05 Thread Alban Crequy
On 9 December 2014 at 17:28, Lennart Poettering lenn...@poettering.net wrote:
 On Tue, 09.12.14 16:24, Krzysztof Kotlenga (k.kotle...@sims.pl) wrote:

 Hi.

 Currently notify socket is unavailable in chrooted services (again)
 unless you bind mount it there. Is there perhaps another, less
 cumbersome way?

 So far notify socket was:
 1. abstract socket

commit 8c47c7325fa1ab72febf807f8831ff24c75fbf45
notify: add minimal readiness/status protocol for spawned daemons

 2. file-system socket

commit 91b22f21f3824c1766d34f622c5bbb70cbe881a8
core: move abstract namespace sockets to /dev/.run

Now that we have /dev/.run there's no need to use abstract
namespace sockets. So, let's move things to /dev/.run, to make
things more easily discoverable and improve compat with chroot()
and fs namespacing.

 3. abstract socket again

commit 29252e9e5bad3b0bcfc45d9bc761aee4b0ece1da
manager: turn notify socket into abstract namespace socket again

sd_notify() should work for daemons that chroot() as part of their
initilization, hence it's a good idea to use an abstract namespace
socket which is not affected by chroot.

 4. file-system socket again

commit 7181dbdb2e3112858d62bdaea4f0ad2ed685ccba
core: move notify sockets to /run and $XDG_RUNTIME_DIR

A service with PrivateNetwork= cannot access abstract namespace
sockets of the host anymore, hence let's better not use abstract
namespace sockets for this, since we want to make sure that
PrivateNetwork= is useful and doesn't break sd_notify().


 So... would it be acceptable to have two notify sockets, one abstract
 and one normal, the latter only set for services with PrivateNetwork
 or - better maybe - explicitly selectable? Any other ideas?

 Hmm, but what would you do for a service that has both PrivateNetwork
 and chroot enabled?

 I am all open for shifting things around again, but I inda would
 prefer a solution that works universally in the end...

 Ideas?

 I figure we could open a new mount namespace and mount the file system
 socket into the chroot, but not sure I like the idea...

I don't know what is the best solution but using a socket file seems
better than using an abstract unix socket: processes in a
systemd-nspawn container could discover the notify socket of the host
in /proc/net/unix (if it does not use a new network namespace) and
send garbage file descriptors with SCM_RIGHTS from the container to
the host. Systemd on the host does the right thing: it closes the fds
when the datagram was not sent by a managed unit. Does
manager_get_unit_by_pid() matches the exact cgroup path only or does
it also matches a prefix path? I wonder about nspawn containers
started by a service unit on the host.
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


[systemd-devel] [PATCH] mount: create mount point correctly in case of bind mount

2015-02-20 Thread Alban Crequy
From: Alban Crequy al...@endocode.com

Manpage systemd.mount(5) says: If the mount point does not exist at the
time of mounting, it is created. However, it was not working for bind
mounts of non-directory files (regular, device, socket, etc).

This patch checks the type of the resource to bind mount and does a
touch instead of a mkdir when necessary.
---
 src/core/mount.c | 32 
 1 file changed, 28 insertions(+), 4 deletions(-)

diff --git a/src/core/mount.c b/src/core/mount.c
index f3977e6..5a373ba 100644
--- a/src/core/mount.c
+++ b/src/core/mount.c
@@ -904,19 +904,43 @@ fail:
 static void mount_enter_mounting(Mount *m) {
 int r;
 MountParameters *p;
+bool is_bind;
+bool is_file_bind;
 
 assert(m);
 
 m-control_command_id = MOUNT_EXEC_MOUNT;
 m-control_command = m-exec_command + MOUNT_EXEC_MOUNT;
 
-mkdir_p_label(m-where, m-directory_mode);
+p = get_mount_parameters_fragment(m);
+
+/* Bind-mounts: A directory can be bind-mounted only on directories. A
+ * non-directory file (regular, device, socket etc.) can be bind
+ * mounted on any non-directory file (even of a different kind).
+ *
+ * Non-bind mounts: they are only mounted on directories.
+ */
+
+is_bind = p  mount_is_bind(p);
+is_file_bind = false;
+if (is_bind) {
+struct stat st;
+
+/* both stat(2) and mount(8) follow symlinks, so it's fine */
+r = stat(p-what, st);
+if (r == 0  !S_ISDIR(st.st_mode))
+is_file_bind = true;
+}
 
-warn_if_dir_nonempty(m-meta.id, m-where);
+if (is_file_bind) {
+touch_file(m-where, true, USEC_INFINITY, 0, 0, 
m-directory_mode);
+} else {
+mkdir_p_label(m-where, m-directory_mode);
+warn_if_dir_nonempty(m-meta.id, m-where);
+}
 
 /* Create the source directory for bind-mounts if needed */
-p = get_mount_parameters_fragment(m);
-if (p  mount_is_bind(p))
+if (is_bind)
 mkdir_p_label(p-what, m-directory_mode);
 
 r = fail_if_symlink(m-meta.id, m-where);
-- 
2.1.4

___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


[systemd-devel] [PATCH] PrivateDevices: fix /dev mount when a service is chrooted

2015-02-19 Thread Alban Crequy
When a service uses both RootDirectory=/opt/... and PrivateDevices=true,
the private /dev must not be mounted in /dev but in /opt/.../dev.
---
 src/core/execute.c   |  6 +-
 src/core/namespace.c | 10 +-
 src/core/namespace.h |  2 +-
 src/test/test-ns.c   |  2 +-
 4 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/src/core/execute.c b/src/core/execute.c
index 1815e3d..69b7990 100644
--- a/src/core/execute.c
+++ b/src/core/execute.c
@@ -1571,6 +1571,7 @@ static int exec_child(
 context-protect_home != PROTECT_HOME_NO) {
 
 char *tmp = NULL, *var = NULL;
+char private_dev_dir[512];
 
 /* The runtime struct only contains the parent
  * of the private /tmp, which is
@@ -1585,6 +1586,9 @@ static int exec_child(
 var = strjoina(runtime-var_tmp_dir, /tmp);
 }
 
+if (params-apply_chroot  context-root_directory)
+snprintf(private_dev_dir, sizeof(private_dev_dir) - 1, 
%s/dev, context-root_directory);
+
 r = setup_namespace(
 context-read_write_dirs,
 context-read_only_dirs,
@@ -1592,7 +1596,7 @@ static int exec_child(
 tmp,
 var,
 params-bus_endpoint_path,
-context-private_devices,
+params-apply_chroot  
context-private_devices ? private_dev_dir : NULL,
 context-protect_home,
 context-protect_system,
 context-mount_flags);
diff --git a/src/core/namespace.c b/src/core/namespace.c
index 4fecd32..7a6561a 100644
--- a/src/core/namespace.c
+++ b/src/core/namespace.c
@@ -234,7 +234,7 @@ static int mount_dev(BindMount *m) {
 
 dev_setup(temporary_mount);
 
-if (mount(dev, /dev/, NULL, MS_MOVE, NULL)  0) {
+if (mount(dev, m-path, NULL, MS_MOVE, NULL)  0) {
 r = -errno;
 goto fail;
 }
@@ -419,7 +419,7 @@ int setup_namespace(
 const char* tmp_dir,
 const char* var_tmp_dir,
 const char* bus_endpoint_path,
-bool private_dev,
+const char* private_dev_dir,
 ProtectHome protect_home,
 ProtectSystem protect_system,
 unsigned long mount_flags) {
@@ -438,7 +438,7 @@ int setup_namespace(
 strv_length(read_write_dirs) +
 strv_length(read_only_dirs) +
 strv_length(inaccessible_dirs) +
-private_dev +
+!!private_dev_dir +
 (protect_home != PROTECT_HOME_NO ? 3 : 0) +
 (protect_system != PROTECT_SYSTEM_NO ? 2 : 0) +
 (protect_system == PROTECT_SYSTEM_FULL ? 1 : 0);
@@ -469,8 +469,8 @@ int setup_namespace(
 m++;
 }
 
-if (private_dev) {
-m-path = /dev;
+if (private_dev_dir) {
+m-path = private_dev_dir;
 m-mode = PRIVATE_DEV;
 m++;
 }
diff --git a/src/core/namespace.h b/src/core/namespace.h
index 42b92e7..4b06802 100644
--- a/src/core/namespace.h
+++ b/src/core/namespace.h
@@ -47,7 +47,7 @@ int setup_namespace(char **read_write_dirs,
 const char *tmp_dir,
 const char *var_tmp_dir,
 const char *endpoint_path,
-bool private_dev,
+const char *private_dev_dir,
 ProtectHome protect_home,
 ProtectSystem protect_system,
 unsigned long mount_flags);
diff --git a/src/test/test-ns.c b/src/test/test-ns.c
index 7cd7b77..d973d0d 100644
--- a/src/test/test-ns.c
+++ b/src/test/test-ns.c
@@ -60,7 +60,7 @@ int main(int argc, char *argv[]) {
 tmp_dir,
 var_tmp_dir,
 NULL,
-true,
+/dev,
 PROTECT_HOME_NO,
 PROTECT_SYSTEM_NO,
 0);
-- 
2.1.4

___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] systemd-nspawn create container under unprivileged user

2015-02-05 Thread Alban Crequy
On 5 February 2015 at 12:48, Vasiliy Tolstov v.tols...@selfip.ru wrote:

 2015-02-05 12:44 GMT+03:00 Alban Crequy alban.cre...@gmail.com:

 Manual page namespaces(7):

Creation of new namespaces using clone(2) and unshare(2) in most
 cases
requires the CAP_SYS_ADMIN capability.  User namespaces are the
exception: since  Linux 3.8, no privilege is required to create a
 user
namespace.


 So as i understand i can't create full featured container with network under
 non root user (and not have cap_sys_admin)

caps like CAP_SYS_ADMIN don't have an global meaning anymore but
refers to operations a process can do *in its current namespace*. An
unprivileged process (uid!=0, without cap_sys_admin) can join a user
namespace and get uid=0  cap_sys_admin for operations inside the user
namespace, but it will still have uid!=0  !cap_sys_admin for
operations in the parent user namespace.

user_namespaces(7) contains userns_child_exec.c and it creates a fully
featured container with network without being root. (I attached a
patched version I was testing)

# # Because I'm using the kernel patched by my distribution
# echo 1  /proc/sys/kernel/unprivileged_userns_clone

$ gcc -lcap -o userns_child_exec userns_child_exec.c

Here it seems to work:

alban@alban:~$ ls -l /tmp/userns_child_exec
-rwxr-xr-x 1 alban alban 14488 Feb  5 23:24 /tmp/userns_child_exec
alban@alban:~$ id -u
1000
alban@alban:~$ ip link # --- will show lo, eth0, wlan0...
alban@alban:~$ /tmp/userns_child_exec -p -m -U -M '0 1000 1' -G '0
1000 1' -n bash
About to exec bash
root@alban:~# id
uid=0(root) gid=0(root) groups=0(root),65534(nogroup)
root@alban:~# ip link # --- only lo visible in this namespace

Cheers,
Alban
--- userns_child_exec.orig.c	2015-02-05 23:20:19.208741366 +0100
+++ userns_child_exec.c	2015-01-30 17:01:56.948493001 +0100
@@ -108,6 +108,30 @@
 close(fd);
 }
 
+static void
+write_file(char *content, char *path)
+{
+int fd;
+size_t content_len;
+
+content_len = strlen(content);
+
+fd = open(path, O_RDWR);
+if (fd == -1) {
+fprintf(stderr, ERROR: open %s: %s\n, path,
+strerror(errno));
+exit(EXIT_FAILURE);
+}
+
+if (write(fd, content, content_len) != content_len) {
+fprintf(stderr, ERROR: write %s: %s\n, content,
+strerror(errno));
+exit(EXIT_FAILURE);
+}
+
+close(fd);
+}
+
 static int  /* Start function for cloned child */
 childFunc(void *arg)
 {
@@ -149,6 +173,7 @@
 const int MAP_BUF_SIZE = 100;
 char map_buf[MAP_BUF_SIZE];
 char map_path[PATH_MAX];
+char groups_path[PATH_MAX];
 
 /* Parse command-line options. The initial '+' character in
the final getopt() argument prevents GNU-style permutation
@@ -225,6 +250,11 @@
 update_map(uid_map, map_path);
 }
 if (gid_map != NULL || map_zero) {
+snprintf(groups_path, PATH_MAX, /proc/%ld/setgroups,
+(long) child_pid);
+write_file(deny\n, groups_path);
+}
+if (gid_map != NULL || map_zero) {
 snprintf(map_path, PATH_MAX, /proc/%ld/gid_map,
 (long) child_pid);
 if (map_zero) {
/* userns_child_exec.c

   Licensed under GNU General Public License v2 or later

   Create a child process that executes a shell command in new
   namespace(s); allow UID and GID mappings to be specified when
   creating a user namespace.
*/
#define _GNU_SOURCE
#include sched.h
#include unistd.h
#include stdlib.h
#include sys/wait.h
#include signal.h
#include fcntl.h
#include stdio.h
#include string.h
#include limits.h
#include errno.h

/* A simple error-handling function: print an error message based
   on the value in 'errno' and terminate the calling process */

#define errExit(msg)do { perror(msg); exit(EXIT_FAILURE); \
} while (0)

struct child_args {
char **argv;/* Command to be executed by child, with args */
intpipe_fd[2];  /* Pipe used to synchronize parent and child */
};

static int verbose;

static void
usage(char *pname)
{
fprintf(stderr, Usage: %s [options] cmd [arg...]\n\n, pname);
fprintf(stderr, Create a child process that executes a shell 
command in a new user namespace,\n
and possibly also other new namespace(s).\n\n);
fprintf(stderr, Options can be:\n\n);
#define fpe(str) fprintf(stderr, %s, str);
fpe(-i  New IPC namespace\n);
fpe(-m  New mount namespace\n);
fpe(-n  New network namespace\n);
fpe(-p  New PID namespace\n);
fpe(-u  New UTS namespace\n);
fpe(-U  New user namespace\n);
fpe(-M uid_map  Specify UID map for user namespace\n);
fpe(-G gid_map  Specify GID map for user namespace\n);
fpe(-z  Map user's UID and GID to 0 in user namespace\n);
fpe((equivalent to: -M '0 uid 1' -G '0 gid 1')\n);
fpe(-v  Display verbose messages\n);
fpe(\n

Re: [systemd-devel] systemd-nspawn create container under unprivileged user

2015-02-05 Thread Alban Crequy
[reposting - sorry I forgot to Cc the mailing list]

On 4 February 2015 at 23:03, Vasiliy Tolstov v.tols...@selfip.ru wrote:
 Hello!
 Does it possible to create container as regular user? Oh what capabilities i
 need to add to create container not using root?

Hello,

Manual page namespaces(7):

   Creation of new namespaces using clone(2) and unshare(2) in most cases
   requires the CAP_SYS_ADMIN capability.  User namespaces are the
   exception: since  Linux 3.8, no privilege is required to create a user
   namespace.

systemd-nspawn uses: src/nspawn/nspawn.c:

pid = raw_clone(SIGCHLD|CLONE_NEWNS|
  (arg_share_system ? 0 : CLONE_NEWIPC|CLONE_NEWPID|CLONE_NEWUTS)|
  (arg_private_network ? CLONE_NEWNET : 0), NULL);

So you need to have CAP_SYS_ADMIN to use systemd-nspawn.


If you want to try user namespaces, it is something that is still
moving... Manual page user_namespaces(7):

   Starting  in  Linux  3.8,  unprivileged  processes  can create
   user namespaces, and mount, PID, IPC, network, and UTS
   namespaces can be created with just the CAP_SYS_ADMIN
   capability in the caller's user namespace.

But it is not true in most Linux distributions as they disable
unprivileged user namespaces and require CAP_SYS_ADMIN anyway. See for
example:
http://anonscm.debian.org/viewvc/kernel/dists/trunk/linux/debian/patches/debian/add-sysctl-to-disallow-unprivileged-CLONE_NEWUSER-by-default.patch?revision=20773view=markup
and: echo 1  /proc/sys/kernel/unprivileged_userns_clone

Additionally, the program userns_child_exec.c included in manual page
namespaces(7) does not work as is yet because since the changes
introduced by CVE-2014-8989, it needs to adjust /proc/pid/setgroups.
See:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=66d2f338ee4c449396b6f99f5e75cd18eb6df272

Cheers,
Alban
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


[systemd-devel] [PATCH] nspawn: allow bind-mounting char and block files

2015-01-22 Thread Alban Crequy
From: Alban Crequy al...@endocode.com

---
 src/nspawn/nspawn.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/src/nspawn/nspawn.c b/src/nspawn/nspawn.c
index 3fce3ad..db57b24 100644
--- a/src/nspawn/nspawn.c
+++ b/src/nspawn/nspawn.c
@@ -911,8 +911,7 @@ static int mount_binds(const char *dest, char **l, bool ro) 
{
 return -errno;
 }
 
-/* Create the mount point, but be conservative -- refuse to 
create block
- * and char devices. */
+/* Create the mount point */
 if (S_ISDIR(source_st.st_mode)) {
 r = mkdir_label(where, 0755);
 if (r  0  errno != EEXIST)
@@ -929,6 +928,10 @@ static int mount_binds(const char *dest, char **l, bool 
ro) {
 r = touch(where);
 if (r  0)
 return log_error_errno(r, Failed to create 
mount point %s: %m, where);
+} else if (S_ISCHR(source_st.st_mode) || 
!S_ISBLK(source_st.st_mode)) {
+r = mknod(where, source_st.st_mode, source_st.st_rdev) 
 0;
+if (r  0  errno != EEXIST)
+return log_error_errno(errno, Failed to 
create mount point %s: %m, where);
 } else {
 log_error(Refusing to create mountpoint for file: 
%s, *x);
 return -ENOTSUP;
-- 
2.1.4

___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


[systemd-devel] [PATCH v2] nspawn: allow bind-mounting char and block files

2015-01-22 Thread Alban Crequy
From: Alban Crequy al...@endocode.com

v2:
- simplify the patch: any non-directory file can be mounted on any
  non-directory file.
- allow bind mount of files of different types
---
 src/nspawn/nspawn.c | 27 +++
 1 file changed, 11 insertions(+), 16 deletions(-)

diff --git a/src/nspawn/nspawn.c b/src/nspawn/nspawn.c
index 3fce3ad..2736c0a 100644
--- a/src/nspawn/nspawn.c
+++ b/src/nspawn/nspawn.c
@@ -898,8 +898,12 @@ static int mount_binds(const char *dest, char **l, bool 
ro) {
 
 r = stat(where, dest_st);
 if (r == 0) {
-if ((source_st.st_mode  S_IFMT) != (dest_st.st_mode  
S_IFMT)) {
-log_error(The file types of %s and %s do not 
match. Refusing bind mount, *x, where);
+if (S_ISDIR(source_st.st_mode)  
!S_ISDIR(dest_st.st_mode)) {
+log_error(Cannot bind mount directory %s on 
file %s, *x, where);
+return -EINVAL;
+}
+if (!S_ISDIR(source_st.st_mode)  
S_ISDIR(dest_st.st_mode)) {
+log_error(Cannot bind mount file %s on 
directory %s, *x, where);
 return -EINVAL;
 }
 } else if (errno == ENOENT) {
@@ -911,27 +915,18 @@ static int mount_binds(const char *dest, char **l, bool 
ro) {
 return -errno;
 }
 
-/* Create the mount point, but be conservative -- refuse to 
create block
- * and char devices. */
+/* Create the mount point. Any non-directory file can be
+ * mounted on any non-directory file (regular, fifo, socket,
+ * char, block).
+ */
 if (S_ISDIR(source_st.st_mode)) {
 r = mkdir_label(where, 0755);
 if (r  0  errno != EEXIST)
 return log_error_errno(r, Failed to create 
mount point %s: %m, where);
-} else if (S_ISFIFO(source_st.st_mode)) {
-r = mkfifo(where, 0644);
-if (r  0  errno != EEXIST)
-return log_error_errno(errno, Failed to 
create mount point %s: %m, where);
-} else if (S_ISSOCK(source_st.st_mode)) {
-r = mknod(where, 0644 | S_IFSOCK, 0);
-if (r  0  errno != EEXIST)
-return log_error_errno(errno, Failed to 
create mount point %s: %m, where);
-} else if (S_ISREG(source_st.st_mode)) {
+} else {
 r = touch(where);
 if (r  0)
 return log_error_errno(r, Failed to create 
mount point %s: %m, where);
-} else {
-log_error(Refusing to create mountpoint for file: 
%s, *x);
-return -ENOTSUP;
 }
 
 if (mount(*x, where, bind, MS_BIND, NULL)  0)
-- 
2.1.4

___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] [PATCH] nspawn: allow bind-mounting char and block files

2015-01-22 Thread Alban Crequy
On 22 January 2015 at 13:51, Lennart Poettering lenn...@poettering.net wrote:
 On Thu, 22.01.15 13:25, Alban Crequy (mua...@gmail.com) wrote:

 From: Alban Crequy al...@endocode.com

 Hmm, I wonder if we can actually simplify this. IIRC the rules for
 over-mounting are simpler than I thought initially:

 a) dirs can only over-mount dirs

 b) everything else can over-mount everything else

 With that in mind I think we can collapse this code to only have two
 branches: one branch for the S_ISDIR() case, and another one that uses
 touch() for everything else.

 Anychance you can simplify the patch like this? The benefit would be
 that we can do without CAP_SYS_MKNOD for all of this. Also, your patch
 woud then shorten the code, while adding a feature, not make it
 longer!

The patch will be a bit longer because the file type checks in
mount_binds() need to be updated. Otherwise, the second attempt of
running nspawn would fail.

I will send the patch v2 shortly.

 ---
  src/nspawn/nspawn.c | 7 +--
  1 file changed, 5 insertions(+), 2 deletions(-)

 diff --git a/src/nspawn/nspawn.c b/src/nspawn/nspawn.c
 index 3fce3ad..db57b24 100644
 --- a/src/nspawn/nspawn.c
 +++ b/src/nspawn/nspawn.c
 @@ -911,8 +911,7 @@ static int mount_binds(const char *dest, char **l, bool 
 ro) {
  return -errno;
  }

 -/* Create the mount point, but be conservative -- refuse to 
 create block
 - * and char devices. */
 +/* Create the mount point */
  if (S_ISDIR(source_st.st_mode)) {
  r = mkdir_label(where, 0755);
  if (r  0  errno != EEXIST)
 @@ -929,6 +928,10 @@ static int mount_binds(const char *dest, char **l, bool 
 ro) {
  r = touch(where);
  if (r  0)
  return log_error_errno(r, Failed to create 
 mount point %s: %m, where);
 +} else if (S_ISCHR(source_st.st_mode) || 
 !S_ISBLK(source_st.st_mode)) {
 +r = mknod(where, source_st.st_mode, 
 source_st.st_rdev)  0;
 +if (r  0  errno != EEXIST)
 +return log_error_errno(errno, Failed to 
 create mount point %s: %m, where);
  } else {
  log_error(Refusing to create mountpoint for file: 
 %s, *x);
  return -ENOTSUP;
 --
 2.1.4

 ___
 systemd-devel mailing list
 systemd-devel@lists.freedesktop.org
 http://lists.freedesktop.org/mailman/listinfo/systemd-devel


 Lennart

 --
 Lennart Poettering, Red Hat
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


[systemd-devel] [PATCH] bus.h: add missing include to linux/kref.h

2014-08-14 Thread Alban Crequy
Symptoms:
 kdbus/bus.h:56:14: error: field ‘kref’ has incomplete type
   struct kref kref;

Signed-off-by: Alban Crequy alban.cre...@collabora.co.uk
---
 bus.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/bus.h b/bus.h
index a5832b8..c7ce2fa 100644
--- a/bus.h
+++ b/bus.h
@@ -16,6 +16,7 @@
 #include linux/hashtable.h
 #include linux/spinlock.h
 #include linux/idr.h
+#include linux/kref.h
 
 #include util.h
 
-- 
1.8.5.3

___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


[systemd-devel] [PATCH] [RFC] [WIP] [kdbus] Attempt to recursively pass fd

2014-08-14 Thread Alban Crequy
Before Linux commit 25888e (from 2.6.37-rc4, Nov 2010), fd-passing on Unix
sockets could recursively be stacked, allowing a process to exhaust the open
files limit (/proc/sys/fs/file-max) on the system without restriction from
ulimit -n.

This DoS on Unix sockets was fixed by commit:

 commit 25888e30319f8896fc656fc68643e6a078263060
 Author: Eric Dumazet eric.duma...@gmail.com
 Date:   Thu Nov 25 04:11:39 2010 +

 af_unix: limit recursion level

But that commit introduced a bug in dbus:
https://bugs.freedesktop.org/show_bug.cgi?id=80163

kdbus does not use fd-passing on Unix sockets so it is not affected by this.

However, it allows fd-passing similarly. This patch shows it is possible to
recursively pass file descriptors in kdbus and stack them without keeping them
attached to the initial process. I could stack passed fds 256 times, probably
because of the limit KDBUS_USER_MAX_CONN:

defaults.h:#define KDBUS_USER_MAX_CONN  256

But this limit could probably be overrided by using fds from different
endpoints.

I am also afraid that fds from Unix sockets and fds from kdbus could be stacked
together, defeating the limit implemented in unix_attach_fds() by commit 25888e
because the check uses unix_get_socket(), making the assumption that only Unix
sockets could carry file descriptors:
http://lxr.free-electrons.com/source/net/unix/af_unix.c#L1380
http://lxr.free-electrons.com/source/net/unix/garbage.c#L99
---
 test/test-kdbus.c | 72 +++
 1 file changed, 72 insertions(+)

diff --git a/test/test-kdbus.c b/test/test-kdbus.c
index f0bf705..3c79877 100644
--- a/test/test-kdbus.c
+++ b/test/test-kdbus.c
@@ -12,6 +12,8 @@
 #include limits.h
 #include sys/mman.h
 #include sys/ioctl.h
+#include sys/types.h
+#include sys/socket.h
 #include getopt.h
 #include stdbool.h
 
@@ -1004,6 +1006,75 @@ static int check_msg_basic(struct kdbus_check_env *env)
return CHECK_OK;
 }
 
+static int send_fds(struct conn *conn, uint64_t dst_id, int fds[2])
+{
+   struct kdbus_msg *msg;
+   struct kdbus_item *item;
+   uint64_t size;
+   int ret;
+
+   size = sizeof(struct kdbus_msg);
+   size += KDBUS_ITEM_SIZE(sizeof(int[2]));
+
+   msg = malloc(size);
+   ASSERT_RETURN (msg != NULL);
+
+   memset(msg, 0, size);
+   msg-size = size;
+   msg-src_id = conn-id;
+   msg-dst_id = dst_id;
+   msg-payload_type = KDBUS_PAYLOAD_DBUS;
+
+   item = msg-items;
+
+   item-type = KDBUS_ITEM_FDS;
+   item-size = KDBUS_ITEM_HEADER_SIZE + sizeof(int[2]);
+   item-fds[0] = fds[0];
+   item-fds[1] = fds[1];
+   item = KDBUS_ITEM_NEXT(item);
+
+   ret = ioctl(conn-fd, KDBUS_CMD_MSG_SEND, msg);
+   if (ret) {
+   fprintf(stderr, error sending message: %d err %d (%m)\n, ret, 
errno);
+   return EXIT_FAILURE;
+   }
+
+   free(msg);
+
+   return 0;
+}
+
+static int check_fds_passing(struct kdbus_check_env *env)
+{
+   struct conn *conn_src, *conn_dst;
+   int fds[2];
+   int ret;
+   int i;
+
+   /* create two connections */
+   conn_src = kdbus_hello(env-buspath, 0, NULL, 0);
+   ASSERT_RETURN(conn_src != NULL);
+
+   for (i = 0; i = 0; i++) {
+   conn_dst = kdbus_hello(env-buspath, 0, NULL, 0);
+   ASSERT_RETURN(conn_dst != NULL);
+
+   fds[0] = conn_src-fd;
+   fds[1] = conn_dst-fd;
+
+   ret = send_fds (conn_src, conn_dst-id, fds);
+   printf(check_fds_passing: iter: %d fds %d-%d ret %d err %d 
(%m)\n, i, fds[0], fds[1], ret, errno);
+   ASSERT_RETURN(ret == 0);
+
+   close(conn_src-fd);
+   free(conn_src);
+
+   conn_src = conn_dst;
+   }
+
+   return CHECK_OK;
+}
+
 static int check_msg_free(struct kdbus_check_env *env)
 {
int ret;
@@ -,6 +1182,7 @@ static const struct kdbus_check checks[] = {
{ name queue, check_name_queue,   
CHECK_CREATE_BUS | CHECK_CREATE_CONN},
{ message basic,  check_msg_basic,
CHECK_CREATE_BUS | CHECK_CREATE_CONN},
{ message free,   check_msg_free, 
CHECK_CREATE_BUS | CHECK_CREATE_CONN},
+   { fds passing,check_fds_passing,  
CHECK_CREATE_BUS},
{ connection info,check_conn_info,
CHECK_CREATE_BUS | CHECK_CREATE_CONN},
{ match id add,   check_match_id_add, 
CHECK_CREATE_BUS | CHECK_CREATE_CONN},
{ match id remove,check_match_id_remove,  
CHECK_CREATE_BUS | CHECK_CREATE_CONN},
-- 
1.8.5.3

___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] User sessions: limit the ability to migrate cgroups

2014-08-13 Thread Alban Crequy
On Wed, 13 Aug 2014 16:37:17 +0200
Lennart Poettering lenn...@poettering.net wrote:

 On Thu, 07.08.14 15:19, Alban Crequy (alban.cre...@collabora.co.uk)
 wrote:
 
  Hi,
  
  Should unprivileged processes be allowed to change cgroup?
 
 Well, they shouldn#t do it. But I think it's OK as long as this is
 only done within the specific user's hierarchies.
 
  As I understand it, it is not possible to block processes to
  leave a cgroup, but only to block processes to enter a cgroup.
 
 Correct.
 
  In the following example, session-c4.scope/tasks belongs to
  root:root with -rw-r--r-- and user@1000.service/tasks belongs to
  user:user with -rw-r--r--.
 
 Yes, this is because systemd --user needs to be able to manage its own
 cgroup subtree, so we have to open this up for the user@1000.service
 service, but keep it restricted otherwise...

It makes sense.

  So processes can freely move from session-c4.scope to
  user@1000.service. But not in the other direction.
 
 Correct.
  
  $ systemd-cgls
  Working Directory /sys/fs/cgroup/systemd/user.slice/user-1000.slice:
  ├─session-c4.scope
  │ ├─713 sshd: user [priv]  
  │ ├─722 sshd: user@pts/2   
  │ ├─723 -bash
  │ ├─732 systemd-cgls
  │ └─733 pager
  ├─user@1000.service
  │ ├─406 /lib/systemd/systemd --user
  
  With user sessions managed by systemd, will it be possible to
  restrict unprivileged users from migrating to other cgroups?
 
 Unlikely. Access control on Unix is generally bound to user IDs, not
 processes, and we really shouldn't start here with departing from
 that...

I tested SELinux and AppArmor to restrict /sys/fs/cgroup/.

SELinux didn't help because the cgroup file system does not support
extended attributes such as security.selinux, but AppArmor was able
to block an application from changing cgroup.

Alban
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


[systemd-devel] User sessions: limit the ability to migrate cgroups

2014-08-07 Thread Alban Crequy
Hi,

Should unprivileged processes be allowed to change cgroup?

As I understand it, it is not possible to block processes to
leave a cgroup, but only to block processes to enter a cgroup.

In the following example, session-c4.scope/tasks belongs to root:root
with -rw-r--r-- and user@1000.service/tasks belongs to user:user with
-rw-r--r--.

So processes can freely move from session-c4.scope to
user@1000.service. But not in the other direction.

$ systemd-cgls
Working Directory /sys/fs/cgroup/systemd/user.slice/user-1000.slice:
├─session-c4.scope
│ ├─713 sshd: user [priv]  
│ ├─722 sshd: user@pts/2   
│ ├─723 -bash
│ ├─732 systemd-cgls
│ └─733 pager
├─user@1000.service
│ ├─406 /lib/systemd/systemd --user

With user sessions managed by systemd, will it be possible to restrict
unprivileged users from migrating to other cgroups?

Best regards,
Alban
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel