Re: [lxc-devel] call to setup_dev_symlinks with lxc.autodev
On Fri, Apr 25, 2014 at 5:53 PM, Michael H. Warfield m...@wittsend.com wrote:

> Bingo! I guess my conjecture about it being a quirk in the kernel VFS must be pretty close. Ok... I'll submit a formal patch shortly.

Any news about this?

Thanks,
-- William

___ lxc-devel mailing list lxc-devel@lists.linuxcontainers.org http://lists.linuxcontainers.org/listinfo/lxc-devel
Re: [lxc-devel] [PATCH v2] add support for qcow2
On Wed, 14 May 2014 19:59:11 + Serge Hallyn <serge.hal...@ubuntu.com> wrote:

Quoting Dwight Engen (dwight.en...@oracle.com):
On Mon, 12 May 2014 18:02:28 + Serge Hallyn <serge.hal...@ubuntu.com> wrote:

qcow2 backing stores can be attached to a nbd block device using qemu-nbd. This user-space process (pair) stays around for the duration of the device attachment. Obviously we want it to go away when the container shuts down, but not before the filesystems have been cleanly unmounted. The device attachment is done from the task which will become the container monitor, before the container setup+init task is spawned. That task starts in a new pid namespace to ensure that the qemu-nbd process will be killed if need be. It sets its parent death signal to sighup, and, on receiving sighup, attempts to do a clean qemu-device detach, then exits. This should ensure that the device is detached if the qemu monitor crashes or exits. It may be worth adding a delay before the qemu-nbd is detached, but my brief tests haven't seen any data corruption.

Only the parts required for running a qcow2-backed container are implemented here. Create, destroy, and clone are not. The first use of this that I imagine is for people to use downloaded qcow2-backed images (like ubuntu cloud images, or anything previously used with qemu). I imagine people will want to create/clone/destroy out of band using qemu-img, but if I'm wrong about that we can implement the rest later.

Because attach_block_device() is done before the bdev is initialized, and bdev_init needs to know the nbd index so that it can mount the filesystem, we now need to pass the lxc_conf. file_exists() is moved to utils.c so we can use it from bdev.c. The nbd attach/detach should lay the groundwork for trivial implementation of qed and raw images.

changelog (may 12): qcow: fix idx check at detach

Hey Serge, I had to check the code for how to use this, so maybe we should document somewhere what the rootfs line needs to look like (ie.
lxc.rootfs = qcow2:/path/to/diskimg:1). Also, I used this against a .vdi image just fine, so maybe we should be more generic than just qcow2 and call it qemu? Not sure if qemu-nbd supports all the same image formats as qemu-img.

so, nbd:/file[:partition] ? Sounds good to me.
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
On Wed, 2014-05-14 at 21:00 -0700, Greg Kroah-Hartman wrote:
On Wed, May 14, 2014 at 10:15:27PM -0500, Seth Forshee wrote:
On Wed, May 14, 2014 at 10:17:31PM -0400, Michael H. Warfield wrote:

Using devtmpfs is one possible solution, and it would have the added benefit of making container setup simpler. But simply letting containers mount devtmpfs isn't sufficient, since the container may need to see a different, more limited set of devices, and because different environments making modifications to the filesystem could lead to conflicts. This series solves these problems by assigning devices to user namespaces. Each device has an owner namespace which specifies which devtmpfs mount the device should appear in, as well as allowing privileged operations on the device from that namespace. This defaults to init_user_ns. There's also an ns_global flag to indicate a device should appear in all devtmpfs mounts.

I'd strongly argue that this isn't even a problem at all. And, as I said at the Plumbers conference last year, adding namespaces to devices isn't going to happen, sorry. Please don't continue down this path.

I was just mentioning that to Serge just a week or so ago, reminding him of what you told all of us face to face back then. We were having a discussion over loop devices into containers and this topic came up.

It was the loop device use case that got me started down this path in the first place, so I don't personally have any interest in physical devices right now (though I was sure others would).

Why do you want to give access to a loop device to a container? Shouldn't you set up the loop devices before creating the container and then pass those mount points into the container? I thought that was how things worked today, or am I missing something?

Ah, you keep feeding me easy ones.
I need raw access to loop devices and loop-control because I'm using containers to build NST (Network Security Toolkit) distribution iso images (one container is x86_64 while the other is i686). Each requires 2 loop devices. You can't set up the loop devices in advance since the containers will be creating the images and building them. NST tinkers with the base build engine configuration, so I really DON'T want it running on a hard iron host. There may be other cases where I need other specialized containers for building distros. I'm also looking at custom builds of Kali (another security distribution).

Giving the ability for a container to create a loop device at all is a horrid idea; as you have pointed out, lots of information leakage could easily happen.

It does, but only slightly. I noticed that losetup will list all the devices regardless of the container where run or the container where set up. But that seems to be largely cosmetic. You can't do anything with the loop device in the other container. You can't disconnect it, read it, or mount it (I've tested it). In the former case, losetup returns with no error but does nothing. In the latter case, you get a busy error. Not clean, not pretty, but no damage.

Since loop-control is working on the global pool of loop devices, it's impossible to know what device to move to what container when the container runs losetup. For me, this isn't a serious problem, since it only involves 2 specialized containers out of over 4 dozen containers I have running across 3 sites. And those two containers are under my explicit and exclusive control. None of the others need it. I can get away with adding extra loop devices, adding them to the containers, and letting losetup deal with allocation and contention.

Serge mentioned something to me about a loopdevfs (?) thing that someone else is working on. That would seem to be a better solution in this particular case, but I don't know much about it or where it's at.
Mind you, I heard your arguments at Linux Plumbers regarding pushing user space policies into the kernel and all, and basically I agree with you: this should be handled in host system user space, and it seems reasonable. I'm just pointing out real world cases I have in operation right now, and pointing out that I have solutions for them in host user space, even if some of them may not be aesthetically pretty.

Regards,
Mike
--
Michael H. Warfield (AI4NB) | (770) 978-7061 | m...@wittsend.com
/\/\|=mhw=|\/\/ | (678) 463-0932 | http://www.wittsend.com/mhw/
NIC whois: MHW9 | An optimist believes we live in the best of all
PGP Key: 0x674627FF | possible worlds. A pessimist is sure of it!
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
On Thu, May 15, 2014 at 09:42:17AM -0400, Michael H. Warfield wrote:
On Wed, 2014-05-14 at 21:00 -0700, Greg Kroah-Hartman wrote:
On Wed, May 14, 2014 at 10:15:27PM -0500, Seth Forshee wrote:
On Wed, May 14, 2014 at 10:17:31PM -0400, Michael H. Warfield wrote:

Using devtmpfs is one possible solution, and it would have the added benefit of making container setup simpler. But simply letting containers mount devtmpfs isn't sufficient, since the container may need to see a different, more limited set of devices, and because different environments making modifications to the filesystem could lead to conflicts. This series solves these problems by assigning devices to user namespaces. Each device has an owner namespace which specifies which devtmpfs mount the device should appear in, as well as allowing privileged operations on the device from that namespace. This defaults to init_user_ns. There's also an ns_global flag to indicate a device should appear in all devtmpfs mounts.

I'd strongly argue that this isn't even a problem at all. And, as I said at the Plumbers conference last year, adding namespaces to devices isn't going to happen, sorry. Please don't continue down this path.

I was just mentioning that to Serge just a week or so ago, reminding him of what you told all of us face to face back then. We were having a discussion over loop devices into containers and this topic came up.

It was the loop device use case that got me started down this path in the first place, so I don't personally have any interest in physical devices right now (though I was sure others would).

Why do you want to give access to a loop device to a container? Shouldn't you set up the loop devices before creating the container and then pass those mount points into the container? I thought that was how things worked today, or am I missing something?

Ah, you keep feeding me easy ones.
I need raw access to loop devices and loop-control because I'm using containers to build NST (Network Security Toolkit) distribution iso images (one container is x86_64 while the other is i686). Each requires 2 loop devices. You can't set up the loop devices in advance since the containers will be creating the images and building them. NST tinkers with the base build engine configuration, so I really DON'T want it running on a hard iron host. There may be other cases where I need other specialized containers for building distros. I'm also looking at custom builds of Kali (another security distribution).

Then don't use a container to build such a thing, or fix the build scripts to not do that :) That is not a normal use case for a container at all. Containers are not for everything, use a virtual machine for some tasks (like this one).

Serge mentioned something to me about a loopdevfs (?) thing that someone else is working on. That would seem to be a better solution in this particular case, but I don't know much about it or where it's at.

Ok, let's see those patches then.

thanks,

greg k-h
Re: [lxc-devel] call to setup_dev_symlinks with lxc.autodev
On Thu, May 15, 2014 at 3:45 PM, Michael H. Warfield m...@wittsend.com wrote:

The patch was submitted and committed to git head shortly after I submitted it. It's there now, but there hasn't been a point release since. 1.0.4 is not out yet and I have no idea if this patch made the cut for 1.0.4 or not. That release seems to have taken a week or two longer than expected, so I hope it will be included.

ok, thanks for the info
-- William
[lxc-devel] [PATCH 1/2] add support for nbd (v3)
backing stores supported by qemu-nbd can be attached to a nbd block device using qemu-nbd. This user-space process (pair) stays around for the duration of the device attachment. Obviously we want it to go away when the container shuts down, but not before the filesystems have been cleanly unmounted. The device attachment is done from the task which will become the container monitor before the container setup+init task is spawned. That task starts in a new pid namespace to ensure that the qemu-nbd process will be killed if need be. It sets its parent death signal to sighup, and, on receiving sighup, attempts to do a clean qemu-device detach, then exits. This should ensure that the device is detached if the qemu monitor crashes or exits. It may be worth adding a delay before the qemu-nbd is detached, but my brief tests haven't seen any data corruption. Only the parts required for running a nbd-backed container are implemented here. Create, destroy, and clone are not. The first use of this that I imagine is for people to use downloaded nbd-backed images (like ubuntu cloud images, or anything previously used with qemu). I imagine people will want to create/clone/destroy out of band using qemu-img, but if I'm wrong about that we can implement the rest later. Because attach_block_device() is done before the bdev is initialized, and bdev_init needs to know the nbd index so that it can mount the filesystem, we now need to pass the lxc_conf. file_exists() is moved to utils.c so we can use it from bdev.c The nbd attach/detach should lay the groundwork for trivial implementation of qed and raw images. 
changelog (may 12): fix idx check at detach
changelog (may 15): generalize qcow2 to nbd

Signed-off-by: Serge Hallyn <serge.hal...@ubuntu.com>
---
 src/lxc/bdev.c         | 293 -
 src/lxc/bdev.h         | 17 ++-
 src/lxc/conf.c         | 3 +-
 src/lxc/conf.h         | 1 +
 src/lxc/lxccontainer.c | 19 +---
 src/lxc/start.c        | 11 +-
 src/lxc/utils.c        | 7 ++
 src/lxc/utils.h        | 1 +
 8 files changed, 329 insertions(+), 23 deletions(-)

diff --git a/src/lxc/bdev.c b/src/lxc/bdev.c
index 20e9fb3..e22d83d 100644
--- a/src/lxc/bdev.c
+++ b/src/lxc/bdev.c
@@ -41,6 +41,7 @@
 #include <libgen.h>
 #include <linux/loop.h>
 #include <dirent.h>
+#include <sys/prctl.h>

 #include "lxc.h"
 #include "config.h"
@@ -2410,6 +2411,287 @@ static const struct bdev_ops aufs_ops = {
 	.can_snapshot = true,
 };

+//
+// nbd dev ops
+//
+
+static int nbd_detect(const char *path)
+{
+	if (strncmp(path, "nbd:", 4) == 0)
+		return 1;
+	return 0;
+}
+
+struct nbd_attach_data {
+	const char *nbd;
+	const char *path;
+};
+
+static void nbd_detach(const char *path)
+{
+	int ret;
+	pid_t pid = fork();
+
+	if (pid < 0) {
+		SYSERROR("Error forking to detach nbd");
+		return;
+	}
+	if (pid) {
+		ret = wait_for_pid(pid);
+		if (ret < 0)
+			ERROR("nbd disconnect returned an error");
+		return;
+	}
+	execlp("qemu-nbd", "qemu-nbd", "-d", path, NULL);
+	SYSERROR("Error executing qemu-nbd");
+	exit(1);
+}
+
+static int do_attach_nbd(void *d)
+{
+	struct nbd_attach_data *data = d;
+	const char *nbd, *path;
+	pid_t pid;
+	sigset_t mask;
+	int sfd;
+	ssize_t s;
+	struct signalfd_siginfo fdsi;
+
+	sigemptyset(&mask);
+	sigaddset(&mask, SIGHUP);
+	sigaddset(&mask, SIGCHLD);
+
+	nbd = data->nbd;
+	path = data->path;
+
+	if (sigprocmask(SIG_BLOCK, &mask, NULL) == -1) {
+		SYSERROR("Error blocking signals for nbd watcher");
+		exit(1);
+	}
+
+	sfd = signalfd(-1, &mask, 0);
+	if (sfd == -1) {
+		SYSERROR("Error opening signalfd for nbd task");
+		exit(1);
+	}
+
+	if (prctl(PR_SET_PDEATHSIG, SIGHUP, 0, 0, 0) < 0)
+		SYSERROR("Error setting parent death signal for nbd watcher");
+
+	pid = fork();
+	if (pid) {
+		for (;;) {
+			s = read(sfd, &fdsi, sizeof(struct signalfd_siginfo));
+			if (s != sizeof(struct signalfd_siginfo))
+				SYSERROR("Error reading from signalfd");
+
+			if (fdsi.ssi_signo == SIGHUP) {
+				/* container has exited */
+				nbd_detach(nbd);
+				exit(0);
+			} else if (fdsi.ssi_signo == SIGCHLD) {
+				int status;
+				while (waitpid(-1, &status, WNOHANG) > 0)
+					;
+			}
+		}
+	}
+
+	close(sfd);
+	if (sigprocmask(SIG_UNBLOCK, &mask, NULL) == -1)
+		WARN("Warning: unblocking signals for nbd watcher");
+
+	execlp("qemu-nbd", "qemu-nbd", "-c",
[lxc-devel] [PATCH 2/2] lxc.container.conf: document the type: lxc.rootfs conventions
Signed-off-by: Serge Hallyn <serge.hal...@ubuntu.com>
---
 doc/lxc.container.conf.sgml.in | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/doc/lxc.container.conf.sgml.in b/doc/lxc.container.conf.sgml.in
index 6e96889..39de1cc 100644
--- a/doc/lxc.container.conf.sgml.in
+++ b/doc/lxc.container.conf.sgml.in
@@ -876,6 +876,20 @@ Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
           specified, the container shares its root file system with the host.
         </para>
+        <para>
+          For directory or simple block-device backed containers,
+          a pathname can be used. If the rootfs is backed by a nbd
+          device, then <filename>nbd:file:1</filename> specifies that
+          <filename>file</filename> should be attached to a nbd device,
+          and partition 1 should be mounted as the rootfs.
+          <filename>nbd:file</filename> specifies that the nbd device
+          itself should be mounted. <filename>overlayfs:/lower:/upper</filename>
+          specifies that the rootfs should be an overlay with <filename>/upper</filename>
+          being mounted read-write over a read-only mount of <filename>/lower</filename>.
+          <filename>aufs:/lower:/upper</filename> does the same using aufs in place
+          of overlayfs. <filename>loop:/file</filename> tells lxc to attach
+          <filename>/file</filename> to a loop device and mount the loop device.
+        </para>
       </listitem>
     </varlistentry>
--
1.9.1
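For reference, the conventions documented in this patch would look like the following in a container config (one lxc.rootfs line per container; the paths here are made up for illustration):

```
# directory backed (a plain pathname)
lxc.rootfs = /var/lib/lxc/c1/rootfs

# nbd backed: attach the image to an nbd device, mount partition 1
lxc.rootfs = nbd:/var/lib/lxc/c1/disk.qcow2:1

# overlayfs: /upper mounted read-write over a read-only /lower
lxc.rootfs = overlayfs:/lower:/upper

# loop backed: attach the file to a loop device and mount it
lxc.rootfs = loop:/var/lib/lxc/c1/disk.img
```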
Re: [lxc-devel] [PATCH 1/2] add support for nbd (v3)
On Thu, 15 May 2014 14:33:18 + Serge Hallyn serge.hal...@ubuntu.com wrote: backing stores supported by qemu-nbd can be attached to a nbd block device using qemu-nbd. This user-space process (pair) stays around for the duration of the device attachment. Obviously we want it to go away when the container shuts down, but not before the filesystems have been cleanly unmounted. The device attachment is done from the task which will become the container monitor before the container setup+init task is spawned. That task starts in a new pid namespace to ensure that the qemu-nbd process will be killed if need be. It sets its parent death signal to sighup, and, on receiving sighup, attempts to do a clean qemu-device detach, then exits. This should ensure that the device is detached if the qemu monitor crashes or exits. It may be worth adding a delay before the qemu-nbd is detached, but my brief tests haven't seen any data corruption. Only the parts required for running a nbd-backed container are implemented here. Create, destroy, and clone are not. The first use of this that I imagine is for people to use downloaded nbd-backed images (like ubuntu cloud images, or anything previously used with qemu). I imagine people will want to create/clone/destroy out of band using qemu-img, but if I'm wrong about that we can implement the rest later. Because attach_block_device() is done before the bdev is initialized, and bdev_init needs to know the nbd index so that it can mount the filesystem, we now need to pass the lxc_conf. file_exists() is moved to utils.c so we can use it from bdev.c The nbd attach/detach should lay the groundwork for trivial implementation of qed and raw images. 
changelog (may 12): fix idx check at detach
changelog (may 15): generalize qcow2 to nbd

Signed-off-by: Serge Hallyn <serge.hal...@ubuntu.com>

Acked-by: Dwight Engen <dwight.en...@oracle.com>

---
 src/lxc/bdev.c         | 293 -
 src/lxc/bdev.h         | 17 ++-
 src/lxc/conf.c         | 3 +-
 src/lxc/conf.h         | 1 +
 src/lxc/lxccontainer.c | 19 +---
 src/lxc/start.c        | 11 +-
 src/lxc/utils.c        | 7 ++
 src/lxc/utils.h        | 1 +
 8 files changed, 329 insertions(+), 23 deletions(-)

diff --git a/src/lxc/bdev.c b/src/lxc/bdev.c
index 20e9fb3..e22d83d 100644
--- a/src/lxc/bdev.c
+++ b/src/lxc/bdev.c
@@ -41,6 +41,7 @@
 #include <libgen.h>
 #include <linux/loop.h>
 #include <dirent.h>
+#include <sys/prctl.h>

 #include "lxc.h"
 #include "config.h"
@@ -2410,6 +2411,287 @@ static const struct bdev_ops aufs_ops = {
 	.can_snapshot = true,
 };

+//
+// nbd dev ops
+//
+
+static int nbd_detect(const char *path)
+{
+	if (strncmp(path, "nbd:", 4) == 0)
+		return 1;
+	return 0;
+}
+
+struct nbd_attach_data {
+	const char *nbd;
+	const char *path;
+};
+
+static void nbd_detach(const char *path)
+{
+	int ret;
+	pid_t pid = fork();
+
+	if (pid < 0) {
+		SYSERROR("Error forking to detach nbd");
+		return;
+	}
+	if (pid) {
+		ret = wait_for_pid(pid);
+		if (ret < 0)
+			ERROR("nbd disconnect returned an error");
+		return;
+	}
+	execlp("qemu-nbd", "qemu-nbd", "-d", path, NULL);
+	SYSERROR("Error executing qemu-nbd");
+	exit(1);
+}
+
+static int do_attach_nbd(void *d)
+{
+	struct nbd_attach_data *data = d;
+	const char *nbd, *path;
+	pid_t pid;
+	sigset_t mask;
+	int sfd;
+	ssize_t s;
+	struct signalfd_siginfo fdsi;
+
+	sigemptyset(&mask);
+	sigaddset(&mask, SIGHUP);
+	sigaddset(&mask, SIGCHLD);
+
+	nbd = data->nbd;
+	path = data->path;
+
+	if (sigprocmask(SIG_BLOCK, &mask, NULL) == -1) {
+		SYSERROR("Error blocking signals for nbd watcher");
+		exit(1);
+	}
+
+	sfd = signalfd(-1, &mask, 0);
+	if (sfd == -1) {
+		SYSERROR("Error opening signalfd for nbd task");
+		exit(1);
+	}
+
+	if (prctl(PR_SET_PDEATHSIG, SIGHUP, 0, 0, 0) < 0)
+		SYSERROR("Error setting parent death signal for nbd watcher");
+
+	pid = fork();
+	if (pid) {
+		for (;;) {
+			s = read(sfd, &fdsi, sizeof(struct signalfd_siginfo));
+			if (s != sizeof(struct signalfd_siginfo))
+				SYSERROR("Error reading from signalfd");
+
+			if (fdsi.ssi_signo == SIGHUP) {
+				/* container has exited */
+				nbd_detach(nbd);
+				exit(0);
+			} else if (fdsi.ssi_signo == SIGCHLD) {
+				int status;
+				while (waitpid(-1, &status, WNOHANG) > 0)
+					;
+			}
+		}
+	}
+
+	close(sfd);
+
Re: [lxc-devel] [PATCH 2/2] lxc.container.conf: document the type: lxc.rootfs conventions
On Thu, 2014-05-15 at 14:33 +, Serge Hallyn wrote:

Signed-off-by: Serge Hallyn <serge.hal...@ubuntu.com>
---
 doc/lxc.container.conf.sgml.in | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/doc/lxc.container.conf.sgml.in b/doc/lxc.container.conf.sgml.in
index 6e96889..39de1cc 100644
--- a/doc/lxc.container.conf.sgml.in
+++ b/doc/lxc.container.conf.sgml.in
@@ -876,6 +876,20 @@ Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
           specified, the container shares its root file system with the host.
         </para>
+        <para>
+          For directory or simple block-device backed containers,
+          a pathname can be used. If the rootfs is backed by a nbd
+          device, then <filename>nbd:file:1</filename> specifies that
+          <filename>file</filename> should be attached to a nbd device,
+          and partition 1 should be mounted as the rootfs.
+          <filename>nbd:file</filename> specifies that the nbd device
+          itself should be mounted. <filename>overlayfs:/lower:/upper</filename>
+          specifies that the rootfs should be an overlay with <filename>/upper</filename>
+          being mounted read-write over a read-only mount of <filename>/lower</filename>.
+          <filename>aufs:/lower:/upper</filename> does the same using aufs in place
+          of overlayfs. <filename>loop:/file</filename> tells lxc to attach
+          <filename>/file</filename> to a loop device and mount the loop device.
+        </para>
       </listitem>
     </varlistentry>
--
1.9.1

I may be off base here, but does this relate to that exchange on the -users list a couple of weeks ago about the Fedora template and an lvm backing store? Or is that one of these simple block-device backed things? The OP never got back with us and I haven't tried testing it myself.

Regards,
Mike
--
Michael H. Warfield (AI4NB) | (770) 978-7061 | m...@wittsend.com
/\/\|=mhw=|\/\/ | (678) 463-0932 | http://www.wittsend.com/mhw/
NIC whois: MHW9 | An optimist believes we live in the best of all
PGP Key: 0x674627FF | possible worlds. A pessimist is sure of it!
[lxc-devel] [PATCH 3/2] nbd: exit cleanly if nbd fails to attach
Signed-off-by: Serge Hallyn <serge.hal...@ubuntu.com>
---
 src/lxc/bdev.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/src/lxc/bdev.c b/src/lxc/bdev.c
index e22d83d..1d9a25a 100644
--- a/src/lxc/bdev.c
+++ b/src/lxc/bdev.c
@@ -2491,7 +2491,15 @@ static int do_attach_nbd(void *d)
 				exit(0);
 			} else if (fdsi.ssi_signo == SIGCHLD) {
 				int status;
-				while (waitpid(-1, &status, WNOHANG) > 0);
+				/* If qemu-nbd fails, or is killed by a signal,
+				 * then exit */
+				while (waitpid(-1, &status, WNOHANG) > 0) {
+					if ((WIFEXITED(status) && WEXITSTATUS(status) != 0) ||
+					    WIFSIGNALED(status)) {
+						nbd_detach(nbd);
+						exit(1);
+					}
+				}
 			}
 		}
 	}
--
1.9.1
Re: [lxc-devel] [PATCH 2/2] lxc.container.conf: document the type: lxc.rootfs conventions
On Thu, 15 May 2014 14:33:47 + Serge Hallyn <serge.hal...@ubuntu.com> wrote:

Signed-off-by: Serge Hallyn <serge.hal...@ubuntu.com>

Acked-by: Dwight Engen <dwight.en...@oracle.com>

---
 doc/lxc.container.conf.sgml.in | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/doc/lxc.container.conf.sgml.in b/doc/lxc.container.conf.sgml.in
index 6e96889..39de1cc 100644
--- a/doc/lxc.container.conf.sgml.in
+++ b/doc/lxc.container.conf.sgml.in
@@ -876,6 +876,20 @@ Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
           specified, the container shares its root file system with the host.
         </para>
+        <para>
+          For directory or simple block-device backed containers,
+          a pathname can be used. If the rootfs is backed by a nbd
+          device, then <filename>nbd:file:1</filename> specifies that
+          <filename>file</filename> should be attached to a nbd device,
+          and partition 1 should be mounted as the rootfs.
+          <filename>nbd:file</filename> specifies that the nbd device
+          itself should be mounted. <filename>overlayfs:/lower:/upper</filename>
+          specifies that the rootfs should be an overlay with <filename>/upper</filename>
+          being mounted read-write over a read-only mount of <filename>/lower</filename>.
+          <filename>aufs:/lower:/upper</filename> does the same using aufs in place
+          of overlayfs. <filename>loop:/file</filename> tells lxc to attach
+          <filename>/file</filename> to a loop device and mount the loop device.
+        </para>
       </listitem>
     </varlistentry>
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman gre...@linuxfoundation.org wrote:

Then don't use a container to build such a thing, or fix the build scripts to not do that :)

I second this. To me it looks like some folks try to (ab)use Linux containers for purposes where KVM would be a much better fit. Please don't put more complexity into containers. They are already horribly complex and error prone.

--
Thanks,
//richard
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
Quoting Richard Weinberger (richard.weinber...@gmail.com):
On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman gre...@linuxfoundation.org wrote:

Then don't use a container to build such a thing, or fix the build scripts to not do that :)

I second this. To me it looks like some folks try to (ab)use Linux containers for purposes where KVM would be a much better fit. Please don't put more complexity into containers. They are already horribly complex and error prone.

I, naturally, disagree :) The only use case which is inherently not valid for containers is running a kernel. Practically speaking there are other things which likely will never be possible, but if someone offers a way to do something in containers, "you can't do that in containers" is not an apropos response. "That abstraction is wrong" is certainly valid, as when vpids were originally proposed and rejected, resulting in the development of pid namespaces. "We have to work out (x) first" can be valid (and I can think of examples here), assuming it's not just trying to hide behind a catch-22/chicken-egg problem.

Finally, saying containers are complex and error prone is conflating several large suites of userspace code and many kernel features which support them. Being more precise would, if the argument is valid, lend it a lot more weight.
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
On 15.05.2014 21:50, Serge Hallyn wrote:
Quoting Richard Weinberger (richard.weinber...@gmail.com):
On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman gre...@linuxfoundation.org wrote:

Then don't use a container to build such a thing, or fix the build scripts to not do that :)

I second this. To me it looks like some folks try to (ab)use Linux containers for purposes where KVM would be a much better fit. Please don't put more complexity into containers. They are already horribly complex and error prone.

I, naturally, disagree :) The only use case which is inherently not valid for containers is running a kernel. Practically speaking there are other things which likely will never be possible, but if someone offers a way to do something in containers, "you can't do that in containers" is not an apropos response. "That abstraction is wrong" is certainly valid, as when vpids were originally proposed and rejected, resulting in the development of pid namespaces. "We have to work out (x) first" can be valid (and I can think of examples here), assuming it's not just trying to hide behind a catch-22/chicken-egg problem. Finally, saying containers are complex and error prone is conflating several large suites of userspace code and many kernel features which support them. Being more precise would, if the argument is valid, lend it a lot more weight.

We (my company) have used Linux containers in production since 2011. First LXC, now libvirt-lxc. To understand the internals better I also wrote my own userspace to create/start containers. There are so many things which can hurt you badly. With user namespaces we expose a really big attack surface to regular users, i.e. suddenly a user is allowed to mount filesystems. Ask Andy, he has already found lots of nasty things... I agree that user namespaces are the way to go; all the papering over security issues with LSM is much worse. But we have to make sure that we don't add too many features too fast.
That said, I like containers a lot because they are cheap, but as they are lightweight, the isolation level is lightweight too. IMHO containers are not a cheap replacement for KVM.

Thanks,
//richard
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
Quoting Richard Weinberger (rich...@nod.at):
On 15.05.2014 21:50, Serge Hallyn wrote:
Quoting Richard Weinberger (richard.weinber...@gmail.com):
On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman gre...@linuxfoundation.org wrote:

Then don't use a container to build such a thing, or fix the build scripts to not do that :)

I second this. To me it looks like some folks try to (ab)use Linux containers for purposes where KVM would be a much better fit. Please don't put more complexity into containers. They are already horribly complex and error prone.

I, naturally, disagree :) The only use case which is inherently not valid for containers is running a kernel. Practically speaking there are other things which likely will never be possible, but if someone offers a way to do something in containers, "you can't do that in containers" is not an apropos response. "That abstraction is wrong" is certainly valid, as when vpids were originally proposed and rejected, resulting in the development of pid namespaces. "We have to work out (x) first" can be valid (and I can think of examples here), assuming it's not just trying to hide behind a catch-22/chicken-egg problem. Finally, saying containers are complex and error prone is conflating several large suites of userspace code and many kernel features which support them. Being more precise would, if the argument is valid, lend it a lot more weight.

We (my company) have used Linux containers in production since 2011. First LXC, now libvirt-lxc. To understand the internals better I also wrote my own userspace to create/start containers. There are so many things which can hurt you badly. With user namespaces we expose a really big attack surface to regular users, i.e. suddenly a user is allowed to mount filesystems.

That is currently not the case. They can mount some virtual filesystems and do bind mounts, but cannot mount most real filesystems. This keeps us protected (for now) from potentially unsafe superblock readers in the kernel.
> Ask Andy, he has already found lots of nasty things...

Yes, of course, and there may be more to come...

> I agree that user namespaces are the way to go; all the papering over
> security issues with LSMs is much worse. But we have to make sure that
> we don't add too many features too fast.

Agreed. Like I said, "we have to work (x) out first" could be valid,
including "we should wait (a year?) for user ns issues to fall out before
relaxing any of the current user ns constraints". On the other hand, not
exercising the new code may only mean that existing flaws stick around
longer, undetected (by most).

> That said, I like containers a lot because they are cheap, but as they
> are lightweight, the isolation they provide is lightweight too. IMHO
> containers are not a cheap replacement for KVM.

The building blocks for containers can also be used for entirely new,
simpler use cases - e.g. perhaps a new fakeroot alternative based on user
namespace mappings. Which is why "this is not a use case for containers"
is not the right way to push back, whether or not the feature ends up
being appropriate.

-serge
___
lxc-devel mailing list
lxc-devel@lists.linuxcontainers.org
http://lists.linuxcontainers.org/listinfo/lxc-devel
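As a concrete illustration of the "fakeroot alternative" idea mentioned above: with util-linux's unshare(1) and a kernel that permits unprivileged user namespaces, a regular user can already map themselves to uid 0 inside a new namespace. A sketch, not part of the original thread:

```shell
# Unprivileged user-namespace "fakeroot": map the current user to root
# inside a fresh user namespace. No real privilege is gained on the host.
# Requires util-linux unshare and unprivileged user namespaces enabled.
unshare --map-root-user id -u
```

Inside the namespace `id -u` reports 0, while files created there are still owned by the real user from the host's point of view.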
[lxc-devel] [PATCH] Refactoring lxc-autostart boot process and group handling.
Ok... Here it is. The roll-up of Dwight's earlier patch, my previous 3 patches, plus, now, the rework of lxc-autostart to support multiple invocations of the -g option, allow ordinal inclusion of the NULL group, and process containers first in order of the groups specified on the command line and their membership in those groups, subject to all other constraints (lxc.start.auto) and orderings (lxc.start.order).

I have also noticed that host shutdown time can be exorbitantly long. I added parameters to set a shutdown timeout of -t 5 seconds to speed that process up (which may still not be fast enough). The problem here is that while startup returns quickly as a container initializes, and we need a delay, shutdown is largely serial and not parallelized at all. We may need to address that in some way moving forward.

Here's the grand patch:

--

Refactoring lxc-autostart boot process and group handling.

This is a rollup of 4 earlier patches: patching the systemd init to use the sysvinit script, adding an "onboot" group to the boot set, updating upstart to include the onboot group, and adding documentation for the special boot groups.

This also adds new functionality to lxc-autostart:

*) The -g / --groups option is a multiple, cumulative entry. It may be mixed freely with the previous comma-separated group list convention. Groups are processed in the order they first appear in the aggregated group list.

*) The NULL group may be specified in the group list using either a leading comma, a trailing comma, or an embedded comma.

*) Booting proceeds in order of the groups specified on the command line, then ordered by lxc.start.order and name collating sequence.

*) Default host bootup is now specified as "-g onboot," (note the trailing comma), meaning that first the onboot group is booted and then any remaining enabled containers in the NULL group are booted.
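The -g semantics described above (cumulative options, comma-separated lists, the NULL group as an empty element, first-appearance ordering) can be modeled with a small Python sketch. The real logic lives in src/lxc/lxc_autostart.c in C; this is only an illustration of the intended behavior, and the function name is invented for the example.

```python
def aggregate_groups(group_args):
    """Aggregate repeated -g arguments into one ordered group list.

    Each argument may itself be a comma-separated list; an empty
    element (produced by a leading, trailing, or embedded comma)
    names the NULL group, represented here as "".  Groups are kept
    in order of first appearance, with duplicates dropped.
    """
    seen = []
    for arg in group_args:
        for group in arg.split(","):
            if group not in seen:
                seen.append(group)
    return seen

# "-g onboot -g web,,db": the NULL group sits between web and db
print(aggregate_groups(["onboot", "web,,db"]))  # ['onboot', 'web', '', 'db']
```

Containers would then be started group by group in this list order, with lxc.start.order and name collation breaking ties within each group.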
From the previous 4 individual patches:

Reported-by: CDR vene...@gmail.com
Signed-off-by: Dwight Engen dwight.en...@oracle.com
- reuse the sysvinit script to ensure that if the lxc is configured to use a bridge set up by libvirt, the bridge will be available before starting the container
- made the sysvinit script check for the existence of ifconfig, and fall back to "ip link list" if available
- made the lxc service also dependent on network.target
- autoconfized the paths in the service file and sysvinit script
- v2: rename script lxc-autostart to lxc-autostart-helper to avoid confusion

From: Michael H. Warfield m...@wittsend.com
- This adds a non-null group (onboot) to the sysvinit startup script for autobooting containers. This allows containers which are in other groups to be included in the autoboot process. This script is used by both the sysvinit systems and the systemd systems.

From: Michael H. Warfield m...@wittsend.com
- Add the feature to the Upstart init script to boot the onboot group dependent on the start.auto = 1 flag. This brings the Upstart behavior into congruence with the sysvinit script for SysV Init and systemd.

From: Michael H. Warfield m...@wittsend.com
- Added sections to lxc-autostart and lxc.container.conf to document the behavior of the LXC service at host system boot time with regard to the lxc.group and lxc.start.auto parameters.

Signed-off-by: Michael H.
Warfield m...@wittsend.com
---
 .gitignore                         |   3 +
 config/init/systemd/Makefile.am    |  14 +-
 config/init/systemd/lxc.service    |  17 --
 config/init/systemd/lxc.service.in |  17 ++
 config/init/sysvinit/lxc           |  66
 config/init/sysvinit/lxc.in        |  86 ++
 config/init/upstart/lxc.conf       |   8 +-
 configure.ac                       |   2 +
 doc/lxc-autostart.sgml.in          |  30
 doc/lxc.container.conf.sgml.in     |  23 +++
 lxc.spec.in                        |   1 +
 src/lxc/lxc_autostart.c            | 331 +++--
 12 files changed, 427 insertions(+), 171 deletions(-)
 delete mode 100644 config/init/systemd/lxc.service
 create mode 100644 config/init/systemd/lxc.service.in
 delete mode 100755 config/init/sysvinit/lxc
 create mode 100644 config/init/sysvinit/lxc.in

diff --git a/.gitignore b/.gitignore
index 8145f81..2b478cd 100644
--- a/.gitignore
+++ b/.gitignore
@@ -111,6 +111,9 @@ config/missing
 config/libtool.m4
 config/lt*.m4
 config/bash/lxc
+config/init/systemd/lxc-autostart-helper
+config/init/systemd/lxc.service
+config/init/sysvinit/lxc
 doc/*.1
 doc/*.5
diff --git a/config/init/systemd/Makefile.am b/config/init/systemd/Makefile.am
index de5ee50..fc374c5 100644
--- a/config/init/systemd/Makefile.am
+++ b/config/init/systemd/Makefile.am
@@ -5,7 +5,17 @@ EXTRA_DIST = \
 if INIT_SCRIPT_SYSTEMD
 SYSTEMD_UNIT_DIR = $(prefix)/lib/systemd/system
-install-systemd: lxc.service lxc-devsetup
+lxc-autostart-helper: ../sysvinit/lxc.in $(top_builddir)/config.status
+	$(AM_V_GEN)sed
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
On Thu, May 15, 2014 at 05:42:54PM +0000, Serge Hallyn wrote:

What exactly defines "normal use case for a container"?

Well, I'd say acting like a virtual machine is a good start :)

Not too long ago much of what we can now do with network namespaces was not a normal container use case. Neither "you can't do it now" nor "I don't use it like that" should be grounds for a pre-emptive nack. "It will horribly break security assumptions" certainly would be.

I agree, and maybe we will get there over time, but this patch is not the way to do that.

That's not to say there might not be good reasons why this in particular is not appropriate, but ISTM if things are going to be nacked without consideration of the patchset itself, we ought to be having a ksummit session to come to a consensus [or receive a decree, presumably by you :) but after we have a chance to make our case] on what things are going to be un/acceptable.

I already stood up and publicly said this last year at Plumbers, why is anything now different? And this patchset is proof of why it's not a good idea. You really didn't do anything with all of the namespace stuff, except change loop. That's the only thing that cares, so just do it there, like I said to do so last August. And you are ignoring the notifications to userspace and how namespaces here would deal with that.

Serge mentioned something to me about a loopdevfs (?) thing that someone else is working on. That would seem to be a better solution in this particular case, but I don't know much about it or where it's at.

Ok, let's see those patches then.

I think Seth has a git tree ready, but not sure which branch he'd want us to look at. Splitting a namespaced devtmpfs from the loopdevfs discussion might be sensible.
However, in defense of a namespaced devtmpfs, I'd say that for userspace to, at every container startup, bind-mount devices from the global devtmpfs into a private tmpfs (for systemd's sake it can't just be on the container rootfs) seems like something worth avoiding.

I think having to pick and choose what device nodes you want in a container is a good thing. Besides, you would have to do the same thing in the kernel anyway; what's wrong with userspace making the decision here, especially as it knows exactly what it wants to do much more so than the kernel ever can?

PS - Apparently both Parallels and Michael independently project devices which are hot-plugged on the host into containers. That also seems like something worth talking about (best practices, shortcomings, use cases not met by it, any ways that the kernel can help out) at ksummit/linuxcon.

I was told that containers would never want devices hotplugged into them. What use case has this happening / needed?

thanks,

greg k-h
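The userspace approach Greg advocates - picking and choosing device nodes and creating them in a private /dev at container startup - can be sketched roughly as below. The whitelist and function names are illustrative only, not LXC's actual autodev implementation; mounting the private tmpfs itself is omitted, and real use requires root (or a user namespace with mknod rights).

```python
import os
import stat

# Hypothetical whitelist of basic character devices; container managers
# keep a similar static list (name, major, minor).
BASE_DEVICES = [
    ("null",    1, 3),
    ("zero",    1, 5),
    ("full",    1, 7),
    ("random",  1, 8),
    ("urandom", 1, 9),
    ("tty",     5, 0),
]

def populate_dev(dev_root, devices=BASE_DEVICES, mknod=os.mknod):
    """Create the chosen device nodes under a private /dev.

    `dev_root` is a freshly mounted tmpfs; `mknod` is injectable so the
    plan can be inspected without root, as done below.
    """
    created = []
    for name, major, minor in devices:
        path = os.path.join(dev_root, name)
        mknod(path, stat.S_IFCHR | 0o666, os.makedev(major, minor))
        created.append(path)
    return created

# Dry run: record the calls instead of actually creating nodes.
calls = []
paths = populate_dev("/dev", mknod=lambda p, mode, dev: calls.append(p))
print(paths[:2])  # ['/dev/null', '/dev/zero']
```

This is the "userspace knows exactly what it wants" model: the policy (which nodes exist in the container) lives entirely in the manager's list, not in the kernel.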
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
On Fri, May 16, 2014 at 01:49:59AM +0000, Serge Hallyn wrote:

I think having to pick and choose what device nodes you want in a container is a good thing. Besides, you would have to do the same thing in the kernel anyway; what's wrong with userspace making the decision here, especially as it knows exactly what it wants to do much more so than the kernel ever can?

For 'real' devices that sounds sensible. The thing about loop devices is that we simply want to allow a container to say "give me a loop device to use" and have it receive a unique loop device (or 3), without having to pre-assign them. I think that would be cleaner to do using a pseudofs and loop-control device, rather than having to have a daemon in userspace on the host farming those out in response to some, I don't know, dbus request?

I agree that loop devices would be nice to have in a container, and that the existing loop interface doesn't really lend itself to that. So create a new type of thing that acts like a loop device in a container. But don't try to mess with the whole driver core just for a single type of device.

greg k-h
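For context, the host-side allocation Serge alludes to already exists outside namespaces: since Linux 3.1, /dev/loop-control hands out a free loop device via an ioctl, with no userspace daemon involved. A minimal Python sketch of that host-side call (the ioctl number is LOOP_CTL_GET_FREE from <linux/loop.h>; actually running it requires root and is exactly what a container cannot do directly, which is the gap loopdevfs would fill):

```python
import fcntl
import os

LOOP_CTL_GET_FREE = 0x4C82  # from <linux/loop.h>

def get_free_loop():
    """Ask the host's loop driver for a free loop device.

    Returns a path like /dev/loopN.  The driver allocates a new index
    if all existing devices are busy.  Needs root and a kernel with
    /dev/loop-control (Linux >= 3.1).
    """
    fd = os.open("/dev/loop-control", os.O_RDWR)
    try:
        index = fcntl.ioctl(fd, LOOP_CTL_GET_FREE)
        return "/dev/loop%d" % index
    finally:
        os.close(fd)
```

The debated question is precisely whether a namespaced pseudofs could expose an equivalent of this per-container, so that "give me a loop device" inside a container yields a device invisible to, and unshared with, the host's numbering.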