Re: [systemd-devel] [PATCH] audit: Fix journal failing on unsupported audit in containers [was: journal: don't complain about audit socket errors in a container.]

2015-05-21 Thread Serge Hallyn
Quoting Lennart Poettering (lenn...@poettering.net):
> On Wed, 20.05.15 22:40, Martin Pitt (martin.p...@ubuntu.com) wrote:
> 
> > Hey Lennart,
> > 
> > Lennart Poettering [2015-05-20 17:49 +0200]:
> > > Nope, ConditionSecurity=audit is only a simple boolean check that
> > > holds when audit is enabled at all. It doesn't tell you anything about
> > > the precise audit feature set of the kernel.
> > 
> > Ah, thanks for the clarification.
> > 
> > > I have now conditionalized the unit on CAP_AUDIT_READ, which is the
> > > cap that you need to read the audit multicast stuff. Your container
> > > manager hence should simply drop that cap from the cap set it passes
> > > and all should be good.

I want to clarify this point.  Dropping CAP_AUDIT_READ from the bounding
set means dropping it from the capabilities targeted at your own user
namespace.  The only check the kernel currently does for CAP_AUDIT_READ is
against the initial user namespace.  One day of course (maybe soon) this
may change so that you only need CAP_AUDIT_READ against your own
user_ns.  Following the above, container managers could then again keep
CAP_AUDIT_READ in the bounding set.
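
For concreteness, "dropping it from the bounding set" means the container
manager doing roughly the following before it execs the container's init
(just a sketch; it needs CAP_SETPCAP, and CAP_AUDIT_READ only exists in
3.16+ kernel headers):

/* Sketch: what a container manager would do to drop CAP_AUDIT_READ from
 * the bounding set before exec'ing the container's init.  Needs
 * CAP_SETPCAP; CAP_AUDIT_READ is only in kernel headers >= 3.16. */
#include <stdio.h>
#include <sys/prctl.h>
#include <linux/capability.h>   /* CAP_AUDIT_READ */

int main(void)
{
        if (prctl(PR_CAPBSET_DROP, CAP_AUDIT_READ, 0, 0, 0) < 0) {
                perror("PR_CAPBSET_DROP");
                return 1;
        }
        /* ...then set up namespaces and exec the container's init... */
        return 0;
}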

But I'm claiming that checking for CAP_AUDIT_READ in your bounding set
is the wrong check here.  It simply has nothing to do with what you
actually want to be able to do.  One could argue that the right answer
is a new kernel facility to check for caps against init_user_ns, but no:
that will have the same problem once audit namespaces become possible.  I
think the right check for systemd to perform, to see whether this is
allowed, is to actually try the bind().  That will return the right
answer both now and when namespaced audit is possible, without taking a
probably-wrong, unrelated cue from the container manager.
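
Roughly, the probe I have in mind would look like this (a sketch only,
not journald's actual code; the error handling and messages are
illustrative):

/* Sketch of the probe: try to join the audit read-log multicast group and
 * see what the kernel says, instead of guessing from the bounding set. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/audit.h>        /* AUDIT_NLGRP_READLOG */
#include <linux/netlink.h>      /* NETLINK_AUDIT, struct sockaddr_nl */

int main(void)
{
        struct sockaddr_nl sa = {
                .nl_family = AF_NETLINK,
                .nl_pid    = 0,
                /* nl_groups is a bitmask; group N is bit 1 << (N - 1) */
                .nl_groups = 1U << (AUDIT_NLGRP_READLOG - 1),
        };
        int fd = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC, NETLINK_AUDIT);
        if (fd < 0) {
                /* e.g. EPROTONOSUPPORT: kernel built without audit */
                printf("no audit here: %s\n", strerror(errno));
                return 1;
        }
        if (bind(fd, (struct sockaddr *) &sa, sizeof(sa)) < 0) {
                /* e.g. EPERM: not allowed from this context (container) */
                printf("audit multicast not usable: %s\n", strerror(errno));
                close(fd);
                return 1;
        }
        printf("audit multicast readable, go ahead and listen\n");
        close(fd);
        return 0;
}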

It's not earth-shatteringly important and what you've got is workable,
but I think it may set a better precedent to do it the other way.

-serge

(One might almost think that we should have a new kernel facility to
answer such questions.  CAP_MAC_ADMIN is similar.)


Re: [systemd-devel] "dynamic" uid allocation (was: [PATCH] loopback setup in unprivileged containers)

2015-02-03 Thread Serge Hallyn
Quoting Lennart Poettering (lenn...@poettering.net):
> On Tue, 03.02.15 15:03, Daniel P. Berrange (berra...@redhat.com) wrote:
> 
> > > Hmm, so, I thought a lot about this in the past weeks. I think the way
> > > I'd really like to see this work in the end is that we never have to
> > > persist the UID mappings. This could work if the kernel would provide
> > > us with the ability to bind mount a file system into the container
> > > applying a fixed UID shift. That way, the shifted UIDs would never hit
> > > the actual disk, and hence we wouldn't have to persist their mappings.
> > > 
> > > Instead on each container startup we'd look for a new UID range, and
> > > release it entirely when the container shuts down. The bind mount with
> > > UID shift would then shift the UIDs up, the userns stuff would shift
> > > it down from inside the container again.
> > > 
> > > Of course, this all depends on whether the kernel will get an
> > > extension to apply uid shifts to bind mounts. I hear they want to
> > > provide this, but let's see.
> > 
> > I would dearly love to see that happen. Having to recursively change

It'd definitely be useful (though not without issues).

> > the UID/GID on entire filesystem sub-trees given to containers with
> > userns is a real unpleasant thing to have to deal with. I'd not want

Of course you would *not* want to take a stock rootfs where uid == 0
and shift that into the container, as that would give root in the
container a chance to write root-owned files on the host to leverage
later in a convoluted attack :)  We might want to come up with a
containers consensus that container rootfs's are always shipped with
uid range 0-65535 -> 100000-165535.  That still leaves a chance for
container A (mapped to 200000-265535) to write a valid setuid-root
binary for container B (mapped to 300000-365535), which isn't possible
otherwise.  But that's better than doing so for host-root.
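
For the curious, the recursive shift Daniel mentions above is basically
the following (a sketch under the 100000-offset convention just
described; the rootfs path is an example, and real tools also have to
worry about preserving set-id bits and file capabilities):

/* Sketch of the recursive uid/gid shift nobody wants to keep doing:
 * walk the rootfs and add a fixed offset to every file's owner. */
#define _XOPEN_SOURCE 700
#include <ftw.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

#define SHIFT 100000u

static int shift_one(const char *path, const struct stat *st,
                     int type, struct FTW *ftw)
{
        (void) type; (void) ftw;
        /* lchown so symlinks themselves are shifted, not their targets */
        if (lchown(path, st->st_uid + SHIFT, st->st_gid + SHIFT) < 0) {
                perror(path);
                return 1;       /* non-zero stops the walk */
        }
        return 0;
}

int main(int argc, char **argv)
{
        const char *rootfs = argc > 1 ? argv[1] : "/var/lib/lxc/c1/rootfs";
        /* FTW_PHYS: don't follow symlinks; FTW_MOUNT: stay on one fs */
        return nftw(rootfs, shift_one, 64, FTW_PHYS | FTW_MOUNT) != 0;
}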

> > the filesystem UID shift to only apply to bind mounts though. It is
> > not uncommon to use a disk image[1] for a container's filesystem, so
> > being able to request a UID shift on *any* filesystem mount is pretty
> > desirable, rather than having to mount the image and then bind mount
> > it onto itself just to apply the UID shift.
> 
> Well, you can always change the bind mount flags without creating a
> new bind mount with MS_BIND|MS_REMOUNT.
> 
> > [1] Using a separate disk image per container means a container can't
> > DOS other containers by exhausting inodes for example with $millions
> > of small files.
> 
> Indeed. I'd claim that without such a concept of mount point uid
> shifting the whole userns story is not very useful IRL...

I had always thought this would eventually be done using a stackable
filesystem, but doing it at bind mount time would be neat too, and
less objectionable to the kernel folks.  (Though overlayfs is in now,
so )

I'm actually quite surprised no one has sat down and written a
stackable uid-shifting fs yet.

-serge


Re: [systemd-devel] arch linux container filesystems

2014-06-20 Thread Serge Hallyn
Quoting Lennart Poettering (lenn...@poettering.net):
> On Fri, 20.06.14 15:47, Robin Becker (ro...@reportlab.com) wrote:
> > In any case, some might argue that a container (lightweight or not)
> > should be virtually indistinguishable from the original system which
> > would mean such a bug could not happen.
> 
> Well, these are containers not VMs. They are actually massively
> different from the host. For example, neither /sys nor /dev is
> virtualized, and they are unlikely to ever be. Neither is SELinux or
> anything like that.
> 
> Containers *are* distinguishable from normal hosts, and that's by
> design. And that's in no way systemd's design, but Linux kernel stuff.

Yup, as proclaimed at kernel summit in 2008 or so.

-serge


Re: [systemd-devel] Running a systemd service in capability-only environment as non-root user

2014-05-27 Thread Serge Hallyn
Quoting Mantas Mikulėnas (graw...@gmail.com):
> On Tue, May 27, 2014 at 4:31 PM, Michal Witanowski
>  wrote:
> > Hi,
> >
> > first of all I'd like to note that I'm not sure if I'm writing in the right
> > place.
> >
> > I have a problem with running a systemd service in "capability-only
> > environment": I want to run a process with some caps (cap_sys_admin
> > cap_dac_override cap_mac_override) as a regular user (UID != 0).
> > My service config file looks something like this:
> >
> > User=test
> > CapabilityBoundingSet=cap_sys_admin cap_dac_override cap_mac_override
> > Capabilities=cap_sys_admin,cap_dac_override,cap_mac_override=eip
> > SecureBits=keep-caps
> >
> > Unfortunately, the process does not gain any permitted capabilities:
> >
> > CapInh: 00010022
> > CapPrm: 
> > CapEff: 
> > CapBnd: 00010022
> >
> > However, when I run the service as root (by removing "User=test") the
> > process does own required caps:
> >
> > CapInh: 00010022
> > CapPrm: 00010022
> > CapEff: 00010022
> > CapBnd: 00010022
> 
> Does the executable file itself have these capabilities set as "=ei"?
> 
> According to the same manual page, each capability must be set as
> inheritable for both the process and the file, to receive them _at
> all_...
> 
> "P'(permitted) = (P(inheritable) & F(inheritable)) | (F(permitted) & 
> cap_bset)"
> 
> ...and as effective on the file, otherwise the new process has to
> manually 'enable' them:
> 
> "P'(effective) = F(effective) ? P'(permitted) : 0"
> 
> ...or at least that seems to be how it works. Damn thing is confusing.

Correct.  keep_caps is about not dropping the capabilities when dropping
all uid-0 ids, i.e. when systemd does setuid(test).  But systemd is going
to proceed to exec(), and the capability sets get recalculated across that
exec() according to the rules quoted above.

So, as Mantas wrote, I think what you want is to set
cap_sys_admin,cap_dac_override,cap_mac_override=ie on the file you are
executing.  So long as you don't also set 'p', this should be ok,
as only processes with cap_sys_admin,cap_dac_override,cap_mac_override=i
will inherit them from the file.
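
In case it helps, a minimal sketch of setting that with libcap (the
binary path is a placeholder, and this is just the programmatic
equivalent of running setcap with that same text on the file; build with
-lcap):

/* Sketch: the programmatic version of
 *   setcap 'cap_sys_admin,cap_dac_override,cap_mac_override=ie' <binary>
 * using libcap. */
#include <stdio.h>
#include <sys/capability.h>

int main(void)
{
        const char *path = "/usr/local/bin/my-service";   /* placeholder */
        cap_t caps = cap_from_text("cap_sys_admin,cap_dac_override,cap_mac_override=ie");
        if (!caps) {
                perror("cap_from_text");
                return 1;
        }
        if (cap_set_file(path, caps) < 0) {
                perror("cap_set_file");
                cap_free(caps);
                return 1;
        }
        cap_free(caps);
        return 0;
}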

Kinda neat, I especially like the ability to also lock the
service into no-setuid and noroot secbits.

-serge


Re: [systemd-devel] [PATCH] netns: unix: only allow to find out unix socket in same net namespace

2013-08-26 Thread Serge Hallyn
Quoting Gao feng (gaof...@cn.fujitsu.com):
> On 08/26/2013 11:19 AM, James Bottomley wrote:
> > On Mon, 2013-08-26 at 09:06 +0800, Gao feng wrote:
> >> On 08/26/2013 02:16 AM, James Bottomley wrote:
> >>> On Sun, 2013-08-25 at 19:37 +0200, Kay Sievers wrote:
>  On Sun, Aug 25, 2013 at 7:16 PM, James Bottomley
>   wrote:
> > On Wed, 2013-08-21 at 11:51 +0200, Kay Sievers wrote:
> >> On Wed, Aug 21, 2013 at 9:22 AM, Gao feng  
> >> wrote:
> >>> On 08/21/2013 03:06 PM, Eric W. Biederman wrote:
> >>
>  I suspect libvirt should simply not share /run or any other normally
>  writable directory with the host.  Sharing /run /var/run or even /tmp
>  seems extremely dubious if you want some kind of containment, and
>  without strange things spilling through.
> >>
> >> Right, /run or /var cannot be shared. It's not only about sockets,
> >> many other things will also go really wrong that way.
> >
> > This is very narrow thinking about what a container might be and will
> > cause trouble as people start to create novel uses for containers in the
> > cloud if you try to impose this on our current infrastructure.
> >
> > One of the cgroup only container uses we see at Parallels (so no
> > separate filesystem and no net namespaces) is pure apache load balancer
> > type shared hosting.  In this scenario, base apache is effectively
> > brought up in the host environment, but then spawned instances are
> > resource limited using cgroups according to what the customer has paid.
> > Obviously all apache instances are sharing /var and /run from the host
> > (mostly for logging and pid storage and static pages).  The reason some
> > hosters do this is that it allows much higher density simple web serving
> > (either static pages from quota limited chroots or dynamic pages limited
> > by database space constraints) because each "instance" shares so much
> > from the host.  The service is obviously much more basic than giving
> > each customer a container running apache, but it's much easier for the
> > hoster to administer and it serves the customer just as well for a large
> > cross section of use cases and for those it doesn't serve, the hoster
> > usually has separate container hosting (for a higher price, of course).
> 
>  The "container" as we talk about has it's own init, and no, it cannot
>  share /var or /run.
> >>>
> >>> This is what we would call an IaaS container: bringing up init and
> >>> effectively a new OS inside a container is the closest containers come
> >>> to being like hypervisors.  It's the most common use case of Parallels
> >>> containers in the field, so I'm certainly not telling you it's a bad
> >>> idea.
> >>>
>  The stuff you talk about has nothing to do with that, it's not
>  different from all services or a multi-instantiated service on the
>  host sharing the same /run and /var.
> >>>
> >>> I gave you one example: a really simplistic one.  A more sophisticated
> >>> example is a PaaS or SaaS container where you bring the OS up in the
> >>> host but spawn a particular application into its own container (this is
> >>> essentially similar to what Docker does).  Often in this case, you do
> >>> add separate mount and network namespaces to make the application
> >>> isolated and migrateable with its own IP address.  The reason you share
> >>> init and most of the OS from the host is for elasticity and density,
> >>> which are fast becoming a holy grail type quest of cloud orchestration
> >>> systems: if you don't have to bring up the OS from init and you can just
> >>> start the application from a C/R image (orders of magnitude smaller than
> >>> a full system image) and slap on the necessary namespaces as you clone
> >>> it, you have something that comes online in milliseconds which is a feat
> >>> no hypervisor based virtualisation can match.
> >>>
> >>> I'm not saying don't pursue the IaaS case, it's definitely useful ...
> >>> I'm just saying it would be a serious mistake to think that's the only
> >>> use case for containers and we certainly shouldn't adjust Linux to serve
> >>> only that use case.
> >>>
> >>
> >> The feature you describe above vs. the container-reboot-host bug: I
> >> prefer to fix
> >> the bug.
> > 
> > What bug?
> > 
> >>  and this feature can be achieved even if the container unshares the
> >> /run directory
> >> from the host by default; for libvirt, the user can set the container
> >> configuration to
> >> make the container share the /run directory with the host.
> >>
> >> I would like to say, the reboot-from-container bug is more urgent and
> >> needs
> >> to be fixed.
> > 
> > Are you talking about the old bug where trying to reboot an lxc
> > container from within it would reboot the entire system? 
> 
> Yes, we are discussing this problem in this whole thread.
> 
>  If so, OpenVZ
> > has never suffered from that problem and I tho

Re: [systemd-devel] [Lxc-users] Unable to run systemd in an LXC / cgroup container.

2012-10-26 Thread Serge Hallyn
Quoting Michael H. Warfield (m...@wittsend.com):
> Adding in the lxc-devel list.
> 
> On Thu, 2012-10-25 at 22:59 -0400, Michael H. Warfield wrote:
> > On Thu, 2012-10-25 at 15:42 -0400, Michael H. Warfield wrote:
> > > On Thu, 2012-10-25 at 14:02 -0500, Serge Hallyn wrote:
> > > > Quoting Michael H. Warfield (m...@wittsend.com):
> > > > > On Thu, 2012-10-25 at 13:23 -0400, Michael H. Warfield wrote:
> > > > > > Hey Serge,
> > > > > > 
> > > > > > On Thu, 2012-10-25 at 11:19 -0500, Serge Hallyn wrote:
> > > > > 
> > > > > ...
> > > > > 
> > > > > > > Oh, sorry - I take back that suggestion :)
> > > > > > 
> > > > > > > Note that we have mount hooks, so templates could install a mount 
> > > > > > > hook to
> > > > > > > mount a tmpfs onto /dev and populate it.
> > > > > > 
> > > > > > Ok...  I've done some cursory search and turned up nothing but some
> > > > > > comments about "pre mount hooks".  Where is the documentation about 
> > > > > > this
> > > > > > feature and how I might use / implement it?  Some examples would
> > > > > > probably suffice.  Is there a required release version of lxc-utils?
> > > > > 
> > > > > I think I found what I needed in the changelog here:
> > > > > 
> > > > > http://www.mail-archive.com/lxc-devel@lists.sourceforge.net/msg01490.html
> > > > > 
> > > > > I'll play with it and report back.
> > > 
> > > > Also the "Lifecycle management hooks" section in
> > > > https://help.ubuntu.com/12.10/serverguide/lxc.html
> > > 
> > > This isn't working...
> > > 
> > > Based on what was in both of those articles, I added this entry to
> > > another container (Plover) to test...
> > > 
> > > lxc.hook.mount = /var/lib/lxc/Plover/mount
> > > 
> > > When I run "lxc-start -n Plover", I see this:
> > > 
> > > [root@forest ~]# lxc-start -n Plover
> > > lxc-start: unknow key lxc.hook.mount
> > > lxc-start: failed to read configuration file
> > > 
> > > I'm running the latest rc...
> > > 
> > > [root@forest ~]# rpm -qa | grep lxc
> > > lxc-0.8.0.rc2-1.fc16.x86_64
> > > lxc-libs-0.8.0.rc2-1.fc16.x86_64
> > > lxc-doc-0.8.0.rc2-1.fc16.x86_64
> > > 
> > > Is it something in git that hasn't made it to a release yet?
> 
> > nm...  I see it.  It's in git and hasn't made it to a release.  I'm
> > working on a git build to test now.  If this is something that solves
> > some of this, we need to move things along here and get these things
> > moved out.  According to git, 0.8.0rc2 was 7 months ago?  What are the
> > show stoppers here?
> 
> While the git repo says 7 months ago, the date stamp on the
> lxc-0.8.0-rc2 tarball is from July 10, so about 3-1/2 months ago.
> Sounds like we've accumulated some features (like the hooks) we are
> going to need like months ago to deal with this systemd debacle.  How
> close are we to either 0.8.0rc3 or 0.8.0?  Any blockers or are we just
> waiting on some more features?

Daniel has simply been too busy.  Stéphane has made a new branch which
cherrypicks 50 bugfixes for 0.8.0, with the remaining patches (about
twice as many) left for 0.9.0.  I'm hoping we get 0.8.0 next week :)


Re: [systemd-devel] [Lxc-users] Unable to run systemd in an LXC / cgroup container.

2012-10-26 Thread Serge Hallyn
Quoting Lennart Poettering (lenn...@poettering.net):
> On Thu, 25.10.12 14:02, Serge Hallyn (serge.hal...@canonical.com) wrote:
> 
> > > > Ok...  I've done some cursory search and turned up nothing but some
> > > > comments about "pre mount hooks".  Where is the documentation about this
> > > > feature and how I might use / implement it?  Some examples would
> > > > probably suffice.  Is there a required release version of lxc-utils?
> > > 
> > > I think I found what I needed in the changelog here:
> > > 
> > > http://www.mail-archive.com/lxc-devel@lists.sourceforge.net/msg01490.html
> > > 
> > > I'll play with it and report back.
> > 
> > Also the "Lifecycle management hooks" section in
> > https://help.ubuntu.com/12.10/serverguide/lxc.html
> > 
> > Note that I'm thinking that having lxc-start guess how to fill in /dev
> > is wrong, because different distros and even different releases of the
> > same distros have different expectations.  For instance ubuntu lucid
> > wants /dev/shm to be a directory, while precise+ wants a symlink.  So
> > somehow the template should get involved, be it by adding a hook, or
> > simply specifying a configuration file which lxc uses internally to
> > decide how to create /dev.
> 
> /dev/shm can be created/mounted/symlinked by the OS in the
> container. This is nothing LXC should care about.
> 
> My recommendation for LXC would be to unconditionally pre-mount /dev as
> tmpfs, and add exactly the device nodes /dev/null, /dev/zero, /dev/full,
> /dev/urandom, /dev/random, /dev/tty, /dev/ptmx to it. That is the
> minimal set you need to boot a machine. All further
> submounts/symlinks/dirs can be created by the OS boot logic in the
> container.

I'm thinking we'll do that, optionally.  Templates (including fedora
and ubuntu) can simply always set the option to mount and fill /dev.
Others (like busybox and mini-sshd) won't.
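
Something like the following is all that option would have to do (a
sketch only; the device numbers are the standard ones, error handling is
minimal on purpose, and it would run inside the container's mount
namespace before exec'ing its init):

/* Sketch of the optional "mount and fill /dev" step: a tmpfs on /dev
 * plus exactly the nodes listed above. */
#include <stdio.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

static const struct { const char *path; int maj, min; } nodes[] = {
        { "/dev/null",    1, 3 },
        { "/dev/zero",    1, 5 },
        { "/dev/full",    1, 7 },
        { "/dev/random",  1, 8 },
        { "/dev/urandom", 1, 9 },
        { "/dev/tty",     5, 0 },
        { "/dev/ptmx",    5, 2 },
};

int populate_dev(void)
{
        if (mount("tmpfs", "/dev", "tmpfs", MS_NOSUID, "mode=755") < 0)
                return -1;
        for (unsigned i = 0; i < sizeof(nodes) / sizeof(nodes[0]); i++)
                if (mknod(nodes[i].path, S_IFCHR | 0666,
                          makedev(nodes[i].maj, nodes[i].min)) < 0) {
                        perror(nodes[i].path);
                        return -1;
                }
        return 0;
}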

> That's what libvirt-lxc and nspawn do, and is what we defined in:
> 
> http://www.freedesktop.org/wiki/Software/systemd/ContainerInterface
> 
> It would be good if LXC would do the same in order to minimize the
> manual user configuration necessary.
> 
> Lennart

Agreed it simplifies things for full system containers with modern distros.

thanks,
-serge


Re: [systemd-devel] [Lxc-users] Unable to run systemd in an LXC / cgroup container.

2012-10-26 Thread Serge Hallyn
Quoting Michael H. Warfield (m...@wittsend.com):
> On Thu, 2012-10-25 at 20:30 -0500, Serge Hallyn wrote:
> > Quoting Michael H. Warfield (m...@wittsend.com):
> > > On Thu, 2012-10-25 at 23:38 +0200, Lennart Poettering wrote:
> > > > On Thu, 25.10.12 11:59, Michael H. Warfield (m...@wittsend.com) wrote:
> > > 
> > > > > I've got some more problems relating to shutting down containers, some
> > > > > of which may be related to mounting tmpfs on /run to which /var/run is
> > > > > symlinked to.  We're doing halt / restart detection by monitoring utmp
> > > > > in that directory but it looks like utmp isn't even in that directory
> > > > > anymore and mounting tmpfs on it was always problematical.  We may 
> > > > > have
> > > > > to have a more generic method to detect when a container has shut down
> > > > > or is restarting in that case.
> > > 
> > > > I can't parse this. The system call reboot() is virtualized for
> > > > containers just fine and the container managaer (i.e. LXC) can check for
> > > > that easily.
> > > 
> > > The problem we have had was with differentiating between reboot and halt
> > > to either shut the container down cold or restarted it.  You say
> > > "easily" and yet we never came up with an "easy" solution and monitored
> > > utmp instead for the next runlevel change.  What is your "easy" solution
> > > for that problem?
> 
> > I think you're on older kernels, where we had to resort to that.  Pretty
> > recently Daniel Lezcano's patch was finally accepted upstream, which lets
> > a container call reboot() and lets the parent of init tell whether it
> > called reboot or shutdown by looking at wTERMSIG(status).
> 
> Now THAT is wonderful news!  I hadn't realized that had been accepted.
> So we no longer need to rely on the old utmp kludge?

Yup :)  It was very liberating, in terms of what containers can do with
mounting.


Re: [systemd-devel] [Lxc-users] Unable to run systemd in an LXC / cgroup container.

2012-10-25 Thread Serge Hallyn
Quoting Michael H. Warfield (m...@wittsend.com):
> On Thu, 2012-10-25 at 23:38 +0200, Lennart Poettering wrote:
> > On Thu, 25.10.12 11:59, Michael H. Warfield (m...@wittsend.com) wrote:
> 
> > > I've got some more problems relating to shutting down containers, some
> > > of which may be related to mounting tmpfs on /run to which /var/run is
> > > symlinked to.  We're doing halt / restart detection by monitoring utmp
> > > in that directory but it looks like utmp isn't even in that directory
> > > anymore and mounting tmpfs on it was always problematical.  We may have
> > > to have a more generic method to detect when a container has shut down
> > > or is restarting in that case.
> 
> > I can't parse this. The system call reboot() is virtualized for
> > containers just fine and the container manager (i.e. LXC) can check for
> > that easily.
> 
> The problem we have had was with differentiating between reboot and halt
> > to either shut the container down cold or restart it.  You say
> "easily" and yet we never came up with an "easy" solution and monitored
> utmp instead for the next runlevel change.  What is your "easy" solution
> for that problem?

I think you're on older kernels, where we had to resort to that.  Pretty
recently Daniel Lezcano's patch was finally accepted upstream, which lets
a container call reboot() and lets the parent of init tell whether it
called reboot or shutdown by looking at wTERMSIG(status).
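
In lxc terms the parent-side check is basically this (a sketch; the
SIGHUP-for-reboot / SIGINT-for-halt convention is what that upstream
patch uses, and handle_container_exit is just an illustrative name):

/* Sketch of the parent-side check: reboot() in the container kills its
 * init with SIGHUP (restart) or SIGINT (halt/poweroff), so the manager
 * just reads the wait status. */
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>

/* pid is the container's init as seen from the host.  Returns 1 if the
 * container asked to be rebooted, 0 otherwise, -1 on error. */
int handle_container_exit(pid_t pid)
{
        int status;

        if (waitpid(pid, &status, 0) < 0)
                return -1;
        if (WIFSIGNALED(status)) {
                if (WTERMSIG(status) == SIGHUP) {
                        printf("container requested reboot -> restart it\n");
                        return 1;
                }
                if (WTERMSIG(status) == SIGINT) {
                        printf("container requested halt/poweroff\n");
                        return 0;
                }
        }
        printf("container init exited on its own\n");
        return 0;
}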

-serge


Re: [systemd-devel] [Lxc-users] Unable to run systemd in an LXC / cgroup container.

2012-10-25 Thread Serge Hallyn
Quoting Michael H. Warfield (m...@wittsend.com):
> On Thu, 2012-10-25 at 13:23 -0400, Michael H. Warfield wrote:
> > Hey Serge,
> > 
> > On Thu, 2012-10-25 at 11:19 -0500, Serge Hallyn wrote:
> 
> ...
> 
> > > Oh, sorry - I take back that suggestion :)
> > 
> > > Note that we have mount hooks, so templates could install a mount hook to
> > > mount a tmpfs onto /dev and populate it.
> > 
> > Ok...  I've done some cursory search and turned up nothing but some
> > comments about "pre mount hooks".  Where is the documentation about this
> > feature and how I might use / implement it?  Some examples would
> > probably suffice.  Is there a required release version of lxc-utils?
> 
> I think I found what I needed in the changelog here:
> 
> http://www.mail-archive.com/lxc-devel@lists.sourceforge.net/msg01490.html
> 
> I'll play with it and report back.

Also the "Lifecycle management hooks" section in
https://help.ubuntu.com/12.10/serverguide/lxc.html

Note that I'm thinking that having lxc-start guess how to fill in /dev
is wrong, because different distros and even different releases of the
same distros have different expectations.  For instance ubuntu lucid
wants /dev/shm to be a directory, while precise+ wants a symlink.  So
somehow the template should get involved, be it by adding a hook, or
simply specifying a configuration file which lxc uses internally to
decide how to create /dev.

Personally I'd prefer if /dev were always populated by the templates,
and containers (i.e. userspace) didn't mount a fresh tmpfs for /dev.
But that does complicate userspace, and we've seen that assumption creep
into debian/ubuntu as well (i.e. certain package upgrades rely on /dev
being cleared after a reboot).

-serge


Re: [systemd-devel] [Lxc-users] Unable to run systemd in an LXC / cgroup container.

2012-10-25 Thread Serge Hallyn
Quoting Michael H. Warfield (m...@wittsend.com):
> Sorry for taking a few days to get back on this.  I was delivering a
> guest lecture up at Fordham University last Tuesday so I was out of
> pocket a couple of days or I would have responded sooner...
> 
> On Mon, 2012-10-22 at 16:59 -0400, Michael H. Warfield wrote:
> > On Mon, 2012-10-22 at 22:50 +0200, Lennart Poettering wrote:
> > > On Mon, 22.10.12 11:48, Michael H. Warfield (m...@wittsend.com) wrote:
> > > 
> > > > > > To summarize the problem...  The LXC startup binary sets up various
> > > > > > things for /dev and /dev/pts for the container to run properly and 
> > > > > > this
> > > > > > works perfectly fine for SystemV start-up scripts and/or Upstart.
> > > > > > Unfortunately, systemd has mounts of devtmpfs on /dev and devpts
> > > > > > on /dev/pts which then break things horribly.  This is because the
> > > > > > kernel currently lacks namespaces for devices and won't for some 
> > > > > > time to
> > > > > > come (in design).  When devtmpfs gets mounted over top of /dev in 
> > > > > > the
> > > > > > container, it then hijacks the hosts console tty and several other
> > > > > > devices which had been set up through bind mounts by LXC and should 
> > > > > > have
> > > > > > been LEFT ALONE.
> > > > 
> > > > > Please initialize a minimal tmpfs on /dev. systemd will then work 
> > > > > fine.
> > > > 
> > > > My containers have a reasonable /dev that work with Upstart just fine
> > > > but they are not on tmpfs.  Is mounting tmpfs on /dev and recreating
> > > > that minimal /dev required?
> 
> > > Well, it can be any kind of mount really. Just needs to be a mount. And
> > > the idea is to use tmpfs for this.
> 
> > > What /dev are you currently using? It's probably not a good idea to
> > > reuse the hosts' /dev, since it contains so many device nodes that
> > > should not be accessible/visible to the container.
> 
> > Got it.  And that explains the problems we're seeing but also what I'm
> > seeing in some libvirt-lxc related pages, which is a separate and
> > distinct project in spite of the similarities in the name...
> 
> > http://wiki.1tux.org/wiki/Lxc/Installation#Additional_notes
> 
> > Unfortunately, in our case, merely getting a mount in there is a
> > complication in that it also has to be populated but, at least, we
> > understand the problem set now.
> 
> Ok...  Serge and I were corresponding on the lxc-users list and he had a
> suggestion that worked but I consider to be a bit of a sub-optimal
> workaround.  Ironically, it was to mount devtmpfs on /dev.  We don't

Oh, sorry - I take back that suggestion :)

Note that we have mount hooks, so templates could install a mount hook to
mount a tmpfs onto /dev and populate it.

Or, if everyone is going to need it, we could just add a 'lxc.populatedevs = 1'
option which does that without needing a hook.

devtmpfs should not be used in containers :)

-serge


Re: [systemd-devel] [PATCH] SELINUX: add /sys/fs/selinux mount point to put selinuxfs

2011-05-11 Thread Serge Hallyn
Quoting Eric Paris (epa...@parisplace.org):
> On Wed, May 11, 2011 at 11:13 AM, Stephen Smalley  wrote:
> > On Wed, 2011-05-11 at 10:58 -0400, Eric Paris wrote:
> >> On Wed, May 11, 2011 at 10:54 AM, John Johansen
> 
> >> > AppArmor, Tomoyo and IMA all create their own subdirectory under 
> >> > securityfs
> >> > so this should not be a problem
> >>
> >> I guess the question is, should SELinux try to move to /sys/fs/selinux
> >> or /sys/security/selinux.  The only minor issue I see with the latter
> >> is that it requires both sysfs and securityfs to be mounted before you
> >> can mount selinuxfs, whereas the first only requires sysfs.  Stephen,
> >> Casey, either of you have thoughts on the matter?
> >
> > Unless we plan to re-implement selinuxfs as securityfs nodes, I don't
> > see why we'd move to /sys/security/selinux; we don't presently depend on
> > securityfs and it isn't commonly mounted today.  selinuxfs has some
> > specialized functionality that may not be trivial to reimplement via
> > securityfs, and there was concern about userspace compatibility breakage
> > when last we considered using securityfs.
> 
> The reason we would move to /sys/security/ instead of /sys/fs/ is
> because other LSMs are already there and it would look consistent.

Actually I think it'd be deceptive precisely because (aiui) /sys/security
is for securityfs, while /sys/fs is for virtual filesystems.

I suppose we could whip this issue by having /sys/security be under
/sys/fs/security :)  Too late for that too.

-serge