Re: [libvirt] [systemd-devel] How to make udev not touch my device?

2016-11-11 Thread Lennart Poettering
On Fri, 11.11.16 14:15, Michal Sekletar (msekl...@redhat.com) wrote:

> On Mon, Nov 7, 2016 at 1:20 PM, Daniel P. Berrange <berra...@redhat.com> 
> wrote:
> 
> > So if libvirt creates a private mount namespace for each QEMU and mounts
> > a custom /dev there, this is invisible to udev, and thus udev won't/can't
> > mess with permissions we set in our private /dev.
> >
> > For hotplug, the libvirt QEMU would do the same as the libvirt LXC driver
> > currently does. It would fork and setns() into the QEMU mount namespace
> > and run mknod()+chmod() there, before doing the rest of its normal hotplug
> > logic. See lxcDomainAttachDeviceMknodHelper() for what LXC does.
> 
> We try to migrate people away from using mknod and messing with /dev
> from user-space. For example, we had to deal with non-trivial problems
> wrt. mknod and the Veritas storage stack in the past (most of these issues
> remain unsolved to date). I don't like to hear that you plan to get
> into the /dev management business in libvirt too. I am judging based on
> past experience; nevertheless, I don't like this plan.

Well, I'd say: if people create their own /dev, they are welcome to do
whatever they want in it. They should just stay away from the host's
/dev, however, and not interfere with udev's management of that.
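The fork + setns() + mknod() sequence described above can be sketched roughly as follows. This is an illustration only, not libvirt's actual implementation (which is C code, in helpers like lxcDomainAttachDeviceMknodHelper()); the function names and paths here are hypothetical:

```python
import os
import stat

def block_device_id(major: int, minor: int) -> int:
    """Combine major/minor numbers into a dev_t, as mknod() expects."""
    return os.makedev(major, minor)

def create_private_dev_node(path: str, major: int, minor: int,
                            uid: int, gid: int, mode: int = 0o600) -> None:
    """Create a block device node inside a guest's private /dev and hand
    it to the QEMU user. Must run as root, *after* entering the guest's
    mount namespace (e.g. via os.setns() on Python >= 3.12, or setns()
    in C), so the host's /dev -- and udev -- never see the node."""
    os.mknod(path, mode | stat.S_IFBLK, block_device_id(major, minor))
    os.chown(path, uid, gid)
```

The point of doing this inside the private mount namespace is exactly what Daniel describes: udev only watches the host's /dev, so nodes created in the namespaced /dev are invisible to it.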

Lennart

-- 
Lennart Poettering, Red Hat

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list


Re: [libvirt] [systemd-devel] How to make udev not touch my device?

2016-11-11 Thread Lennart Poettering
On Mon, 07.11.16 09:17, Daniel P. Berrange (berra...@redhat.com) wrote:

> On Fri, Nov 04, 2016 at 08:47:34AM +0100, Michal Privoznik wrote:
> > Hey udev developers,
> > 
> > I'm a libvirt developer and I've been facing an interesting issue
> > recently. Libvirt is a library for managing virtual machines and as such
> > allows basically any device to be exposed to a virtual machine. For
> > instance, a virtual machine can use /dev/sdX as its own disk. For
> > security reasons we allow users to configure their VMs to run under a
> > different UID/GID and also a different SELinux context. That means that
> > whenever a VM is being started up, libvirtd (our daemon) relabels all the
> > necessary paths that the QEMU process (representing the VM) can touch.
> > However, I'm facing an issue that I don't know how to fix. In some cases
> > QEMU can close & reopen a block device. Closing a block device
> > triggers a udev event, and if there is a rule that sets a security
> > label on the device, the QEMU process is unable to reopen it.
> > 
> > My question is, what can we do to prevent udev from messing with the
> > security labels that we've set on the devices?
> > 
> > One of the ideas our lead developer had was for libvirt to set some kind
> > of udev label on devices managed by libvirt (when setting up security
> > labels) and then whenever udev sees such a labelled device it won't touch
> > it at all (this could perhaps be achieved by a rule?). Later, when the
> > domain is shutting down, libvirt removes that label. But I don't think
> > setting an arbitrary label on devices is supported, is it?
> 
> Having thought about this over the weekend, I'm strongly inclined to
> just take udev out of the equation by starting a new mount namespace
> for each QEMU we launch and setting up a custom /dev containing just
> the devices we need. This will be both a security improvement and
> will avoid the udev races, with no complex code required in libvirt,
> and it will work all the way back to RHEL 6.

I think this would be a pretty nice solution, indeed!

Lennart

-- 
Lennart Poettering, Red Hat



Re: [libvirt] [REPOST] regarding cgroup v2 support in libvirt

2016-10-21 Thread Lennart Poettering
On Fri, 21.10.16 11:19, Daniel P. Berrange (berra...@redhat.com) wrote:

> On Thu, Oct 20, 2016 at 02:59:45PM -0400, Tejun Heo wrote:
> > (reposting w/ libvir-list cc'd, sorry about the delay in reposting,
> >  was traveling and then on vacation)
> > 
> > Hello, Daniel.  How have you been?
> > 
> > We (facebook) are deploying cgroup v2 and internally use libvirt to
> > manage virtual machines, so I'm trying to add cgroup v2 support to
> > libvirt.
> > 
> > Because cgroup v2's resource configurations differ from v1 in varying
> > degrees depending on the specific resource type, it unfortunately
> > introduces new configurations (some completely new configs, others
> > just a different range / format).  This means that adding cgroup v2
> > support to libvirt requires adding new config options to it and maybe
> > implementing some form of translation mechanism between overlapping
> > configs.
> > 
> > The upcoming systemd release includes all that's necessary to support
> > v1/v2 compatibility so that users setting resource configs through
> > systemd don't have to worry about whether v1 or v2 is in use.  I'm
> > wondering whether it would make sense to make libvirt use dbus calls
> > to systemd to set resource configs when systemd is in use, so that it
> > can piggyback on systemd's v1/v2 compatibility.
> 
> The big question I have around cgroup v2 is state of support for all
> controllers that libvirt uses (cpu, cpuacct, cpuset, memory, devices,
> freezer, blkio).  IIUC, not all of these have been ported to cgroup
> v2 setup and the cpu port in particular was rejected by Linux maintainers.
> Libvirt has a general policy that we won't support features that only
> exist in out of tree patches (applies to kernel and any other software
> we build against or use).
> 
> IIRC from earlier discussions, the model for dealing with processes in
> cgroup v2 was quite different. In libvirt we rely on the ability to
> assign different threads within a process to different cgroups, because
> we need to control CPU scheduler parameters on different threads in
> QEMU, e.g. we have vCPU threads, I/O threads and general emulator threads,
> each of which gets different policies.
> 
> When I spoke with Lennart about cgroup v2, way back in Jan, he indicated
> that while systemd can technically work with a system where some
> controllers are mounted as v1, while others are mounted as v2, this
> would not be an officially supported solution. Thus systemd in Fedora
> was not likely to switch to v2 until all required controllers could use
> v2. I'm not sure if this still corresponds to Lennart's current views, so
> CC'ing him to confirm/deny.

So, the "hybrid" mode is probably nothing RHEL or so would want to
support. However, I think it might be a good step for Fedora at
least. But yes, supporting this mode means additional porting effort
for the various daemons that access cgroupfs...

> I recall that the systemd policy for v2 was intended to be that no app
> should write to cgroup sysfs except for systemd, unless there was
> a sub-tree created with Delegate=yes set on the scope. So this clearly
> means when using v2 we'll have to use the systemd DBus APIs for managing
> cgroups v2 on such hosts.

Yes, this is our policy: the cgroup tree is the private property of
systemd (at least regarding write access), except when you have a
service or scope unit where Delegate=yes is set, in which case you can
manage your own subtree of that freely.

Lennart

-- 
Lennart Poettering, Red Hat



Re: [libvirt] [systemd-devel] systemd-cgroups-agent not working in containers

2014-11-30 Thread Lennart Poettering
 session logic correctly and didn't invoke the PAM session close
hooks, didn't keep the parent process around to do so, or
suchlike. What kind of PAM session do you run into this problem with?

 How do you run the most current systemd on your distro?

Well, I as a developer just build it from the git tree, after
installing all deps, with 

./autogen.sh c && make -j6 && sudo make install

Lennart

-- 
Lennart Poettering, Red Hat



Re: [libvirt] [systemd-devel] systemd-cgroups-agent not working in containers

2014-11-30 Thread Lennart Poettering
On Fri, 28.11.14 15:52, Richard Weinberger (rich...@nod.at) wrote:

 On 28.11.2014 at 06:33, Martin Pitt wrote:
  Hello all,
  
  Cameron Norman [2014-11-27 12:26 -0800]:
  On Wed, Nov 26, 2014 at 1:29 PM, Richard Weinberger <rich...@nod.at> wrote:
  Hi!
 
  I run a Linux container setup with openSUSE 13.1/2 as guest distro.
  After some time containers slow down.
  An investigation showed that the containers slow down because a lot of
  stale user sessions slow down almost all systemd tools, mostly systemctl.
  loginctl reports many thousands of sessions, all in state "closing".
 
  This sounds similar to an issue that systemd-shim in Debian had.
  Martin Pitt (helps to maintain systemd in Debian) fixed that issue; he
  may have some ideas here. I CC'd him.
  
  The problem with systemd-shim under sysvinit or upstart was that shim
  didn't set a cgroup release agent like systemd itself does. Thus the
  cgroups were never cleaned up after all the session processes died.
  (See 1.4 on https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt
  for details)
  
  I don't think that SUSE uses systemd-shim; I take it that in this setup you
  are running systemd proper on both the host and the guest? Then I
  suggest checking the cgroups that correspond to the closing sessions
  in the container, i.e. /sys/fs/cgroup/systemd/.../session-XX.scope/tasks.
  If there are still processes in it, logind is merely waiting for them
  to exit (or set KillUserProcesses in logind.conf). If they are empty,
  check that /sys/fs/cgroup/systemd/.../session-XX.scope/notify_on_release is 1
  and that /sys/fs/cgroup/systemd/release_agent is set?
 
 The problem is that within the container the release agent is not executed.
 It is executed on the host side.
 
 Lennart, how is this supposed to work?
 Is the theory of operation that the host systemd sends the
 org.freedesktop.systemd1.Agent "Released" signal via D-Bus into the guest?
 The guest's systemd definitely does not receive such a signal.

No, the cgroup agents are not reliable, because of subgroups, and
because of their incompatibility with containers. systemd uses the
events if it gets them, but we try hard to be able to live without
them (see other mail).
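Martin's checklist earlier in the thread (is the scope's tasks file empty? is notify_on_release set?) can be condensed into a small helper. A sketch, assuming the cgroup v1 layout, with the file contents passed in directly rather than read from /sys:

```python
def session_state(tasks: str, notify_on_release: str = "1") -> str:
    """Classify a session-XX.scope cgroup per Martin's checklist, given
    the contents of its 'tasks' and 'notify_on_release' files: if
    'tasks' still lists PIDs, logind is merely waiting for them to
    exit; if it is empty but notify_on_release is not 1, the release
    agent will never fire and the session stays in 'closing' forever;
    otherwise the kernel should invoke the release agent."""
    if tasks.split():
        return "waiting-for-processes"
    if notify_on_release.strip() != "1":
        return "empty-but-no-notify"
    return "empty-agent-should-fire"
```

As the exchange above notes, even the "agent-should-fire" case is unreliable inside a container, since the agent runs on the host side; systemd therefore treats the agent events as best-effort hints.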

Lennart

-- 
Lennart Poettering, Red Hat



Re: [libvirt] [systemd-devel] Suspending access to opened/active /dev/nodes during application runtime

2014-03-07 Thread Lennart Poettering
On Fri, 07.03.14 19:45, Lukasz Pawelczyk (hav...@gmail.com) wrote:

 Problem:
 Has anyone thought about a mechanism to limit or remove access to a
 device during an application's runtime? Meaning we have an application
 that has an open file descriptor to some /dev node and depending on
 *something* it gains or loses access to it gracefully (with or
 without a notification, but without any fatal consequences).

logind can mute input devices as sessions are switched, to enable
unprivileged X11 and Wayland compositors.

 Example:
 LXC. Imagine we have 2 separate containers. Both running full operating
 systems. Specifically with 2 X servers. Both running concurrently of

Well, devices are not namespaced on Linux (with the single exception of
network devices). An X server needs device access, hence this doesn't
fly at all.

When you enumerate devices with libudev in a container they will never
be marked as initialized, you do not get any udev hotplug events in
containers, and you don't have the host's udev db around, nor would it
make any sense to you if you had. X11 and friends rely on udev,
however...

Before you think about doing something like this, you need to fix the
kernel to provide namespaced devices (good luck!)

 course. Both need the same input devices (e.g. we have just one mouse).
 This creates a security problem when we want to have completely separate
 environments. One container is active (being displayed on a monitor and
 controlled with a mouse) while the other container runs evtest
 /dev/input/something and grabs the secret password the user typed in the
 other.

logind can do this for you between sessions. But such a container setup
will never work without proper device namespacing.

 Solutions:
 The complete solution would consist of 2 parts:
 - a mechanism that would allow temporarily hiding a device from an
 open file descriptor.
 - a mechanism for deciding whether an application/process/namespace should
 have access to a specific device at a specific moment

Well, there's no point in inventing any mechanisms like this, as long
as devices are not namespaced in the kernel, so that userspace in
containers can enumerate/probe/identify/... things correctly...

Lennart

-- 
Lennart Poettering, Red Hat



Re: [libvirt] [systemd-devel] Suspending access to opened/active /dev/nodes during application runtime

2014-03-07 Thread Lennart Poettering
On Fri, 07.03.14 21:51, Lukasz Pawelczyk (hav...@gmail.com) wrote:

  Problem:
  Has anyone thought about a mechanism to limit or remove access to a
  device during an application's runtime? Meaning we have an
  application that has an open file descriptor to some /dev node and
  depending on *something* it gains or loses access to it
  gracefully (with or without a notification, but without any fatal
  consequences).
  
  logind can mute input devices as sessions are switched, to enable
  unprivileged X11 and Wayland compositors.
 
 Would you please elaborate on this? Where is this mechanism? How does
 it work without kernel space support? Is there some kernel space
 support I’m not aware of?

There's EVIOCREVOKE for input devices and
DRM_IOCTL_SET_MASTER/DRM_IOCTL_DROP_MASTER for DRM devices. See logind
sources.

  Before you think about doing something like this, you need to fix the
  kernel to provide namespaced devices (good luck!)
 
 Precisely! That's the generic idea. I'm not for implementing it, though,
 at this moment. I just wanted to know whether anybody actually thought
 about it or maybe someone is interested in starting such work, etc.

It's not just about turning on and turning off access to the event
stream. It's mostly about enumeration and probing, which doesn't work in
containers, and is particularly messy if you intend to share devices
between containers.

  logind can do this for you between sessions. But such a container setup
  will never work without proper device namespacing.
 
 So how can it do it when there is no kernel support? You mean it could
 be doing this if the support were there?

EVIOCREVOKE and the DRM ioctls are pretty real...

Lennart

-- 
Lennart Poettering, Red Hat


Re: [libvirt] [PATCH] Change default resource partition to /machine

2013-04-18 Thread Lennart Poettering
On Thu, 18.04.13 08:33, Eric Blake (ebl...@redhat.com) wrote:

 On 04/18/2013 04:11 AM, Daniel P. Berrange wrote:
  From: Daniel P. Berrange <berra...@redhat.com>
  
  After discussions with systemd developers it was decided that
  a better default policy for resource partitions is to have
  3 default partitions at the top level
  
 /system  - system services
 /machine - virtual machines / containers
 /user    - user login sessions
  
  This ensures that the default policy isolates guests from
  user login sessions & system services, so a misbehaving
  guest can't consume 100% of the CPU if other things are
  contending for it.
  
  Thus we change the default partition from /system to
  /machine
  
  Signed-off-by: Daniel P. Berrange <berra...@redhat.com>
  ---
   src/lxc/lxc_cgroup.c   | 2 +-
   src/qemu/qemu_cgroup.c | 2 +-
   2 files changed, 2 insertions(+), 2 deletions(-)
 
 ACK.  But is it worth making this configurable in qemu.conf/lxc.conf, in
 case policy changes yet again?

Just to provide some context to this: we are confident enough to
hardcode these three paths in systemd.

Lennart

-- 
Lennart Poettering - Red Hat, Inc.
