Re: [systemd-devel] udev virtio by-path naming

2017-03-01 Thread Daniel P. Berrange
On Wed, Mar 01, 2017 at 07:28:46PM +0100, Viktor Mihajlovski wrote:
> On 01.03.2017 16:58, Daniel P. Berrange wrote:
> > given a basic Fedora 25 guest, with a virtio-mmio disk added as per the
> > guide above...
> > 
> >   looking at device '/devices/platform/a003e00.virtio_mmio/virtio3/block/vda':
> > KERNEL=="vda"
> > SUBSYSTEM=="block"
> > DRIVER==""
> > ATTR{alignment_offset}=="0"
> > ATTR{badblocks}==""
> > ATTR{cache_type}=="write back"
> > ATTR{capability}=="50"
> > ATTR{discard_alignment}=="0"
> > ATTR{ext_range}=="256"
> > ATTR{inflight}=="   00"
> > ATTR{range}=="16"
> > ATTR{removable}=="0"
> > ATTR{ro}=="0"
> > ATTR{serial}==""
> > ATTR{size}=="2097152"
> > ATTR{stat}=="  940 4208  285000    00  100  280"
> > 
> >   looking at parent device '/devices/platform/a003e00.virtio_mmio/virtio3':
> > KERNELS=="virtio3"
> > SUBSYSTEMS=="virtio"
> > DRIVERS=="virtio_blk"
> > ATTRS{device}=="0x0002"
> > ATTRS{features}=="0010101101111100"
> > ATTRS{status}=="0x0007"
> > ATTRS{vendor}=="0x554d4551"
> > 
> >   looking at parent device '/devices/platform/a003e00.virtio_mmio':
> > KERNELS=="a003e00.virtio_mmio"
> > SUBSYSTEMS=="platform"
> > DRIVERS=="virtio-mmio"
> > ATTRS{driver_override}=="(null)"
> Since I can't do that on my box, would you be so kind to run
>  ls -l /dev/disk/by-path
> If it returns ids like
>   virtio-pci-a003e00.virtio_mmio[-partn]
> my suggested patch should be OK for ARM in that it will produce ids in
> the format
>   platform-a003e00.virtio_mmio[-partn]

Ok, my guest has 4 disks

 - sda - virtio-scsi, over virtio-pci transport
 - sdb - virtio-scsi, over virtio-mmio transport
 - vda - virtio-blk, over virtio-pci transport
 - vdb - virtio-blk, over virtio-mmio transport

with systemd 231 I get these links

  platform-3f00.pcie-pci-:00:01.1-virtio-pci-:02:00.0-scsi-0:0:0:0 -> ../../sda
  platform-3f00.pcie-pci-:00:01.3-virtio-pci-:04:00.0 -> ../../vda
  virtio-pci-a003c00.virtio_mmio -> ../../vdb
  virtio-pci-a003e00.virtio_mmio-scsi-0:0:0:0 -> ../../sdb

after applying your patch I get these links:

 platform-3f00.pcie-pci-:00:01.1-virtio-pci-:02:00.0-scsi-0:0:0:0 -> ../../sda
 platform-3f00.pcie-pci-:00:01.3-virtio-pci-:04:00.0 -> ../../vda
 platform-3f00.pcie-pci-:02:00.0-scsi-0:0:0:0 -> ../../sda
 platform-3f00.pcie-pci-:04:00.0 -> ../../vda
 platform-a003c00.virtio_mmio -> ../../vdb
 platform-a003e00.virtio_mmio-scsi-0:0:0:0 -> ../../sdb
 virtio-pci-a003c00.virtio_mmio -> ../../vdb
 virtio-pci-a003e00.virtio_mmio-scsi-0:0:0:0 -> ../../sdb

So that appears to be working as designed - the 4 backcompat symlinks are
still there, and the new symlinks all live under the platform- prefix
and don't have a bogus 'pci' in the name for mmio links
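
As a rough illustration of the naming seen in the listings above (a sketch only, not udev's actual path_id code), the legacy and new link names for a virtio-mmio block device can be derived from its sysfs path. The function name `mmio_link_names` and the restriction to plain block devices (no -scsi or partition suffix) are assumptions made for this example.

```python
from typing import Optional, Tuple


def mmio_link_names(devpath: str) -> Optional[Tuple[str, str]]:
    """Return (legacy_name, new_name) for a virtio-mmio sysfs path.

    Hypothetical helper: it only looks for a '<addr>.virtio_mmio'
    component directly below 'platform' and ignores SCSI/partition
    suffixes. Returns None for anything else (e.g. virtio-pci).
    """
    parts = devpath.strip("/").split("/")
    for parent, child in zip(parts, parts[1:]):
        if parent == "platform" and child.endswith(".virtio_mmio"):
            # The legacy link carries the misleading 'virtio-pci-' prefix,
            # the new one uses the real parent subsystem, 'platform'.
            return f"virtio-pci-{child}", f"platform-{child}"
    return None


print(mmio_link_names("/devices/platform/a003c00.virtio_mmio/virtio4/block/vdb"))
```

Keeping both names in the return value mirrors the back-compat behaviour described above, where the old and new symlinks coexist.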

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://entangle-photo.org   -o-http://search.cpan.org/~danberr/ :|
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] udev virtio by-path naming

2017-03-01 Thread Daniel P. Berrange
On Wed, Mar 01, 2017 at 03:58:12PM +, Daniel P. Berrange wrote:
> On Wed, Mar 01, 2017 at 04:02:53PM +0100, Viktor Mihajlovski wrote:
> > If wanted, I can take a stab at virtio-mmio, but would need the output
> > of udevadm -a /dev/vda from a virtio-mmio system.
> 
> Presumably you mean 'udevadm info -a /dev/vda' ?  That reports the following,
> given a basic Fedora 25 guest, with a virtio-mmio disk added as per the
> guide above...
> 
>   looking at device '/devices/platform/a003e00.virtio_mmio/virtio3/block/vda':

BTW, the hex digits in here are the virtio-mmio base address, which differs
per device. E.g. if I have 3 virtio-mmio backed disks, I get

  looking at device '/devices/platform/a003a00.virtio_mmio/virtio3/block/vda':
  looking at device '/devices/platform/a003c00.virtio_mmio/virtio4/block/vdb':
  looking at device '/devices/platform/a003e00.virtio_mmio/virtio5/block/vdc':
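
A small sketch of pulling that address out of the sysfs path, assuming the `<hex>.virtio_mmio` component always sits directly under `platform` as in the three paths above. The helper name `mmio_base_address` is made up for the example.

```python
import re
from typing import Optional

# Matches the '<hex>.virtio_mmio' path component under 'platform'.
_MMIO_RE = re.compile(r"/platform/([0-9a-f]+)\.virtio_mmio(?:/|$)")


def mmio_base_address(devpath: str) -> Optional[int]:
    """Return the MMIO base address encoded in the sysfs path, or None."""
    match = _MMIO_RE.search(devpath)
    return int(match.group(1), 16) if match else None


paths = [
    "/devices/platform/a003a00.virtio_mmio/virtio3/block/vda",
    "/devices/platform/a003c00.virtio_mmio/virtio4/block/vdb",
    "/devices/platform/a003e00.virtio_mmio/virtio5/block/vdc",
]
for p in paths:
    print(hex(mmio_base_address(p)))
```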


Regards,
Daniel


Re: [systemd-devel] udev virtio by-path naming

2017-03-01 Thread Daniel P. Berrange
On Wed, Mar 01, 2017 at 04:02:53PM +0100, Viktor Mihajlovski wrote:
> On 01.03.2017 04:30, Zbigniew Jędrzejewski-Szmek wrote:
> > On Tue, Feb 28, 2017 at 09:47:42AM +0100, Viktor Mihajlovski wrote:
> >> One could argue about back-level compatibility, but virtio by-path
> >> naming has changed multiple times. We have seen virtio-pci-virtio
> >> (not predictable), pci- and virtio-pci- already. It
> >> might be a good time now to settle on a common approach for all
> >> virtio types.
> >>
> >> For the reasons above, I'd vote for -, which
> >> would work for PCI and CCW, not sure about ARM MMIO though.
> > 
> > It seems that there's agreement that - is the right
> > approach.
> > 
> > Ideally we would keep the virtio-pci- links as they appear
> > right now, for backwards compatibility, just for the pci devices, and
> > mark them as deprecated (dunno where, maybe just in NEWS), and add the
> > code to make the links.
> > 
> > I haven't looked at the code, maybe we just do this with the right
> > udev rule, and also stick the deprecation comment there?
> > 
> > Zbyszek
> > 
> I've posted a github pull request [1], and would appreciate review
> feedback. As I am lacking an ARM setup, it would also be nice if someone
> with ARM skills could have a look as well.

FYI you can install ARMv7 guests on an x86_64 host, using pre-built Fedora
images

  https://fedoraproject.org/wiki/QA:Testcase_Virt_ARM_on_x86

NB, this will install the guest using virtio-pci. So if you want to
see virtio-mmio in action, you'll need to edit the libvirt XML config
afterwards to add another disk, eg

  [libvirt disk XML snippet not preserved in the archive]

> If wanted, I can take a stab at virtio-mmio, but would need the output
> of udevadm -a /dev/vda from a virtio-mmio system.

Presumably you mean 'udevadm info -a /dev/vda' ?  That reports the following,
given a basic Fedora 25 guest, with a virtio-mmio disk added as per the
guide above...

  looking at device '/devices/platform/a003e00.virtio_mmio/virtio3/block/vda':
KERNEL=="vda"
SUBSYSTEM=="block"
DRIVER==""
ATTR{alignment_offset}=="0"
ATTR{badblocks}==""
ATTR{cache_type}=="write back"
ATTR{capability}=="50"
ATTR{discard_alignment}=="0"
ATTR{ext_range}=="256"
ATTR{inflight}=="   00"
ATTR{range}=="16"
ATTR{removable}=="0"
ATTR{ro}=="0"
ATTR{serial}==""
ATTR{size}=="2097152"
ATTR{stat}=="  940 4208  285000    00  100  280"

  looking at parent device '/devices/platform/a003e00.virtio_mmio/virtio3':
KERNELS=="virtio3"
SUBSYSTEMS=="virtio"
DRIVERS=="virtio_blk"
ATTRS{device}=="0x0002"
ATTRS{features}=="0010101101111100"
ATTRS{status}=="0x0007"
ATTRS{vendor}=="0x554d4551"

  looking at parent device '/devices/platform/a003e00.virtio_mmio':
KERNELS=="a003e00.virtio_mmio"
SUBSYSTEMS=="platform"
DRIVERS=="virtio-mmio"
ATTRS{driver_override}=="(null)"

  looking at parent device '/devices/platform':
KERNELS=="platform"
SUBSYSTEMS==""
DRIVERS==""



Regards,
Daniel


Re: [systemd-devel] udev virtio by-path naming

2017-02-28 Thread Daniel P. Berrange
On Mon, Feb 20, 2017 at 04:14:32PM +0100, Lennart Poettering wrote:
> On Mon, 20.02.17 15:34, Viktor Mihajlovski (mihaj...@linux.vnet.ibm.com) wrote:
> 
> > But then, I find this naming scheme somewhat weird.
> > A virtio disk shows up as a regular PCI function on the PCI
> > bus side by side with other (non-virtio) devices. The naming otoh
> > suggests that virtio-pci is a subsystem of its own, which is simply
> > incorrect from a by-path perspective.
> > 
> > Using just the plain PCI path id is actually sufficient to identify
> > a virtio disk by its path. This would be in line with virtio
> > network interface path names which use the plain PCI naming.
> > 
> > One could argue about back-level compatibility, but virtio by-path
> > naming has changed multiple times. We have seen virtio-pci-virtio
> > (not predictable), pci- and virtio-pci- already. It
> > might be a good time now to settle on a common approach for all
> > virtio types.
> > 
> > For the reasons above, I'd vote for -, which
> > would work for PCI and CCW, not sure about ARM MMIO though.
> > Opinions?

Virtio MMIO devices are identified by a unique control register
base address, e.g. 0x3000. So I think - would
work fine for all cases: PCI, CCW & MMIO.  Certainly it is more
correct than hardcoding virtio-pci as a prefix - that's just
plain broken for non-PCI transports.

> So, to make this clear, we in systemd are kinda interested in
> splitting out these virtio helpers into some external project
> maintained by virtio people. We as systemd/udev maintainers have very
> little understanding of the underlying technology, so we can't really
> be any good maintainers of this, and we can't really comment on this
> stuff, in particular when it gets more exotic, like the CCW stuff.
> 
> Even better would be if the kernel would do the naming on its own, and
> maybe just provide us with a sysattr on the relevant devices that we
> can read to determine the path from, so that we don't have to maintain
> this at all in userspace. That way, the driver folks on the kernel
> side can use any naming they like without ever having to patch this
> into systemd or udev.
> 
> This is similar to SCSI stuff and all things like that: the more
> exotic it gets the less place this really has in systemd, we are not
> the right maintainers for this. And given that this is all nicely
> pluggable (you can ship your own udev extensions externally very
> easily), there's really no reason for this to be in systemd/udev.

The other post about ptp-kvm rules reminded me that I wanted to
respond to this mail too.

The problem with splitting these rules out into a separate project
is that there's no other existing place that they would live. The
"virtio people" as a group merely write specifications. The actual
implementation of those specs is done by multiple other independent
groups - QEMU (for host side, though other host side impls exist
too) and Linux (for guest side). The udev rules are Linux guest
support pieces, but of course Linux itself doesn't distribute udev
rules - it delegated that job to the udev package hence why they
are here currently. So I don't see that pushing the rules out of
the udev repo would be beneficial to people building VMs.

> Anyway, I fear you're going to have a hard time involving us in a
> technical discussions about the issue you are raising, since quite
> frankly we have no clue about virtio...

Could it be as simple as having a couple of people nominated as the
technical point of contact for the virtio rules, who can be CC'd
to answer any questions that come up? I don't have
time to actively monitor systemd pull requests for changes affecting
virtio, but I'd be ok with being pinged if issues come up that need
assistance & can pull in other virt experts where needed.

Regards,
Daniel


Re: [systemd-devel] [PATCH] udev rules: add udev rule to create /dev/ptp_kvm

2017-02-28 Thread Daniel P. Berrange
On Mon, Feb 27, 2017 at 11:50:59PM -0300, Marcelo Tosatti wrote:
> On Sun, Feb 26, 2017 at 09:52:18PM +0100, Lennart Poettering wrote:
> > On Thu, 23.02.17 22:20, Marcelo Tosatti (mtosa...@redhat.com) wrote:
> > 
> > > 
> > > It's necessary to specify the KVM PTP device name in userspace.
> > > 
> > > In case a network card with PTP device is assigned to the guest,
> > > it might be the case that KVM PTP gets /dev/ptp0 instead of /dev/ptp1. 
> > > 
> > > Fix a device name for the KVM PTP device.
> > 
> > What's the symlink precisely good for, can you elaborate? 
> 
> You want to configure Chrony to use PTP in the guest to sync with the
> host.
> 
> You need to add an entry to /etc/chrony.conf pointing to "/dev/ptp0",
> the ptp_kvm device.
> 
> However, it might be the case that a PCI assigned device has a PTP
> clock, and it can be registered as "/dev/ptp0" and ptp_kvm as
> "/dev/ptp1".
> 
> > Also, what's the benefit of shipping this upstream? Why not ship that
> > rule with kvm?
> 
> qemu-kvm package? Sure i can do that, but then all distributions 
> have to do the same with their own packages.

qemu-kvm is installed in the host OS only, but this rule needs to be
set in the guest OS, unless you want to bundle it in with qemu-guest-agent
RPM, but that's not really a directly related package, so we'd likely have
to create a new package for this and try and get distros to ensure it is
installed in all guest OS. We've had qemu-guest-agent for years now and
we've still not got all distros installing it. So not shipping this kind
of rule with udev means that it'll almost certainly end up being missing
in the majority of guest installs for many years to come.
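
To make the /dev/ptp0 vs /dev/ptp1 ambiguity concrete, here is a sketch of how a guest-side tool could locate the KVM PTP clock without relying on the device index. It assumes that each PTP clock exposes a `clock_name` attribute under /sys/class/ptp and that the ptp_kvm driver reports the name "KVM virtual PTP"; both the constant and the helper name `find_kvm_ptp` are assumptions of this example, not something stated in the thread.

```python
from pathlib import Path
from typing import Optional

# Assumed clock name reported by the ptp_kvm driver.
KVM_CLOCK_NAME = "KVM virtual PTP"


def find_kvm_ptp(sysfs_ptp: str = "/sys/class/ptp") -> Optional[str]:
    """Return the /dev path of the KVM PTP clock, or None if absent."""
    root = Path(sysfs_ptp)
    if not root.is_dir():
        return None
    for entry in sorted(root.glob("ptp*")):
        try:
            name = (entry / "clock_name").read_text().strip()
        except OSError:
            continue  # entry without a readable clock_name attribute
        if name == KVM_CLOCK_NAME:
            return f"/dev/{entry.name}"
    return None


print(find_kvm_ptp())
```

A stable symlink created by a udev rule removes the need for this kind of lookup in every consumer, which is the point of shipping the rule where all guests will get it.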

Regards,
Daniel


Re: [systemd-devel] [PATCH] systemd: add RDTCacheReservation= option to support CAT (Cache Allocation Technology)

2017-01-09 Thread Daniel P. Berrange
On Fri, Jan 06, 2017 at 01:51:17PM -0200, Marcelo Tosatti wrote:
> On Fri, Jan 06, 2017 at 05:26:36PM +0200, Mantas Mikulėnas wrote:
> > On Fri, Jan 6, 2017 at 3:59 PM, Marcelo Tosatti  wrote:
> > 
> > >
> > >
> > > Cache Allocation Technology is a feature on selected recent Intel Xeon
> > > processors which allows control over L3 cache allocation.
> > >
> > > Kernel support has been merged to the upstream kernel, via a filesystem
> > > resctrlfs.
> > >
> > > On top of that, a userspace utility, resctrltool has been written
> > > to facilitate writing applications and using the filesystem
> > > interface (see the rationale at
> > > http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1300792.html).
> > >
> > > This patch adds a new option to systemd, RDTCacheReservation,
> > > to allow configuration of CAT via resctrltool.
> > >
> > > See the first hunk of the patch for a description of the option
> > 
> > 
> > This really doesn't look pretty, neither the approach nor the
> > implementation...
> 
> Suggestions to improve the code or the approach are welcome.
> 
> > Is the option actually so complex that calling resctrltool is the only way
> > to adjust it? What about writing to the resctrlfs directly?
> 
> You'll have to deal with the issues that resctrltool deals with,
> namely:
> 
> 1) Filesystem locking.
> 2) Reading in every directory and the default directory.
> 3) Converting the reservation request to proper sizes.
> 4) Converting:
>   type=both --> type=data/type=code
> 
>   type=data/type=code --> type=both
> 
> 5) Finding free space for the reservation.
> 6) Adjusting the default group reservation.
> 
> Since these steps must be performed by every user of
> CAT (including libvirt which plans to execute resctrltool
> as well), it was decided it's better to maintain this logic
> in a centralized place.

Errr, no, that's not correct - Libvirt is certainly not going to
spawn some python program to do this. If there's no C library API
for this, libvirt will simply implement all the logic itself.
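
For a sense of what that logic involves, the size conversion and free-space search from the quoted list can be sketched as below. This is neither resctrltool's nor libvirt's algorithm; it only assumes that the cache is divided into equally sized ways and that a reservation must be a contiguous run of bits in the capacity bitmask. The function names and the example cache geometry are hypothetical.

```python
from typing import Optional


def ways_needed(request_kb: int, cache_kb: int, num_ways: int) -> int:
    """Round a size request up to a whole number of cache ways."""
    way_kb = cache_kb // num_ways
    return max(1, -(-request_kb // way_kb))  # ceiling division


def find_free_mask(used_mask: int, ways: int, num_ways: int) -> Optional[int]:
    """Return the lowest contiguous run of `ways` free bits, or None."""
    run = (1 << ways) - 1
    for shift in range(num_ways - ways + 1):
        candidate = run << shift
        if candidate & used_mask == 0:
            return candidate
    return None


# Hypothetical 20 MiB, 20-way L3 with the two lowest ways already reserved.
ways = ways_needed(3000, 20480, 20)
mask = find_free_mask(0b11, ways, 20)
print(ways, hex(mask))
```

A real implementation would additionally need the locking and default-group adjustments listed in the quoted mail.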


Regards,
Daniel


Re: [systemd-devel] deny access to GPU devices

2016-11-14 Thread Daniel P. Berrange
On Mon, Nov 14, 2016 at 12:35:17PM +0100, Lennart Poettering wrote:
> On Sat, 12.11.16 07:43, Topi Miettinen (toiwo...@gmail.com) wrote:
> 
> > On 11/11/16 20:09, Lennart Poettering wrote:
> > > I have no idea what "slurm" is, but do note that the "devices" cgroup
> > > controller has no future, it is unlikely to ever become available in
> > > cgroupsv2.
> > 
> > This is unwelcome news, I think it is a simple and well contained MAC
> > that has been available in systems without a full blown MAC like SELinux
> > and with systemd support it has been very easy to set up. What will
> > happen to DevicePolicy, DeviceAllow etc. directives? Or will systemd
> > stick to cgroupsv1 forever?
> 
> No, our plan is to switch to cgroupsv2 as default as quickly as we
> can. Where "quickly as we can" means mostly: the "cpu" controller is
> ported to cgroupsv2 in vanilla kernels.
> 
> The thing with the "devices" cgroup controller is that it is not about
> resource control, but about access control, and hence should not live
> in "cgroups" at all, but in some other framework.  "cgroups" is all
> about dynamic resource control and accounting, but "devices" doesn't
> fit that at all, hence it should move elsewhere.
> 
> We'll keep DeviceAllow/DevicePolicy around for now, and there's a TODO
> list item to implement at least the "m" part of it via seccomp, as a
> second level of protection that will still work even if cgroupsv2 is
> used. I think in the long run it might make sense to also do the "rw"
> part of it somehow in the kernel, via some new kernel subsystem, but
> we'll have to see if and how this will be implemented.

Since there is support for stackable LSMs now, I could see the cgroup
devices ACL feature being replaced with a new LSM. I imagine if stackable
LSMs had been supported back in cgroup v1 days, it probably would have
been done that way in the first place instead of adding MAC to cgroups.

Regards,
Daniel


Re: [systemd-devel] [libvirt] How to make udev not touch my device?

2016-11-11 Thread Daniel P. Berrange
On Fri, Nov 11, 2016 at 05:01:40PM +0100, Michal Sekletar wrote:
> On Fri, Nov 11, 2016 at 2:20 PM, Daniel P. Berrange <berra...@redhat.com> wrote:
> 
> > What kind of issues ?
> 
> General problem with manually created device nodes is that udev and
> systemd do not know about them. Device units do not exist for these
> device nodes. Hence these device units can not be a dependency of some
> other unit. Typical example is manually created device node referenced
> from /etc/fstab. Then corresponding mount unit is bound to a device
> that never shows up and hence it always fails to mount even though
> device node is there.

Ok, that sounds irrelevant to libvirt's usage wrt QEMU, so I don't
see any problem for us here.

Regards,
Daniel


Re: [systemd-devel] [libvirt] How to make udev not touch my device?

2016-11-11 Thread Daniel P. Berrange
On Fri, Nov 11, 2016 at 02:15:38PM +0100, Michal Sekletar wrote:
> On Mon, Nov 7, 2016 at 1:20 PM, Daniel P. Berrange <berra...@redhat.com> wrote:
> 
> > So if libvirt creates a private mount namespace for each QEMU and mounts
> > a custom /dev there, this is invisible to udev, and thus udev won't/can't
> > mess with permissions we set in our private /dev.
> >
> > For hotplug, the libvirt QEMU would do the same as the libvirt LXC driver
> > currently does. It would fork and setns() into the QEMU mount namespace
> > and run mknod()+chmod() there, before doing the rest of its normal hotplug
> > logic. See lxcDomainAttachDeviceMknodHelper() for what LXC does.
> 
> We try to migrate people away from using mknod and messing with /dev/
> from user-space. For example, we had to deal with non-trivial problems
> wrt. mknod and Veritas storage stack in the past (most of these issues

What kind of issues ? 

> remain unsolved to date). I don't like to hear that you plan to get
> into /dev management business in libvirt too. I am judging based on
> past experiences, nevertheless, I don't like this plan.

Libvirt is already doing this for its LXC driver, populating a private
/dev with only the devices permitted for the container in question.

> Also, managing separate mount namespace for each qemu process and
> forking helper that joins the namespace to do some work seems quite
> complex too.

Again, libvirt is already doing this for LXC so it's not any great
burden.

Regards,
Daniel


Re: [systemd-devel] [libvirt] How to make udev not touch my device?

2016-11-07 Thread Daniel P. Berrange
On Mon, Nov 07, 2016 at 01:11:14PM +0100, Michal Privoznik wrote:
> On 07.11.2016 10:17, Daniel P. Berrange wrote:
> > On Fri, Nov 04, 2016 at 08:47:34AM +0100, Michal Privoznik wrote:
> >> Hey udev developers,
> >>
> >> I'm a libvirt developer and I've been facing an interesting issue
> >> recently. Libvirt is a library for managing virtual machines and as such
> >> allows basically any device to be exposed to a virtual machine. For
> >> instance, a virtual machine can use /dev/sdX as its own disk. Because of
> >> security reasons we allow users to configure their VMs to run under
> >> different UID/GID and also SELinux context. That means that whenever a
> >> VM is being started up, libvirtd (our daemon we have) relabels all the
> >> necessary paths that QEMU process (representing VM) can touch.
> >> However, I'm facing an issue that I don't know how to fix. In some cases
> >> QEMU can close & reopen a block device. However, closing a block device
> >> triggers an event and hence if there is a rule that sets a security
> >> label on a device the QEMU process is unable to reopen the device again.
> >>
> >> My question is, what can we do to prevent udev from meddling with the
> >> security labels that we've set on the devices?
> >>
> >> One of the ideas our lead developer had was for libvirt to set some kind
> >> of udev label on devices managed by libvirt (when setting up security
> >> labels) and then whenever udev sees such labelled device it won't touch
> >> it at all (this could be achieved by a rule perhaps?). Later, when
> >> domain is shutting down libvirt removes that label. But I don't think
> >> setting an arbitrary label on devices is supported, is it?
> > 
> > Having thought about this over the weekend, I'm strongly inclined to
> > just take udev out of the equation by starting a new mount namespace
> > for each QEMU we launch and setting up a custom /dev containing just
> > the devices we need. This will be both a security improvement and
> > avoid the udev races, with no complex code required in libvirt and
> > will work for libvirt all the way back to RHEL6
> 
> How would this work with device hotplug, i.e. I start a domain with some
> set of devices. Then I bring up an iSCSI target (which appears under
> /dev) and how does one 'transfer' the device into the new namespace?
> BTW: can you elaborate more on udev-namespace relations? Doesn't udev
> run in the namespaces too?

A single process can only ever be in a single namespace at any point in
time and udev only ever runs in the initial namespaces. When running
containers you never have udev inside them, and udev certainly doesn't
interact with arbitrary namespaces created by other applications for
their own purposes.

So if libvirt creates a private mount namespace for each QEMU and mounts
a custom /dev there, this is invisible to udev, and thus udev won't/can't
mess with permissions we set in our private /dev.

For hotplug, the libvirt QEMU driver would do the same as the libvirt LXC driver
currently does. It would fork and setns() into the QEMU mount namespace
and run mknod()+chmod() there, before doing the rest of its normal hotplug
logic. See lxcDomainAttachDeviceMknodHelper() for what LXC does.
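
The fork + setns() + mknod() sequence can be sketched as follows. This is an illustration only, not libvirt's code (which is written in C); the helper names are made up, `os.setns` and `os.CLONE_NEWNS` require Python 3.12 or newer, and the second function needs root privileges, so it is defined here but not called.

```python
import os
import stat
from typing import Tuple


def describe_node(path: str) -> Tuple[str, int, int]:
    """Return (kind, major, minor) of a device node on the host side."""
    st = os.stat(path)
    if stat.S_ISBLK(st.st_mode):
        kind = "b"
    elif stat.S_ISCHR(st.st_mode):
        kind = "c"
    else:
        raise ValueError(f"{path} is not a device node")
    return kind, os.major(st.st_rdev), os.minor(st.st_rdev)


def mknod_in_mount_ns(pid: int, path: str, kind: str,
                      major: int, minor: int, mode: int = 0o600) -> bool:
    """Fork a helper that joins the mount namespace of `pid` and creates
    the node there. Returns True if the helper exited successfully."""
    child = os.fork()
    if child == 0:
        status = 1
        try:
            fd = os.open(f"/proc/{pid}/ns/mnt", os.O_RDONLY)
            os.setns(fd, os.CLONE_NEWNS)  # enter the target's mount namespace
            os.close(fd)
            type_bits = stat.S_IFBLK if kind == "b" else stat.S_IFCHR
            os.mknod(path, type_bits | mode, os.makedev(major, minor))
            os.chmod(path, mode)  # mknod() is subject to umask, so set mode again
            status = 0
        finally:
            os._exit(status)  # never return into the parent's code path
    _, wait_status = os.waitpid(child, 0)
    return os.waitstatus_to_exitcode(wait_status) == 0


print(describe_node("/dev/null"))
```

Doing the work in a forked child keeps the parent daemon in its original namespace, which is why the helper approach is used rather than calling setns() in the main process.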

Regards,
Daniel


Re: [systemd-devel] [libvirt] How to make udev not touch my device?

2016-11-07 Thread Daniel P. Berrange
On Fri, Nov 04, 2016 at 08:47:34AM +0100, Michal Privoznik wrote:
> Hey udev developers,
> 
> I'm a libvirt developer and I've been facing an interesting issue
> recently. Libvirt is a library for managing virtual machines and as such
> allows basically any device to be exposed to a virtual machine. For
> instance, a virtual machine can use /dev/sdX as its own disk. Because of
> security reasons we allow users to configure their VMs to run under
> different UID/GID and also SELinux context. That means that whenever a
> VM is being started up, libvirtd (our daemon we have) relabels all the
> necessary paths that QEMU process (representing VM) can touch.
> However, I'm facing an issue that I don't know how to fix. In some cases
> QEMU can close & reopen a block device. However, closing a block device
> triggers an event and hence if there is a rule that sets a security
> label on a device the QEMU process is unable to reopen the device again.
> 
> My question is, what can we do to prevent udev from meddling with the
> security labels that we've set on the devices?
> 
> One of the ideas our lead developer had was for libvirt to set some kind
> of udev label on devices managed by libvirt (when setting up security
> labels) and then whenever udev sees such labelled device it won't touch
> it at all (this could be achieved by a rule perhaps?). Later, when
> domain is shutting down libvirt removes that label. But I don't think
> setting an arbitrary label on devices is supported, is it?

Having thought about this over the weekend, I'm strongly inclined to
just take udev out of the equation by starting a new mount namespace
for each QEMU we launch and setting up a custom /dev containing just
the devices we need. This will be both a security improvement and
avoid the udev races, with no complex code required in libvirt and
will work for libvirt all the way back to RHEL6

Regards,
Daniel


Re: [systemd-devel] Fedora 25, cgroups V2 and systemd roadmap

2016-10-18 Thread Daniel P. Berrange
On Tue, Oct 11, 2016 at 11:30:40PM +0300, Kevin Wilson wrote:
> Hello, Daniel,
> 
> > We don't want to support out of tree kernel patches,
> 
> This sounds very reasonable, I don't have anything against this policy.
> 
> Still, I wonder: are you ruling out implementing "hybrid mode" (like
> Lennart uses in systemd) for libvirt? I mean a mode where you will use
> the 3 currently supported cgroup V2 controllers for libvirt (memory,
> io and pids; actually I don't know if you use the cgroups pids at all
> in libvirt, it is a new controller; BTW - do you ? ). And using other
> controllers (besides io, memory and pids) from cgroup V1

A controller can only be used in one mode at any time - so while libvirt
could potentially support using some in v1 mode and some in v2 mode, it
only works if the OS distro has actually setup those controllers in v1
mode - if they're in v2 mode, we'd be forced to use them in v2 mode.
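
A sketch of how an application could find out which mode each controller is in on a given host. It assumes the usual inputs: the contents of /proc/self/mountinfo, the root `cgroup.controllers` file of the cgroup2 mount, and a set of known controller names (for instance taken from /proc/cgroups). The function name and the sample data are hypothetical, and this is not how libvirt itself does it.

```python
from typing import Dict, Set


def classify_controllers(mountinfo: str, v2_controllers: str,
                         known: Set[str]) -> Dict[str, str]:
    """Map each controller name to 'v1' or 'v2'."""
    modes: Dict[str, str] = {}
    for line in mountinfo.splitlines():
        if " - " not in line:
            continue
        # After the ' - ' separator: fstype, mount source, super options.
        fields = line.split(" - ", 1)[1].split()
        if len(fields) < 3 or fields[0] != "cgroup":
            continue
        for opt in fields[2].split(","):
            if opt in known:
                modes[opt] = "v1"
    # Controllers listed in the v2 root are available to the unified hierarchy.
    for name in v2_controllers.split():
        modes[name] = "v2"
    return modes


sample_mountinfo = (
    "30 25 0:26 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid shared:10 - cgroup cgroup rw,cpu,cpuacct\n"
    "31 25 0:27 / /sys/fs/cgroup/unified rw shared:11 - cgroup2 cgroup2 rw\n"
)
print(classify_controllers(sample_mountinfo, "io memory pids\n",
                           {"cpu", "cpuacct", "io", "memory", "pids"}))
```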

Regards,
Daniel


Re: [systemd-devel] Fedora 25, cgroups V2 and systemd roadmap

2016-10-11 Thread Daniel P. Berrange
On Mon, Oct 10, 2016 at 05:30:35PM +, Jóhann B. Guðmundsson wrote:
> On 10/10/2016 04:46 PM, Lennart Poettering wrote:
> 
> > I still hope that Fedora can go the Facebook route, and just patch the
> > stuff in, and ignore the fight going on in the kernel community.
> 
> That won't fly by the kernel sub community in Fedora in which they are doing
> whatever they can not having to carry out of tree patches and wind up in the
> same scenario they have been in with "Secure Boot" for the past what 3 - 5
> years now.
> 
> I'm pretty sure that every downstream distribution has already realized that
> the longer they carry patch or patches that exist out of tree, the harder
> they get to maintain without extra support as in additional manpower in
> maintaining the kernel for that distribution and will also choose not to
> carry those patches.

Yeah, it won't really fly from libvirt POV either. We don't want to support
out of tree kernel patches, because history has shown that causes long term
pain in the (fairly likely) event that the patches get changed before finally
merging.

Regards,
Daniel


Re: [systemd-devel] Fedora 25, cgroups V2 and systemd roadmap

2016-10-10 Thread Daniel P. Berrange
On Mon, Oct 10, 2016 at 02:43:33PM +0200, Lennart Poettering wrote:
> On Mon, 10.10.16 14:31, Kevin Wilson (wkev...@gmail.com) wrote:
> 
> > Hello, systemd developers,
> > So we have now 3 V2 cgroups controller in the kernel (pids, memory and io).
> > The CPU controller as of now is not merged in and is available only in
> > an out of tree git repo (due to some debate over
> > it with kernel scheduler developers). Not sure that it will be merged
> > in the next 2 months.
> > 
> > Fedora 25 is to be released in a month and a half, on 15 of November.
> > https://fedoraproject.org/wiki/Releases/25/Schedule
> > My questions are:
> > what are the intentions regarding using cgroup v2 in systemd  in F25
> > as the default instead of using cgroup V1?
> > Is the absence of the CPU controller a reason for not having
> > cgroup V2 as a default in F25 ? and if so, why ?
> 
> I'd like to switch this over sooner rather than later in Fedora, but I
> figure we can't do that, unless relevant other upstreams can deal with
> the new hierarchy too. I figure on Fedora, that'd be at least libvirt
> and Docker that need to be updated for this.
> 
> I figure we should start turning this on in Rawhide, and see what
> breaks, and then revert before the release.
> 
> Before we can tell Docker/libvirt to port their stuff over I figure we
> also need one more addition in the systemd API for this: next to
> Delegate=yes|no (which we already have) we probably need to add
> DelegateController= taking a list of all controllers to
> delegate. Right now we delegate all controllers, but I figure that
> should be configurable, since turning on a controller might have
> effects people don't expect (in particular for the cpu hierarchy).

From the libvirt POV, getting the CPU controller support merged is
a blocking item, otherwise we have a major feature regression. We
have a policy of not supporting code that is out of tree, as we don't
like getting burnt when changes are inevitably made after it is finally
accepted.

Regards,
Daniel


Re: [systemd-devel] Fedora 25, cgroups V2 and systemd roadmap

2016-10-10 Thread Daniel P. Berrange
On Mon, Oct 10, 2016 at 02:31:55PM +0300, Kevin Wilson wrote:
> Hello, systemd developers,
> So we have now 3 V2 cgroups controller in the kernel (pids, memory and io).
> The CPU controller as of now is not merged in and is available only in
> an out of tree git repo (due to some debate over
> it with kernel scheduler developers). Not sure that it will be merged
> in the next 2 months.
> 
> Fedora 25 is to be released in a month and a half, on 15 of November.
> https://fedoraproject.org/wiki/Releases/25/Schedule
> My questions are:
> what are the intentions regarding using cgroup v2 in systemd  in F25
> as the default instead of using cgroup V1?
> Is the absence of  the CPU controller is a reason for not having
> cgroup V2 as a default in F25 ? and if so, why ?

Ignoring the question of Fedora switching, more generally, if any OS were
to switch to cgroup v2 right now it would break a number of applications
that use cgroups v1 today. v2 is not a plain no-op drop-in replacement for
v1, as they have pretty different rules around the hierarchy management.
Applications that create/manage cgroups properties/dirs need to be manually
ported, not merely systemd itself. The absence of CPU controller support
would also be a functional regression for some applications, effectively
preventing use of cgroup v2 even if they were ported.

Regards,
Daniel


Re: [systemd-devel] machined: after CPU offline then online, vcpupin KVM guest failed to start

2016-08-05 Thread Daniel P. Berrange
On Fri, Aug 05, 2016 at 12:33:21PM +0200, Dr. Werner Fink wrote:
> On Fri, Aug 05, 2016 at 11:07:50AM +0200, Lennart Poettering wrote:
> > On Thu, 04.08.16 16:19, Cedric Bosdonnat (cbosdon...@suse.com) wrote:
> > 
> > > Hi Lennart and Werner,
> > > 
> > > On Wed, 2016-08-03 at 16:56 +0200, Lennart Poettering wrote:
> > > > On Wed, 03.08.16 14:46, Dr. Werner Fink (werner at suse.de) wrote:
> > > > > problem with v228 (and I guess this is also later AFAICS from logs of
> > > > > current git) that repeating CPU hotplug events (offline/online). The
> > > > > root cause is that cpuset.cpus become not restored by machined.
> > > > > Please note that libvirt can not do this as it is not allowed to do
> > > > > so.
> > > > 
> > > > This is a limitation of the kernel cpuset interface, and it's one of
> > > > the reasons we do not expose cpusets at all in systemd right
> > > > now. Thankfully, there's an alternative to cpusets, which is the CPU
> > > > affinity controls exposed via CPUAffinity= in systemd, which do much
> > > > of the same, but have less borked semantics.
> > > > 
> > > > We'd like to support cpusets directly in systemd, but we don't do this
> > > > as long as the kernel interfaces are as borked as they are. For
> > > > example, cpusets are flushed out entirely currently when the system
> > > > goes through a suspend/resume cycle.
> > > > 
> > > > If libvirt has hook-ups with cpuset, then it bypasses systemd for
> > > > that.
> > > 
> > > I guess by CPU affinity you mean sched_setaffinity and friends. If that is
> > > the case, then this is constrained by cpuset too as mentioned here:
> > > 
> > > http://www.mjmwired.net/kernel/Documentation/cpusets.txt#53
> > > 
> > > As long as the machine.slice cpuset isn't restored after onlining a CPU again,
> > > then libvirt won't be able to set either the affinity or the cpuset if it
> > > contains that CPU.
> > > 
> > > May be the kernel's behaviour is weird and can be discussed, but libvirt can't
> > > do anything on that bug.
> > 
> > Yeah, to make this clear: I do not blame libvirt for this borkedness
> > at all. I blame the kernel.
> 
> Hmmm ... IMHO it is useless to pass the buck from kernel to user space
> as well do the same from user space back to kernel.  I've an open bug
> from a customer and this bug requires a solution.  AFAICS libvirt can
> not do this but machined could do.

It is not simply a problem with virtual machines; it affects any application
which is using the cpuset controller - VMs are just one such user. So it would
be inappropriate to do it in machined.

Fixing it in userspace is complicated by the fact that different levels or
branches in the cgroup hierarchy are managed by different applications, with
no single application having a single world view. Even if systemd itself did
have support for the cpuset controller, it would still not have a global
view of all cgroups, as applications can create further child cgroups
below the groups managed by systemd, which systemd doesn't track.

Trying to restore correct CPU affinity after hotplug would thus require that
multiple userspace applications all be aware of the problem and contain
logic to fix their part of the hierarchy. This is further complicated by
the ordering constraints that would require top levels to be fixed before
child levels.

Bearing all this in mind, fixing it in userspace is an incredibly hard
problem which will always be liable to race conditions between applications.

The only practical choices are a) not using the cpuset controller
at all, or b) fixing the kernel so that it maintains 2 distinct bitmaps,
one for the set of online CPUs, and one for the configured affinity in the
cpuset, and thus avoids throwing away data on CPU unplug/plug.
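
As a rough illustration of option b), here is a minimal Python sketch of the
two-bitmap idea. It is only a toy model, not kernel code: the class and method
names are made up for this example, and CPU masks are modelled as plain
Python sets.

```python
class LossyCpuset:
    """Toy model of the current behaviour: unplug edits the configured set."""

    def __init__(self, cpus):
        self.cpus = set(cpus)

    def cpu_offline(self, cpu):
        # The configured value itself is overwritten, so the intent is lost.
        self.cpus.discard(cpu)

    def cpu_online(self, cpu):
        # Nothing remembers that this CPU used to be allowed.
        pass


class TwoBitmapCpuset:
    """Toy model of option b): keep configuration and online state apart."""

    def __init__(self, configured, online):
        self.configured = frozenset(configured)  # what the admin asked for
        self.online = set(online)                # what is currently plugged in

    def cpu_offline(self, cpu):
        self.online.discard(cpu)

    def cpu_online(self, cpu):
        self.online.add(cpu)

    def effective(self):
        # The usable CPUs are derived on demand, never stored back.
        return self.configured & self.online
```

With the second model, taking a CPU offline and online again gives back the
original effective set without any userspace component having to restore it.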

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|


Re: [systemd-devel] Confining ALL processes to a CPUs/RAM via cpuset controller

2016-07-20 Thread Daniel P. Berrange
On Wed, Jul 20, 2016 at 03:29:30PM +0200, Lennart Poettering wrote:
> On Wed, 20.07.16 12:53, Daniel P. Berrange (berra...@redhat.com) wrote:
> 
> > For virtualized hosts it is quite common to want to confine all host OS
> > processes to a subset of CPUs/RAM nodes, leaving the rest available for
> > exclusive use by QEMU/KVM.  Historically people have used the "isolcpus"
> > kernel arg to do this, but last year that had its semantics changed, so
> > that any CPUs listed there also get excluded from load balancing by the
> > scheduler, making it quite useless in general non-real-time use cases
> > where you still want QEMU threads load-balanced across CPUs.
> > 
> > So the only option is to use the cpuset cgroup controller to confine
> > processes. AFAIK, systemd does not have explicit support for the cpuset
> > controller at this time, so I'm trying to work out the "optimal" way to
> > achieve this behind systemd's back while minimising the risk that future
> > systemd releases will break things.
> 
> Yes, we don't support this as of now, but we'd like to. The thing
> though is that the kernel interface for it is pretty borked as it is
> right now, and until that's not fixed we are unlikely going to support
> this in systemd. (And as I understood Tejun the mem vs. cpu thing in
> cpuset is probably not going to stay the way it is either)
> 
> But note that the non-cgroup CPUAffinity= setting should be good
> enough for many use cases. Are you sure that isn't sufficient for you?
> 
> Also note that systemd supports setting a system-wide CPUAffinity= for
> itself during early boot, thus leaving all unlisted CPUs free for
> specific services where you use CPUAffinity= to change this default.

Ah, interesting, I didn't notice you could set that globally.


> > The key factor here is use of "Before" to ensure this gets run immediately
> > after systemd switches root out of the initrd, and before /any/ long lived
> > services are run. This lets us set cpuset placement on systemd (pid 1)
> > itself and have that inherited by everything it spawns. I felt this is
> > better than trying to move processes after they have already started,
> > because it ensures that any memory allocations get taken from the right
> > NUMA node immediately.
> >
> > Empirically this approach seems to work on Fedora 23 (systemd 222) and
> > RHEL 7 (systemd 219), but I'm wondering if there's any pitfalls that I've
> > not anticipated here.
> 
> Yes, PID 1 was moved to the special scope unit init.scope as mentioned
> above (in preparation for cgroupsv2 where inner cgroups can never
> contain PIDs). This is likely going to break then.

cgroupsv2 is likely to break many things once distros switch over, so
I assume that wouldn't be done in a minor update - only in a major new
distro release, so it's not so concerning.

> But again, I have the suspicion that CPUAffinity= might already
> suffice for you?

Yep, it looks like it should suffice for most people, unless they also
wish to have memory node restrictions enforced from boot.

Regards,
Daniel


[systemd-devel] Confining ALL processes to a CPUs/RAM via cpuset controller

2016-07-20 Thread Daniel P. Berrange
For virtualized hosts it is quite common to want to confine all host OS
processes to a subset of CPUs/RAM nodes, leaving the rest available for
exclusive use by QEMU/KVM.  Historically people have used the "isolcpus"
kernel arg to do this, but last year that had its semantics changed, so
that any CPUs listed there also get excluded from load balancing by the
scheduler, making it quite useless in general non-real-time use cases
where you still want QEMU threads load-balanced across CPUs.

So the only option is to use the cpuset cgroup controller to confine
processes. AFAIK, systemd does not have explicit support for the cpuset
controller at this time, so I'm trying to work out the "optimal" way to
achieve this behind systemd's back while minimising the risk that future
systemd releases will break things.

As an example I have a host with 3 NUMA nodes and 12 CPUs, and want to have
all non-QEMU processes running on CPUs 0-2, leaving 3-11 available
for QEMU machines.

So far my best solution looks like this:

$ cat /etc/systemd/system/cpuset.service
[Unit]
Description=Restrict CPU placement
DefaultDependencies=no
Before=sysinit.target slices.target basic.target lvm2-lvmetad.service systemd-journald.service systemd-udevd.service

[Service]
Type=oneshot
KillMode=none
RemainAfterExit=yes
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/cpuset/system.slice
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/cpuset/machine.slice
ExecStartPre=/bin/bash -c '/usr/bin/echo "0-2" > /sys/fs/cgroup/cpuset/system.slice/cpuset.cpus'
ExecStartPre=/bin/bash -c '/usr/bin/echo "0" > /sys/fs/cgroup/cpuset/system.slice/cpuset.mems'
ExecStartPre=/bin/bash -c '/usr/bin/echo "3-11" > /sys/fs/cgroup/cpuset/machine.slice/cpuset.cpus'
ExecStartPre=/bin/bash -c '/usr/bin/echo "0-2" > /sys/fs/cgroup/cpuset/machine.slice/cpuset.mems'
ExecStartPost=/bin/bash -c '/usr/bin/echo 1 > /sys/fs/cgroup/cpuset/system.slice/tasks'
ExecStopPost=/usr/bin/rmdir /sys/fs/cgroup/cpuset/system.slice
ExecStart=/bin/true

[Install]
WantedBy=multi-user.target


The key factor here is use of "Before" to ensure this gets run immediately
after systemd switches root out of the initrd, and before /any/ long lived
services are run. This lets us set cpuset placement on systemd (pid 1)
itself and have that inherited by everything it spawns. I felt this is
better than trying to move processes after they have already started,
because it ensures that any memory allocations get taken from the right
NUMA node immediately.

Empirically this approach seems to work on Fedora 23 (systemd 222) and
RHEL 7 (systemd 219), but I'm wondering if there's any pitfalls that I've
not anticipated here.

Conceptually I'm aiming for "Before=*" to say it should run before
everything, but explicitly listing this set of units appears to be
the best I can do.

Any thoughts / feedback / suggestions welcome on how to improve this.

Regards,
Daniel


Re: [systemd-devel] Utility for persistent alternative driver binding

2015-12-08 Thread Daniel P. Berrange
On Tue, Dec 08, 2015 at 09:14:17AM -0500, Charles (Chas) Williams wrote:
> On Tue, 2015-12-08 at 11:34 +0200, Panu Matilainen wrote:
> > Hmm, got a pointer? I dont think PCI slots change between reboots 
> > without physically swapping hardware, the "ethX-problem" comes from the 
> > order of device discovery being unstable across boots, which is a 
> > different issue and not relevant for this case.
> 
> With virtual machines the ordering of PCI devices isn't always the same.
> This is especially true with OpenStack which regenerates the
> configurations on the fly.  The only thing that seems to be consistent
> is the MAC address.

The MAC address is not guaranteed to be unique of course - you can have
multiple NICs with the same MAC provided they're connected to different
subnets.

For OpenStack we're working on a feature that allows the user booting
an OpenStack instance to associate an arbitrary tag with each device
they have on their VM. The info about the tags and the currently assigned
device addresses is then exposed to the guest OS via the metadata service
and/or config drive and/or firmware. There will be a utility that can read
this metadata and then register tags against the corresponding devices in
the udev database.

The idea is that people building OpenStack compatible cloud images can
configure their image to look for the device in udev with the desired
user "tag" string. That way they don't need to care about specific
device addresses directly. More info on OpenStack plans is here:

  http://specs.openstack.org/openstack/nova-specs/specs/mitaka/approved/virt-device-role-tagging.html

Regards,
Daniel


Re: [systemd-devel] Utility for persistent alternative driver binding

2015-12-08 Thread Daniel P. Berrange
On Tue, Dec 08, 2015 at 09:46:13AM -0500, Charles (Chas) Williams wrote:
>  On Tue, 2015-12-08 at 14:21 +0000, Daniel P. Berrange wrote:
> > On Tue, Dec 08, 2015 at 09:14:17AM -0500, Charles (Chas) Williams wrote:
> > > On Tue, 2015-12-08 at 11:34 +0200, Panu Matilainen wrote:
> > > > Hmm, got a pointer? I dont think PCI slots change between reboots 
> > > > without physically swapping hardware, the "ethX-problem" comes from the 
> > > > order of device discovery being unstable across boots, which is a 
> > > > different issue and not relevant for this case.
> > > 
> > > With virtual machines the ordering of PCI devices isn't always the same.
> > > This is especially true with OpenStack which regenerates the
> > > configurations on the fly.  The only thing that seems to be consistent
> > > is the MAC address.
> > 
> > The MAC address is not guaranteed to be unique of course - you can have
> > multiple NICs with the same MAC provided they're connected to different
> > subnets.
> 
> Typically not done but yes it can happen.  I was referring specifically
> to the OpenStack case though where it causes the most trouble for me.
> 
> > For OpenStack we're working on a feature that allows the user booting
> > an OpenStack instance to associate an arbitrary tag with each device
> > they have on their VM. The info about the tags and the currently assigned
> > device addresses is then exposed to the guest OS via the metadata service
> > and/or config drive and/or firmware. There will be a utility that can read
> > this metadata and then register tags against the corresponding devices in
> > the udev database.
> > 
> > The idea is that people building OpenStack compatible cloud images can
> > configure their image to look for the device in udev with the desired
> > user "tag" string. That way they don't need to care about specific
> > device addresses directly. More info on OpenStack plans is here:
> > 
> >   http://specs.openstack.org/openstack/nova-specs/specs/mitaka/approved/virt-device-role-tagging.html
> 
> That would be a huge step forward but I need this to work when someone
> hotplugs an interface to a running image.  The config drive doesn't work
> here since it wouldn't have the tag information.  The metadata service
> might work (although I think cloud-init blackholes the metadata service
> after boot).  Exposing this via the PCI device with some ACPI mechanism
> would be nice.

Yep, config drive is obviously useless for hotplug, but metadata service
is usable in the short term. There is QEMU work to let us expose data
via the firmware, but I'm not sure that really helps hotplug. The ideal would
really be a virtio based filesystem like virtio-9p except without the suckiness
of 9p - a future virtio-nfs might be best.

Regards,
Daniel


Re: [systemd-devel] RFC: removing initctl support

2015-09-24 Thread Daniel P. Berrange
On Thu, Sep 24, 2015 at 03:51:16PM +0200, Tomasz Torcz wrote:
> On Thu, Sep 24, 2015 at 03:01:21PM +0200, Lennart Poettering wrote:
> > That stackexchange link lists a pile of garbage. We have an official
> > API to check whether the system is booted with systemd:
> > sd_booted(). It's documented here:
> > 
> > http://www.freedesktop.org/software/systemd/man/sd_booted.html
> > 
> > And we even document on that man page what precisely it does
> > internally (which is equivalent to access("/run/systemd/system/",
> > F_OK) >= 0) and suggest people to reimplement that simple check in the
> > language of their choice, even in shell... That way, they don't even
> > have to link against libsystemd.
> 
>   And then there is this sabotage:
> 
> „This check is already broken, because uselessd creates this directory too”
> 
> uselessd% git grep 'mkdir.*/run/systemd/system'
> src/core/main-no-init.c:mkdir_label("/run/systemd/system", 0755);
> src/core/mount-setup.c:mkdir_label("/run/systemd/system", 0755);
> src/core/unit.c:mkdir_p("/run/systemd/system", 0755);

Well if it wants to claim to be systemd, then it is responsible for
providing the same API/ABIs as systemd. IOW it must respond to SIGRTMIN+4
in the same way, etc.


Regards,
Daniel


Re: [systemd-devel] RFC: removing initctl support

2015-09-22 Thread Daniel P. Berrange
On Tue, Sep 22, 2015 at 02:31:25AM +0200, Lennart Poettering wrote:
> Heya!
> 
> Since a long time systemd has been shipping with two-way compat
> support for /dev/initctl, and I am tempted to remove it. Before I do
> so, I'd like some input on the relevance of this interface:
> 
> a) there's support in systemctl to reboot the system by sending the
>right bytes to /dev/initctl as fallback, so that you can reboot a
>sysvinit system with "systemctl reboot".
> 
> b) There's a mini-daemon "systemd-initctl.service" that is
>fifo-activated on /dev/initctl, and forwards reboot requests from
>old sysvinit clients to systemd.
> 
> Both of this was supposed to help transition between sysvinit and
> systemd systems: if you mix sysvinit clients with a systemd init
> system and vice versa, you can still use the the tools to reboot the
> other system.
> 
> I'd claim the interface is borderline useless: the only operation you
> can actually readlly properly dispatch with it is rebooting, and
> reloading PID1. And that's pretty much it.
> 
> We never even really used this stuff on Fedora properly (since we
> actually transitioned from Upstart, not sysvinit, and we never had the
> same level of compat for that...).
> 
> This code has been bitrotting for a while, and nobody really cared.
> 
> And most importantly: the entire protocol use by sysvinit via
> /dev/initctl is deeply flawed, since it sends messages over
> /dev/initctl that are not a divisor of PIPE_SIZE in length. Thus, if
> PID 1 didn't read messages quick enough the messages queued could be
> half-written and be partially interleaved with another client's
> messages, and there is no way the system can ever recover from that.
> 
> Thus, I'd really like to kill this. Does anybody care about it, and
> can give me a strong enough reason to keep this anyway?

The libvirt virDomainShutdown|Reboot APIs for triggering controlled
shutdown/reboots of guest OS have support for using /dev/initctl with
containers, as it was the lowest common denominator that easily worked
across systemd, sysvinit & upstart.

We could add further code to use a systemd specific interface if
needed, so it wouldn't be the end of the world if /dev/initctl was
removed, but it'd be nice to not have to do that.

Regards,
Daniel


Re: [systemd-devel] RFC: removing initctl support

2015-09-22 Thread Daniel P. Berrange
On Tue, Sep 22, 2015 at 12:48:21PM +0200, Lennart Poettering wrote:
> On Tue, 22.09.15 11:41, Daniel P. Berrange (berra...@redhat.com) wrote:
> 
> > > One more addendum to the original mail:
> > > 
> > > We already declared the interface "obsolete" in the docs, which makes
> > > me particularly keen on dropping it...
> > 
> > I guess one thing is that even if support for /dev/intctl in systemd,
> > it is an optional unit file, so libvirt probably needs to deal with
> > the SIGRTMIN+4 stuff anyway, for case where the person building
> > the container has that unit file disabled. So from that POV, deleting
> > it won't make current situation that much worse.
> 
> Also, we support builds with all legacy cruft disabled, which is
> something where inictl currently is not disabled, but if we kept it
> should really be disable under... So I figure even if we keep the
> general support in this is not an interface you can rely on...
> 
> To make the SIGRTMIN+4 reliable all you need to do is check first if
> /run/systemd/system/ exists.

Ok, that's easy enough. So no objection from libvirt if you remove
/dev/initctl in future releases.
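
For reference, a minimal sketch of that check followed by the signal, written
in Python. The helper names and the `root` parameter are hypothetical and only
exist for this example; for a container one would presumably test the path
under the container's root filesystem rather than the host's, which is an
assumption on my part. SIGRTMIN+4 is the signal systemd documents for
starting poweroff.target.

```python
import os
import signal


def booted_with_systemd(root="/"):
    """Rough equivalent of sd_booted(): does <root>/run/systemd/system/ exist?"""
    return os.path.isdir(os.path.join(root, "run/systemd/system"))


def request_poweroff(init_pid, root="/"):
    """Send SIGRTMIN+4 to init, but only if the target looks like systemd.

    Returns False without signalling when the directory check fails, so a
    non-init program running as PID 1 never receives the signal.
    """
    if not booted_with_systemd(root):
        return False
    os.kill(init_pid, signal.SIGRTMIN + 4)
    return True
```

The directory check comes first on purpose: the signal only has a defined
meaning when the receiving PID 1 is systemd.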


Regards,
Daniel


Re: [systemd-devel] RFC: removing initctl support

2015-09-22 Thread Daniel P. Berrange
On Tue, Sep 22, 2015 at 12:32:25PM +0200, Lennart Poettering wrote:
> On Tue, 22.09.15 10:11, Daniel P. Berrange (berra...@redhat.com) wrote:
> 
> > > And most importantly: the entire protocol use by sysvinit via
> > > /dev/initctl is deeply flawed, since it sends messages over
> > > /dev/initctl that are not a divisor of PIPE_SIZE in length. Thus, if
> > > PID 1 didn't read messages quick enough the messages queued could be
> > > half-written and be partially interleaved with another client's
> > > messages, and there is no way the system can ever recover from that.
> > > 
> > > Thus, I'd really like to kill this. Does anybody care about it, and
> > > can give me a strong enough reason to keep this anyway?
> > 
> > The libvirt virDomainShutdown|Reboot APIs for triggering controlled
> > shutdown/reboots of guest OS have support for using /dev/initctl with
> > containers, as it was the lowest common denominator that easily worked
> > across systemd, sysvinit & upstart.
> 
> Ah, I see... But I wasn't aware Upstart even implemented that...

Maybe it wasn't actually upstart, but one of the other init systems.
I just recall getting a patch from Debian folks to support it via
the /run/initctl path, rather than /dev, and assumed that was upstart
related.

> > We could add further code to use a systemd specific interface if
> > needed, so it wouldn't be the end of the world of /dev/initctl was
> > removed, but it'd be nice to not have todo that.
> 
> A simple fall back could be to send SIGRTMIN+4 to PID 1, if
> /dev/initctl is not around.

Yep, though we'd have to actually check that PID 1 is systemd, since
if you run a container with a non-init program as PID 1, we don't
want to be sending it SIGRTMIN+4 :-)

> One more addendum to the original mail:
> 
> We already declared the interface "obsolete" in the docs, which makes
> me particularly keen on dropping it...

I guess one thing is that even if support for /dev/initctl stays in systemd,
it is an optional unit file, so libvirt probably needs to deal with
the SIGRTMIN+4 stuff anyway, for the case where the person building
the container has that unit file disabled. So from that POV, deleting
it won't make the current situation that much worse.

Regards,
Daniel


Re: [systemd-devel] How to set time from Perl

2015-09-07 Thread Daniel P. Berrange
On Mon, Sep 07, 2015 at 04:23:42PM +0200, Manuel Reimer wrote:
> Hello,
> 
> if I run the following code on an intel based platform, then I don't have
> any problems:
> 
>   use Net::DBus;
>   my $bus = Net::DBus->system();
>   my $logind = $bus->get_service('org.freedesktop.timedate1');
>   my $manager = $logind->get_object('/org/freedesktop/timedate1',
> 'org.freedesktop.timedate1');
>   $manager->SetTime($time * 1000000, 0, 0);
> 
> The variable "$time" is in seconds.
> 
> If I run this on an ARM based system, then I get the following time:
> 
>   # date
>   Thu Jan  1 01:00:02 CET 1970
> 
> Does someone have an idea why this doesn't work?

By "ARM system" do you mean 32-bit ARMv7, or 64-bit AArch64?

Based on the behaviour you describe, I'm thinking you are most
likely on 32-bit ARMv7.  Perl integers on 32-bit are only 32-bit
in length, and the SetTime() method needs a 64-bit integer,
since it is representing the time in microseconds. So when you
do $time * 1000000 you are probably getting integer truncation.

Net::DBus can deal with 64-bit integers, but you need to provide
them as the Perl string type, not integer type, so the Net::DBus XS
module can do a safe conversion to 64-bit without truncation.

So instead of doing

  $manager->SetTime($time * 1000000, 0, 0);

try doing

  $manager->SetTime($time . "000000", 0, 0);

which will convert $time to string type, and then append 6
zeros.
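
To show the arithmetic, here is a small Python sketch. It does not reproduce
Perl's exact overflow behaviour, which I have not verified; it only models
what happens if the microsecond value is forced into 32 bits, and why
building the value as a string avoids the problem. The function names are
made up for this example.

```python
UINT32_MASK = 0xFFFFFFFF


def usec_from_seconds(seconds):
    """The intended 64-bit value: seconds since the epoch, in microseconds."""
    return seconds * 1_000_000


def truncate_to_32bit(value):
    """Model of keeping only the low 32 bits of an integer."""
    return value & UINT32_MASK


def usec_string(seconds):
    """String form: the decimal seconds with six zeros appended."""
    return str(seconds) + "000000"
```

Any current timestamp in microseconds is far larger than 2**32, so the
truncated value cannot match the intended one, while parsing the string form
as a 64-bit integer gives exactly seconds * 1000000.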

Regards,
Daniel


Re: [systemd-devel] [ANNOUNCE] Git development moved to github

2015-06-02 Thread Daniel P. Berrange
On Tue, Jun 02, 2015 at 04:34:03PM +0200, Martin Pitt wrote:
> David Herrmann [2015-06-02 13:06 +0200]:
> > Our preferred way to send future patches is the github way. This
> > means sending pull-requests to the github repo. Furthermore, all
> > feature patches should go through pull-requests and should get
> > reviewed pre-commit. This applies to everyone. Exceptions are
> > non-controversial patches like typos and obvious bug-fixes.
> 
> Makes sense. On the operational level, should we use the
> automatically merge feature of git hub once approving? On the plus
> side it's very convenient, but you'll get one Merge commit for every
> PR (which is often just one commit), so we'd almost double the entries
> in git log. Or can github be told to not do that?
> 
> Merging manually is quite a bit of work, as you have to add a new
> remote every time, fetch that, and pull from it. But it does keep a
> cleaner git log history.

FWIW, 'git log --no-merges' displays the clean history when
merges are present.

Regards,
Daniel


Re: [systemd-devel] dynamic uid allocation (was: [PATCH] loopback setup in unprivileged containers)

2015-02-04 Thread Daniel P. Berrange
On Tue, Feb 03, 2015 at 06:05:00PM +0100, Lennart Poettering wrote:
 On Tue, 03.02.15 16:34, Serge Hallyn (serge.hal...@ubuntu.com) wrote:
 
the UID/GID on entire filesystem sub-trees given to containers with
userns is a real unpleasant thing to have to deal with. I'd not want
  
  Of course you would *not* want to take a stock rootfs where uid == 0
  and shift that into the container, as that would give root in the
  container a chance to write root-owned files on the host to leverage
  later in a convoluted attack :)  
 
 Is this really a problem? I mean, the only way how this could be
 exploitable is if people make the container hierarchy accessible to
 other users, but that should be easy to prohibit by making the
 container's parent dir 0700, which we already do for nspawn's
 container in /var/lib/machines... The only other risk I can see here
 is that if people use traditional ext4 quota, then the container's
 disk usage will be added to the host's usage. But that's easy to
 avoid, by simply never placing container images and the host on the
 same quota device...
 
 Also, in the case of systemd-nspawn we strongly emphasize usage with
 loopback devices. In that case there's no vulnerability at all, since
 the device is completely separate from the host fs, and it will only
 be mounted in the container, but not in the host...

NB, that the container filesystem is visible via /proc/$PID/root,
but I agree with you in general. I don't see a reason to avoid
the scenario Serge mentioned. Indeed I think it is important that
we explicitly support it, because ultimately I think we need to
be able to take any arbitrary disk image and safely boot it in
either a container or virtual machine. ie we should not have to
build custom images just for containers - any such need should be
considered a failure of the technology / impl IMHO.

  We might want to come up with a containers consensus that container
  rootfs's are always shipped with uid range 0-65535 -> 100000-165535.
  That still leaves a chance for container A (mapped to 200000-265535)
  to write valid setuid-root binary for container B (mapped to
  300000-365535), which isn't possible otherwise.  But that's better
  than doing so for host-root.
 
 Well, ultimately I'd recommend an automatism like this for container
 managers: 
 
a) if not otherwise configured, let's give each container their own
   16bit of uids. This would mean each 32bit uid could be neatly
   split into the upper 16bit that would become a container id,
   plus the lower 16bit for the actual virtual UID.
 
b) we will never set up UID ranges orthogonal from GID ranges.
 
c) when a container image is started, the container manager first
   checks the UID/GID owner of the root of the root file system. It
   masks the lower 16bit away, and only looks for the upper 16bit.
 
d) It will then look for an unused container id (which means, an
   unused range of 64K UIDs), and then shifts the offset it
   identified following c) to this new container id.
 
 With that in place it doesn't really matter which base people use in
 their containers, the container manager would do the right thing, and
 shift everything into the right place. Paranoid people could ship
 their container images shifted to some ID of their choice, and lazy
 folks could just ship their container images with base 0, but then
 must make sure they don't give anybody else access to the hierarchy,
 and don't confuse quota...
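
The scheme in a) to d) above can be sketched in a few lines. This is a minimal Python illustration under stated assumptions, not code from nspawn or any other container manager: the function names are hypothetical, and treating container id 0 as the host's own range is an assumption of this sketch.

```python
UID_BITS = 16
LOW_MASK = (1 << UID_BITS) - 1   # 0xFFFF, the "virtual UID" part


def image_base(root_owner_uid: int) -> int:
    """Step c: mask away the lower 16 bits of the rootfs owner's UID."""
    return root_owner_uid & ~LOW_MASK


def pick_container_id(in_use: set) -> int:
    """Step d: find an unused block of 64K UIDs (id 0 is left to the host)."""
    for container_id in range(1, 1 << UID_BITS):
        if container_id not in in_use:
            return container_id
    raise RuntimeError("no free container id")


def shift_uid(uid_on_disk: int, old_base: int, container_id: int) -> int:
    """Move a UID from the image's base into the block of container_id."""
    virtual_uid = uid_on_disk - old_base
    return (container_id << UID_BITS) | virtual_uid


# An image shipped at base 131072 (upper 16 bits == 2) containing a file
# owned by virtual UID 1000, started while ids 1 and 2 are already taken.
base = image_base(131072 + 1000)
new_id = pick_container_id({1, 2})
print(base, new_id, shift_uid(131072 + 1000, base, new_id))
```

Because only the upper 16 bits are inspected, an image shipped at base 0 and one shipped pre-shifted are handled by the same code path.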


Regards,
Daniel


Re: [systemd-devel] dynamic uid allocation (was: [PATCH] loopback setup in unprivileged containers)

2015-02-03 Thread Daniel P. Berrange
On Tue, Feb 03, 2015 at 03:41:22PM +0100, Lennart Poettering wrote:
 On Tue, 30.12.14 06:49, Simon Peeters (peeters.si...@gmail.com) wrote:
 
  2014-12-29 14:14 GMT+00:00 Tom Gundersen t...@jklm.no:
   On Mon, Dec 29, 2014 at 2:34 PM, Lennart Poettering
   lenn...@poettering.net wrote:
  snip
   I am open to adding support for this, but I think the allocation of
   the UID ranges should really happen automatically, and not be
   something the admin has to manually assign.
  
   Which means we'd enter dynamic UID allocation territory, and that
   opens a huge can of worms...
  
   Would we not also need to support explicit assignment, in case someone
   has a preexisting image they want to match in a specific way? In that
   case we could start off without the dynamic allocation and add that
   later. It certainly would make testing a lot simpler if we had userns
   support sooner rather than later (at least in the case of netlink it
   appears to be quite a mess).
  
  Inspired by this topic I wrote a quick'n'dirty uid allocator[1].
  This allocator manages the upper 2G uids, which, using Matthias Urlichs'
  example of 2048 uids per container, still allows for 1M containers.
  
  It currently can't persist these allocations, but that is on my
  0.0.1 todolist.
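
The "1M containers" figure follows directly from the numbers quoted. The Python sketch below only reproduces that arithmetic; it is not the allocator referred to above, and the names are hypothetical.

```python
UID_SPACE = 2**32             # 32-bit UIDs
MANAGED_START = 2**31         # the "upper 2G" of the UID space
UIDS_PER_CONTAINER = 2048


def max_containers() -> int:
    """Number of fixed-size ranges that fit into the managed half."""
    return (UID_SPACE - MANAGED_START) // UIDS_PER_CONTAINER


def uid_range(slot: int) -> tuple:
    """First and last UID of the range for a given slot index."""
    if not 0 <= slot < max_containers():
        raise ValueError("slot out of range")
    first = MANAGED_START + slot * UIDS_PER_CONTAINER
    return first, first + UIDS_PER_CONTAINER - 1


print(max_containers())       # 2**31 / 2**11 == 2**20
print(uid_range(0))
```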
 
 Hmm, so, I thought a lot about this in the past weeks. I think the way
 I'd really like to see this work in the end is that we never have to
 persist the UID mappings. This could work if the kernel would provide
 us with the ability to bind mount a file system into the container
 applying a fixed UID shift. That way, the shifted UIDs would never hit
 the actual disk, and hence we wouldn't have to persist their mappings.
 
 Instead on each container startup we'd look for a new UID range, and
 release it entirely when the container shuts down. The bind mount with
 UID shift would then shift the UIDs up, the userns stuff would shift
 it down from inside the container again.
 
 Of course, this all depends on whether the kernel will get an
 extension to apply uid shifts to bind mounts. I hear they want to
 provide this, but let's see.

I would dearly love to see that happen. Having to recursively change
the UID/GID on entire filesystem sub-trees given to containers with
userns is a really unpleasant thing to have to deal with. I'd not want
the filesystem UID shift to only apply to bind mounts though. It is
not uncommon to use a disk image[1] for a container's filesystem, so
being able to request a UID shift on *any* filesystem mount is pretty
desirable, rather than having to mount the image and then bind mount
it onto itself just to apply the UID shift.


Regards,
Daniel

[1] Using a separate disk image per container means a container can't
DOS other containers by exhausting inodes for example with $millions
of small files.


Re: [systemd-devel] [PATCH] perl-Net-DBus + new interactive authorization

2015-01-12 Thread Daniel P. Berrange
On Mon, Jan 12, 2015 at 11:37:12AM +0000, Colin Guthrie wrote:
 Angelo Naselli wrote on 12/01/15 10:30:
  Il 12/01/2015 10:16, Colin Guthrie ha scritto:
  Angelo Naselli wrote on 11/01/15 17:15:
  
  FWIW i rebuilt it in mageia 4 and libdbus1_3-1.6.18-1.8.mga4
  I haven't any issues of course. Using it as user
  for StartUnit/StopUnit for instance i got a  only a different exception
 
  org.freedesktop.DBus.Error.AccessDenied: Rejected send message
 
  while using as root worked as before even if i used new api.
 
  This is expected unless you have also backported cauldron systemd to
  MGA4. I think we discussed before that the interactive authorisation
  stuff was only added in more recent versions of systemd, so this is
  entirely what I'd expect here.
  eh eh eh, mine was only to say that even if i haven't disabled anything
  and compiled all against the old library i didn't see any crash or
  regression, just a different exception for the same -not working- thing.
  But i haven't test anything of course :)
 
 Oh, right! Gotcha, so this is just about the comment regarding the
 conditional compilation against older libdbus. Sorry for misinterpreting
 and thanks for testing that.
 
 OK, I'll have a further look at it to see if anything special is needed,
 perhaps the perl binding stuff just works without too much faff here for
 non-present APIs (I'm certainly not an expert with this stuff!).

I think you'll just need some #ifdef magic in the DBus.xs file to deal
with the new APIs being missing. Perhaps just write a stub function
in DBus.xs that raises a suitable perl error (see the _croak_error
source in DBus.xs for an example of raising errors).

Regards,
Daniel


Re: [systemd-devel] [PATCH] perl-Net-DBus + new interactive authorization

2015-01-12 Thread Daniel P. Berrange
On Mon, Jan 12, 2015 at 12:04:42PM +0000, Colin Guthrie wrote:
 Daniel P. Berrange wrote on 12/01/15 11:40:
  On Mon, Jan 12, 2015 at 11:37:12AM +0000, Colin Guthrie wrote:
  Angelo Naselli wrote on 12/01/15 10:30:
  Il 12/01/2015 10:16, Colin Guthrie ha scritto:
  Angelo Naselli wrote on 11/01/15 17:15:
 
  FWIW i rebuilt it in mageia 4 and libdbus1_3-1.6.18-1.8.mga4
  I haven't any issues of course. Using it as user
  for StartUnit/StopUnit for instance i got a  only a different exception
 
  org.freedesktop.DBus.Error.AccessDenied: Rejected send message
 
  while using as root worked as before even if i used new api.
 
  This is expected unless you have also backported cauldron systemd to
  MGA4. I think we discussed before that the interactive authorisation
  stuff was only added in more recent versions of systemd, so this is
  entirely what I'd expect here.
  eh eh eh, mine was only to say that even if i haven't disabled anything
  and compiled all against the old library i didn't see any crash or
  regression, just a different exception for the same -not working- thing.
  But i haven't test anything of course :)
 
  Oh, right! Gotcha, so this is just about the comment regarding the
  conditional compilation against older libdbus. Sorry for misinterpreting
  and thanks for testing that.
 
  OK, I'll have a further look at it to see if anything special is needed,
  perhaps the perl binding stuff just works without too much faff here for
  non-present APIs (I'm certainly not an expert with this stuff!).
  
  I think you'll just need some #ifdef magic in the DBus.xs file to deal
  with the new APIs being missing. Perhaps just write a stub function
  in the DBus.xs that just raises a suitable perl error (see _croak_error
  source in DBus.xs for example on raising errors)
 
 Perhaps, but I think in this case it would be better to simply silently
 ignore the error as this is more of a nice additional feature rather
 than a core part. I think if someone wrote some perl code that took
 advantage of this, they would prefer it would just work as expected
 rather than have any need to push up conditional checks into the calling
 perl code.

Sure, if it is semantically reasonable from an app's POV for it to be a
no-op on old DBus, that's fine too.

Daniel


Re: [systemd-devel] [PATCH] journalctl: allow customizable output formats

2014-10-24 Thread Daniel P. Berrange
On Wed, Oct 08, 2014 at 11:53:38PM +0200, Lennart Poettering wrote:
 On Mon, 22.09.14 16:33, Daniel P. Berrange (berra...@redhat.com) wrote:
 
  The current '--output FORMAT' argument defines a number of
  common output formats, but there are some useful cases it
  does not cover. In particular when reading application logs it
  is often desirable to display the code file name, line number
  and function name. Rather than defining yet more fixed output
  formats, this patch introduces user defined output formats.
  
  The format string is an arbitrary string which contains a
  mixture of literal text and variable substitutions. Each
  variable name corresponds to a journal field name. A variable
  name can be optionally followed by a data type, and in the
  case of string types, a length limit.
 
 Hmm, hmm, hmm.
 
 I am quite afraid about inventing a new template language for this. I
 can see the usecase though, and I sympathize with it.
 
 I am particularly afraid of the entire type thing. The fact that the
 journal is more or less typeless is after all by design: i really
 didn't want to invent a new type system. Adding this to the formatter
 now, kinda feels like adding it after all, but through the backdoor...
 
 So, I am not against this in general, but I'd really be careful with
 the language we define here, and try to make this as similar to an
 existing language (like the python/java one Zbigniew mentioned) as we
 can.
 
 Or even better, we already have a very limited formatting language in
 place, which is the specifier logic, that can replace %i, %f and
 such things in unit files, maybe we can build on this, and allow
 specifiers to take a field name as parameter. Then, if we really need
 formatters for different field types, we could just give them
 high-level characters or so?
 
 Hmm, also, we already have a really bad formatter in place for the
 journal catalog files (which only replaces @foo@ by the value of field
 foo). We should probably use the same code for this new journalctl
 formatter and the catalog code. In fact the catalog formatter could
 really use some improvement...

Ok, I didn't know about the catalog files until now, so I'll
investigate that and see what I can do about unifying the code
for these two options.

Do you consider the catalog file format to be part of the stable
ABI? I.e., do we need to preserve support for @foo@ if we took
the %s{FOO} approach?

 Maybe something like this:
 
 journalctl -O %t %s{CODE_FILE}:%s{CODE_LINE} %d{_SOURCE_REALTIME_TIMESTAMP}
 
 or something like that, where %t would simply map to the timestamp,
 and %s{} maps to a field name, and %d{} the same, but reformats the
 field as timestamp, assuming it is a UNIX timestamp formatted as
 number...
 
 Or something like that...

Yep, that would work for me. I'll cook up another patch to demonstrate
that approach along with catalog support.
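
To make the proposed syntax concrete, here is a minimal Python sketch of a formatter for it. It is not the patch and not journalctl code: the function names are hypothetical, mapping %t to __REALTIME_TIMESTAMP is an assumption, and timestamps are assumed to be microseconds since the epoch, rendered in UTC.

```python
import re
from datetime import datetime, timezone

# Matches %t, %s{FIELD} or %d{FIELD}
SPEC = re.compile(r"%(?:(t)|([sd])\{([A-Za-z0-9_]+)\})")


def usec_to_text(value: str) -> str:
    """Format a microseconds-since-epoch string as a UTC date string."""
    stamp = datetime.fromtimestamp(int(value) // 1_000_000, tz=timezone.utc)
    return stamp.strftime("%Y-%m-%d %H:%M:%S")


def render(fmt: str, entry: dict) -> str:
    """Expand the specifiers in fmt using the fields of one journal entry."""
    def expand(match):
        if match.group(1):                      # %t: the entry's own timestamp
            return usec_to_text(entry["__REALTIME_TIMESTAMP"])
        kind, field = match.group(2), match.group(3)
        value = entry.get(field, "")            # missing fields expand to ""
        if kind == "d" and value:               # %d{}: reformat as a date
            return usec_to_text(value)
        return value                            # %s{}: plain field value
    return SPEC.sub(expand, fmt)


entry = {
    "__REALTIME_TIMESTAMP": "86400000000",
    "CODE_FILE": "util/virlog.c",
    "CODE_LINE": "877",
}
print(render("%t %s{CODE_FILE}:%s{CODE_LINE}", entry))
```

Text that does not match a specifier is passed through untouched, so literal punctuation between fields needs no escaping in this sketch.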

I'm about to be travelling for KVM Forum / LinuxCon so probably won't get
a chance to send an updated patch for a week or two.

Regards,
Daniel


Re: [systemd-devel] Should user mode linux register with machined?

2014-10-14 Thread Daniel P. Berrange
On Fri, Oct 10, 2014 at 06:44:03PM +0200, Lennart Poettering wrote:
 On Wed, 17.09.14 10:24, Richard Weinberger (richard.weinber...@gmail.com) wrote:
 
  On Wed, Sep 17, 2014 at 1:09 AM, Zbigniew Jędrzejewski-Szmek
  zbys...@in.waw.pl wrote:
   On Tue, Sep 16, 2014 at 05:31:05PM +0200, Thomas Meyer wrote:
   Hi,
  
   I wrote a small patch for user-mode linux to register with machined by
   calling CreateMachine. Is this a good idea to do so?
   Yes, this sounds useful. After all is just another mechanism of
   virtualization, and in this case can be treated similarly to
   containers and vms.
  
  I still want a sane reason and a usecase for that.
  Can someone please educate me? :-)
  
  Please note that also qemu does not register itself to systemd.
  libvirt does. I think going down this path makes also sense for UML
  as libvirt has a UML driver too.
  qemu and the UML ELF image are the low level building blocks.
  Managers like libvirt should register the virtual machines created by
  LXC, UML, qemu, etc.. to systemd.
 
 It's a bit more complex. While UML, qemu, kvm, currently don't, LXC,
 systemd-nspawn and libvirt-lxc all do talk directly to machined. (Note
 that LXC and libvirt-lxc are separate codebases, the latter is *not* a
 wrapper around the former).

Libvirt registers both LXC & QEMU/KVM guests with machined.

We don't currently register UML guests with machined, but that
is simply because UML isn't really a high priority target for
people anymore and so the UML driver hasn't been updated to use
libvirt's cgroup/systemd integration support. From the libvirt POV
I'd be happy to see patches to make it register with machined.

I'm not sure that standalone UML binaries need to directly
integrate/register with systemd - I tend to view it as a job
for whatever is managing UML to decide to do that.

Regards,
Daniel


Re: [systemd-devel] [PATCH] journalctl: allow customizable output formats

2014-10-06 Thread Daniel P. Berrange
On Fri, Oct 03, 2014 at 02:13:51AM +0200, Zbigniew Jędrzejewski-Szmek wrote:
 On Mon, Sep 22, 2014 at 04:33:28PM +0100, Daniel P. Berrange wrote:
  The current '--output FORMAT' argument defines a number of
  common output formats, but there are some useful cases it
  does not cover. In particular when reading application logs it
  is often desirable to display the code file name, line number
  and function name. Rather than defining yet more fixed output
  formats, this patch introduces user defined output formats.
 Hi,
 
 I think this makes sense. But I think that the format strings you
 propose are damn ugly :). Using %() for variables seems too heavy.
 Also, journal fields are all text, so I don't think that specifying
 the type is useful.

Well there are two virtual fields which are timestamps which the existing
hardcoded output modes convert into a date string in various ways. I want
the format strings we define here to be able to express the semantics of
the current hardcoded output modes, so this necessitates a way to ask
for various date formats.

Also although the physically stored journal fields are strings per the
journal API & storage backend, they can be simple string versions of
other data types. eg an application defined journal field could be used
to store an integer, floating point, boolean, etc. It would be natural
for the app to use many decimal places if storing a floating point value
in the journal, so being able to give data types in the output mode lets
us alter the precision displayed when extracting it again.  Of course my
patch didn't try to do this, it only deals with dates.
 
 Maybe we could adopt the {} format from Java and Python, as
 implemented in Python [1]. It has a fairly rich and consistent field
 formatting language. We would care only about the part relevant
 to strings, at least in the beginning.

I'll see what I can cook up along these lines, but the existing Python
format language is focused on C data types and doesn't directly provide
types for the various date formats we need to support, so we can't use
it 100% as-is.

Regards,
Daniel


Re: [systemd-devel] [PATCH] journalctl: allow customizable output formats

2014-09-23 Thread Daniel P. Berrange
On Mon, Sep 22, 2014 at 12:43:28PM -0400, Daurnimator wrote:
 On 22 September 2014 11:33, Daniel P. Berrange <berra...@redhat.com> wrote:
 
  The current '--output FORMAT' argument defines a number of
  common output formats, but there are some useful cases it
  does not cover. In particular when reading application logs it
  is often desirable to display the code file name, line number
  and function name. Rather than defining yet more fixed output
  formats, this patch introduces user defined output formats.
 
  The format string is an arbitrary string which contains a
  mixture of literal text and variable substitutions. Each
  variable name corresponds to a journal field name. A variable
  name can be optionally followed by a data type, and in the
  case of string types, a length limit.
 
 
 As an opposing point of view, I've been accomplishing this by piping output
 through a script that parses and displays JSON.
 I'd rather have this style of composability than pass format strings to
 journalctl itself.

Sure you could do that, but it is really madness to expect anyone who
just wants to display a slightly different set of fields to write a
script to parse JSON and re-write it. When I have end users doing
troubleshooting of libvirt for bug reports, I want to be able to just
tell them to run a direct journalctl command to collect data I need,
not have to write or download some extra script to parse JSON.
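
For reference, the kind of helper script under discussion might look like the Python sketch below. It is a hypothetical example, not anything shipped with systemd or libvirt; it assumes one JSON object per line, as produced by 'journalctl -o json', and that the fields of interest are plain strings.

```python
import json


def format_entry(line: str) -> str:
    """Turn one line of JSON journal output into '[file:line:func] message'."""
    entry = json.loads(line)
    location = "{}:{}:{}".format(
        entry.get("CODE_FILE", "?"),
        entry.get("CODE_LINE", "?"),
        entry.get("CODE_FUNC", "?"),
    )
    return "[{}] {}".format(location, entry.get("MESSAGE", ""))


# A hand-written sample record standing in for real journal output.
sample = json.dumps({
    "CODE_FILE": "util/virlog.c",
    "CODE_LINE": "877",
    "CODE_FUNC": "virLogVMessage",
    "MESSAGE": "libvirt version: 1.1.3.1",
})
print(format_entry(sample))
```

Wired up to standard input this would be used as a filter behind 'journalctl -o json _COMM=libvirtd', which is exactly the extra step a built-in format string would save end users.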


Regards,
Daniel


Re: [systemd-devel] Cannot get Shutdown Script to Run (Libvirt Virtual Machine Shutdown)

2014-09-23 Thread Daniel P. Berrange
On Sun, Sep 21, 2014 at 11:40:03PM -0400, Alexander Groleau wrote:
 Hello systemd users,
 
 I have been trying desperately for weeks to get my simple shutdown script
 for a Libvirt guest to run before libvirtd is shut down, without success.
 Essentially, I need the libvirt-windows.sh script to run before the
 libvirtd service is terminated (which occurs right after systemd-logind
 outputs its reboot message). How can I get my script into this initial
 section of daemon shutdowns, at the top?

Any reason you've created your own shutdown script instead of using the
libvirt-guests.service script that libvirt includes? To get the ordering
right, we have a number of rules:

  - libvirtd.service contains Before=libvirt-guests.service
  - libvirt-guests.service contains After=libvirtd.service
  - The guest scope units contain After=libvirtd.service and
    Before=libvirt-guests.service

It was the two rules against the .scope units that we found to be the key
part to making shutdown work, whereby guests get stopped gracefully before
the libvirtd daemon is stopped.

The .scope units do not have any file on disk, they are generated on the
fly as libvirt talks to systemd-machined, so you've no way to alter them
to work with your custom shutdown script. Thus if you are not using the
standard libvirt-guests.service, then you should at least use the name
libvirt-guests.service for your own custom service.
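
The effect of these rules can be modelled with a small Python sketch. It relies on the documented systemd behaviour that units with an ordering dependency are stopped in the reverse of their start order; the scope name is hypothetical, and the model ignores everything else systemd considers during shutdown.

```python
from graphlib import TopologicalSorter

GUEST_SCOPE = "machine-qemu-demo.scope"   # hypothetical guest scope name

# unit -> set of units that must be started before it (its After= list)
starts_after = {
    "libvirtd.service": set(),
    GUEST_SCOPE: {"libvirtd.service"},
    "libvirt-guests.service": {"libvirtd.service", GUEST_SCOPE},
}

start_order = list(TopologicalSorter(starts_after).static_order())
stop_order = list(reversed(start_order))   # shutdown inverts the start order

print("start:", start_order)
print("stop: ", stop_order)
```

In this model libvirt-guests.service is stopped first, which is when it shuts the guests down, and libvirtd.service is stopped last. A custom unit under any other name is not referenced by the generated scopes, so it gets no such ordering.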

Regards,
Daniel


[systemd-devel] [PATCH] journalctl: allow customizable output formats

2014-09-22 Thread Daniel P. Berrange
The current '--output FORMAT' argument defines a number of
common output formats, but there are some useful cases it
does not cover. In particular when reading application logs it
is often desirable to display the code file name, line number
and function name. Rather than defining yet more fixed output
formats, this patch introduces user defined output formats.

The format string is an arbitrary string which contains a
mixture of literal text and variable substitutions. Each
variable name corresponds to a journal field name. A variable
name can be optionally followed by a data type, and in the
case of string types, a length limit.

This is best illustrated with an example:

  $ journalctl -o format:%(__REALTIME_TIMESTAMP) \
[%(CODE_FILE):%(CODE_LINE):%(CODE_FUNC)] \
%(MESSAGE:string:80)\n  _COMM=libvirtd
  -- Logs begin at Mon 2013-12-23 16:31:41 GMT, end at Mon 2014-09-22 16:13:00 BST. --
  Dec 23 17:19:25 [util/virlog.c:877:virLogVMessage] libvirt version: 1.1.3.1, package: 2.fc20 (Fedora Project, 2013-11-17-23:28:43, ...
  Dec 23 17:19:25 [conf/storage_conf.c:854:virStoragePoolDefParseXML] XML error: unknown storage pool type btrfs
  Dec 23 17:19:30 [conf/domain_conf.c:12671:virDomainObjParseNode] XML error: unexpected root element <domain>, expecting <domstatus>
  Dec 23 17:24:45 [qemu/qemu_monitor.c:653:qemuMonitorIO] internal error: End of file from monitor
  Dec 23 20:12:00 [qemu/qemu_monitor.c:653:qemuMonitorIO] internal error: End of file from monitor
  -- Reboot --
  Dec 23 21:06:14 [util/virlog.c:877:virLogVMessage] libvirt version: 1.1.3.1, package: 2.fc20 (Fedora Project, 2013-11-17-23:28:43, ...
  Dec 23 21:06:21 [conf/storage_conf.c:854:virStoragePoolDefParseXML] XML error: unknown storage pool type btrfs
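
The substitution rule described above can be sketched in Python as follows. This is an illustration, not the C code in the patch: the function name is hypothetical, the TYPE component is parsed but otherwise ignored (so no date handling), and ellipsizing is assumed to mean cutting at LEN characters and appending "...".

```python
import re

# %(NAME), %(NAME:TYPE) or %(NAME:TYPE:LEN)
VARIABLE = re.compile(r"%\(([A-Za-z0-9_]+)(?::([a-z]+))?(?::(\d+))?\)")


def expand(fmt: str, entry: dict) -> str:
    """Replace each %(NAME:TYPE:LEN) variable with the entry's field value."""
    def substitute(match):
        name, _type, limit = match.groups()
        value = entry.get(name, "")
        if limit is not None and len(value) > int(limit):
            value = value[: int(limit)] + "..."   # assumed ellipsis style
        return value
    return VARIABLE.sub(substitute, fmt)


entry = {
    "CODE_FILE": "util/virlog.c",
    "CODE_LINE": "877",
    "MESSAGE": "libvirt version: 1.1.3.1",
}
print(expand("[%(CODE_FILE):%(CODE_LINE)] %(MESSAGE:string:7)", entry))
```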

Signed-off-by: Daniel P. Berrange <berra...@redhat.com>
---
 man/journalctl.xml                    |  76 +
 src/journal-remote/journal-gatewayd.c |  11 +-
 src/journal/journalctl.c              |  39 ++-
 src/shared/logs-show.c                | 532 ++
 src/shared/logs-show.h                |  16 +-
 src/shared/output-mode.h              |   1 +
 src/systemctl/systemctl.c             |  20 +-
 7 files changed, 615 insertions(+), 80 deletions(-)

diff --git a/man/journalctl.xml b/man/journalctl.xml
index acd75a6..bd8c2bd 100644
--- a/man/journalctl.xml
+++ b/man/journalctl.xml
@@ -375,6 +375,21 @@
 </para>
 </listitem>
 </varlistentry>
+
+<varlistentry>
+<term>
+<option>format:FMT</option>
+</term>
+<listitem>
+<para>generates output
+according to the format
+specification given in
+the FMT string. See the
+OUTPUT FORMAT STRINGS
+section for details
+</para>
+</listitem>
+</varlistentry>
 </variablelist>
 </listitem>
 </varlistentry>
@@ -878,6 +893,64 @@
 </refsect1>
 
 <refsect1>
+<title>Output Format Strings</title>
+
+<para>An output format string provides precise control how journal
+data records are formatted for output. A format string consists of
+mixture of literal text and variables to be substituted with journal
+data records. A variable takes the general form</para>
+
+<programlisting>$(NAME:TYPE:LEN)</programlisting>
+
+<para>The NAME component corresponds to any journal entry field
+(eg MESSAGE, _SYSTEMD_UNIT, CODE_FUNC, etc). The TYPE component
+determines the data format to use for printing the value. If
+omitted, it defaults to a sensible format for the NAME of the
+field. The LEN component places an upper limit on the length of
+strings being printed, beyond which they will be ellipsized.
+The valid data types for TYPE are:</para>
+
+<variablelist>
+<varlistentry>
+<term>string</term>
+<listitem><para>displayed if a printable string. If the value
+contains non-printable characters


Re: [systemd-devel] Delaying (SSH) key generation until the urandom pool is initialized

2014-04-30 Thread Daniel P. Berrange
On Tue, Apr 29, 2014 at 08:43:38PM +0200, Florian Weimer wrote:
 The message at 
 https://mail.gnome.org/archives/ostree-list/2014-February/msg00010.html
 contains two boot traces from virtual machines which show that the
 SSH key is generated before the kernel pool is sufficiently seeded.

I'm wondering if the VMs that ostree is creating are being given a
virtio-rng device? If not, that would probably be a good thing to
enable so that they can get entropy. VMs are generally starved of
entropy even beyond the initial boot up stage, so a virtual RNG is
generally useful.

Regards,
Daniel


Re: [systemd-devel] Delaying (SSH) key generation until the urandom pool is initialized

2014-04-30 Thread Daniel P. Berrange
On Wed, Apr 30, 2014 at 02:10:56PM +0200, Florian Weimer wrote:
 On 04/30/2014 01:14 PM, Daniel P. Berrange wrote:
 On Tue, Apr 29, 2014 at 08:43:38PM +0200, Florian Weimer wrote:
 The message at 
 https://mail.gnome.org/archives/ostree-list/2014-February/msg00010.html
 contains two boot traces from virtual machines which show that the
 SSH key is generated before the kernel pool is sufficiently seeded.
 
 I'm wondering if the VMs that ostree is creating are being given a
 virtio-rng device ? If not that would probably be a good idea to
 enable to allow them to get entropy. VMs are generally starved of
 entropy even beyond the initial boot up stage, so a virtual RNG is
 generally useful.
 
 Interesting suggestion.  I just used virt-manager to create the VM.
 I don't see any trace for rng or random in the domain XML file.
 If it is supported, I think it should be enabled by default.

I'm told that it isn't turned on by default, but you can add it to
a VM post-install. Since it feeds VMs from the host's /dev/random
or /dev/hwrng, there was a question mark as to whether it was right
to enable it by default or not, and if so what kind of rate limiting
might be wanted by default. 

Regards,
Daniel


Re: [systemd-devel] [systemd][cgroup in container] problem with cgroup hierarchy in container

2014-03-07 Thread Daniel P. Berrange
On Thu, Mar 06, 2014 at 07:54:05PM +0100, Lennart Poettering wrote:
 On Thu, 06.03.14 16:55, Dariusz Michaluk (d.micha...@samsung.com) wrote:
 
  
  On 05.03.2014 19:16, Lennart Poettering wrote:
  nspawn and libvirt-lxc mostly follow the same code paths and register
  via machined... So it's weird that different things happen. Somehow the
  systemd instance inside the container must be confused about the cgroup
  it is running in...
  
  Next few cents. I noticed that when I run lxc-libvirt container I
  get warning Failed to install release agent, ignoring: No such file
  or directory, which does not occur when I use nspawn.
 
 Oh!
 
 Hmm, that suggests that libvirt-lxc might not mount the naked cgroupfs
 tree to /sys/fs/cgroup/systemd, but only a subdirectory. This of course
 might cause the weird setup that the host tree is duplicated for the
 container!
 
 Unfortunately it is not possible to only mount a subtree of the cgroup
 hierarchy into the container, since then the data from /proc/self/cgroup
 won't match /sys/fs/cgroup/systemd anymore... Also, the root of the
 cgroup trees has slightly different semantics and more properties than
 the children.
 
 Is this the default setup of libvirt-lxc for those dirs? I figure we
 should talk to Daniel to get that changed...

Yeah, that was set up that way a while ago, but I forgot this would
invalidate the /proc/self/cgroup information. It was a bit of a poor
man's attempt at securing cgroups, but really it is just a waste
of time unless user namespaces are available. Can someone file a
bug against libvirt for this, and we'll look at not doing this.
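
To make the mismatch concrete, here is a minimal sketch of how a process
might turn a line of /proc/self/cgroup into the directory it expects to
find under the hierarchy's mount point. The helper name and the example
paths are made up for illustration; this is not code from systemd or
libvirt, and it assumes the cgroup v1 "ID:controllers:/path" line format.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: given one line of /proc/self/cgroup
 * ("ID:controllers:/path") and the mount point of that hierarchy,
 * write the directory where the process expects to find its cgroup.
 * Returns 0 on success, -1 on malformed input or a too-small buffer. */
static int expected_cgroup_dir(const char *line, const char *mountpoint,
                               char *out, size_t outlen)
{
    const char *first, *second;
    int n;

    assert(line && mountpoint && out);

    /* Skip the hierarchy ID and the controller list */
    first = strchr(line, ':');
    if (!first)
        return -1;
    second = strchr(first + 1, ':');
    if (!second || second[1] != '/')
        return -1;

    /* Append the cgroup path, without any trailing newline */
    n = snprintf(out, outlen, "%s%.*s", mountpoint,
                 (int) strcspn(second + 1, "\n"), second + 1);
    if (n < 0 || (size_t) n >= outlen)
        return -1;
    return 0;
}
```

If only the container's own subdirectory is mounted at
/sys/fs/cgroup/systemd, the path computed this way points at a directory
that does not exist inside the container, which is the inconsistency
described above.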

 Each container really needs to see the full tree. The best thing
 possible to make sure that the containers can't muck with anything
 outside of the tree is to mount the upper parts read-only with a bind
 mount, but other than that i don't see that we could do anything
 there...

User namespaces are the best bet here. Once the root UID is remapped,
the container won't be able to move itself out of its subtree.


Regards,
Daniel


Re: [systemd-devel] Build warnings for ARM due to -Wcast-align

2014-02-20 Thread Daniel P. Berrange
On Thu, Feb 20, 2014 at 05:21:22PM +0100, Lennart Poettering wrote:
 On Thu, 20.02.14 17:03, Daniel Mack (dan...@zonque.org) wrote:
 
  Hi,
  
  When cross-compiling the current git HEAD for ARM using gcc 4.8.2, I see
  ~160 warnings similar to this one:
  
  src/core/unit.c: In function 'unit_get_exec_runtime':
  src/core/unit.c:2851:17: warning: cast increases required alignment of
  target type [-Wcast-align]
   return *(ExecRuntime**) ((uint8_t*) u + offset);
   ^
  
  The full build log is here:
  
http://paste.fedoraproject.org/78944/92912005
  
  Unaligned memory access is indeed unsupported by some older instruction
  cores. The kernel can fix up in situations where such unaligned access
  occurs, but that's of course expensive and slow.
  
  However, systemd does not actually do unaligned memory access at runtime
  (at least I haven't seen any when booting up PXA3xx hardware). The
  warning is simply about the type of pointer arithmetic that casts to and
  from uint8_t*.
  
  And because it's practically impossible to fix the things the compiler
  complains about here anyway, I propose removing -Wcast-align from the
  CFLAGS in configure.ac.
  
  Any opinions?
 
 I am fine with that. I am personally only running things on x86, so it
 never showed up for me. The usual solution for cast issues is to use
 some union-based type conversion, but in the case above this is not
 really nicely possible. Hence, let's drop it, unless somebody has a
 better solution...

I think cast-align warnings are fairly useful, since many of the things
they flag turn out to be genuine bugs, so it is not entirely desirable to
disable them altogether. In libvirt we just mark the few cases which
are false positives with a pragma

  #define VIR_WARNINGS_NO_CAST_ALIGN \
    _Pragma ("GCC diagnostic push") \
    _Pragma ("GCC diagnostic ignored \"-Wcast-align\"")

  #define VIR_WARNINGS_RESET \
    _Pragma ("GCC diagnostic pop")


And then just mark it thus

  VIR_WARNINGS_NO_CAST_ALIGN
  ...code with false positive
  VIR_WARNINGS_RESET
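
For illustration, a self-contained sketch of the pattern around a cast
similar to the one in the warning quoted earlier. The struct and the
helper are made up for this example and are not taken from libvirt or
systemd; the assert simply spells out the alignment assumption that makes
the warning a false positive.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VIR_WARNINGS_NO_CAST_ALIGN \
    _Pragma ("GCC diagnostic push") \
    _Pragma ("GCC diagnostic ignored \"-Wcast-align\"")

#define VIR_WARNINGS_RESET \
    _Pragma ("GCC diagnostic pop")

/* Hypothetical record type, only here to have a member at a known offset. */
struct record {
    uint8_t tag;
    uint32_t value;
};

/* Read a uint32_t located 'offset' bytes into 'base'. The caller promises
 * the offset came from offsetof(), so the access is aligned and a
 * -Wcast-align warning on the cast would be a false positive. */
static uint32_t read_u32_at(const void *base, size_t offset)
{
    const uint8_t *p = (const uint8_t *) base + offset;
    uint32_t v;

    /* Make the alignment assumption explicit */
    assert((uintptr_t) p % _Alignof(uint32_t) == 0);

    VIR_WARNINGS_NO_CAST_ALIGN
    v = *(const uint32_t *) p;
    VIR_WARNINGS_RESET

    return v;
}
```

Because the push/pop pair restores the previous diagnostic state, the
warning stays enabled for the rest of the file.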

Regards,
Daniel


Re: [systemd-devel] Howto run systemd within a linux container

2014-02-06 Thread Daniel P. Berrange
On Wed, Feb 05, 2014 at 11:44:33PM +0100, Richard Weinberger wrote:
 Hi!
 
 We're heavily using Linux containers in our production environment.
 As modern Linux distributions move to systemd, we have to make sure that
 systemd works within our containers.
 
 Sadly we're facing issues with cgroups.
 Our testbed consists of openSUSE 13.1 with Linux 3.13.1 and libvirt 1.2.1.
 
 In a plain setup systemd stops immediately because it is unable to
 create the cgroup hierarchy.
 Mostly because the container uid 0 is in a user namespace and has no
 rights to do that.

FYI I have successfully run Fedora 19 with systemd inside a container
with libvirt LXC, however, I did *not* enable user namespaces. Every
time I try user namespaces I find some other bug in either the kernel
or libvirt, so I wouldn't be surprised if yet more breakage has
occurred in user namespaces :-(


 Next try, fool systemd by mounting a tmpfs to /sys/fs/cgroup/systemd/.
 This seems to work. openSUSE boots, I can start/stop services...
 Shutdown hangs forever, had no time to investigate so far.
 
 But is this tmpfs hack the correct way to run systemd in a container?
 I really don't think so.

Yeah, that really shouldn't be needed. When libvirt runs a container it
creates a cgroup just for that container to run in, and systemd should
be able to create its hierarchy under that location.

That said, I wonder if libvirt is perhaps forgetting to chown() the
cgroup to the UID/GID you've mapped for the root user. That would
certainly prevent systemd using it and could cause the sort of pain
you see.


Daniel


Re: [systemd-devel] Howto run systemd within a linux container

2014-02-06 Thread Daniel P. Berrange
On Thu, Feb 06, 2014 at 04:33:22PM +0100, Greg KH wrote:
 On Thu, Feb 06, 2014 at 10:55:01AM +, Daniel P. Berrange wrote:
  On Wed, Feb 05, 2014 at 11:44:33PM +0100, Richard Weinberger wrote:
   Hi!
   
   We're heavily using Linux containers in our production environment.
   As modern Linux distributions move to systemd, we have to make sure that
   systemd works within our containers.
   
   Sadly we're facing issues with cgroups.
   Our testbed consists of openSUSE 13.1 with Linux 3.13.1 and libvirt 1.2.1.
   
   In a plain setup systemd stops immediately because it is unable to
   create the cgroup hierarchy.
   Mostly because the container uid 0 is in a user namespace and has no
   rights to do that.
  
  FYI I have successfully run Fedora 19 with systemd inside a container
  with libvirt LXC, however, I did *not* enable user namespaces. Every
  time I try user namespaces I find some other bug in either the kernel
  or libvirt, so I wouldn't be surprised if yet more breakage has
  occurred in user namespaces :-(
 
 Those bugs should now be fixed, if you don't enable the option, how are
 we supposed to know what is left to be done?  :)

I have in fact been building my own kernels for Fedora with user namespaces
enabled to debug / test this and have reported all the bugs I found so far.
Just saying that with the track record of bugs since the userns code first
merged, I wouldn't be surprised if there were still more things to iron
out as we try more real world apps like systemd.

Regards,
Daniel


Re: [systemd-devel] consider renaming -.slice

2014-01-16 Thread Daniel P. Berrange
On Thu, Jan 16, 2014 at 11:27:42AM +0100, Holger Schurig wrote:
 Oh, I confused that with the old /etc/systemd/systemd-journald.conf
 file, which was renamed.
 
 Kay, I only meant to special case the /, e.g. let
 home-kay-data.mount be it like it is, but rename - to root, so
 that it is root.mount and root.slice.

This would still break applications like libvirt which expect the
current naming conventions. These naming conventions must be
considered to be part of the stable API for apps and so cannot
be changed once included in an official release.


Daniel


Re: [systemd-devel] getty : how to run getty on every ttyX

2013-12-16 Thread Daniel P. Berrange
On Fri, Dec 13, 2013 at 05:20:19PM +0100, Lennart Poettering wrote:
 On Fri, 13.12.13 16:15, Lennart Poettering (lenn...@poettering.net) wrote:
 
   We had discussed this back at Linux Plumbers last year, and at the time
   you had suggested that rather than create /dev/ttyN symlinks we should
   instead do something like  /dev/containerttyN instead, and set a
   'container_tty' variable containing a list of all those device names
   so that systemd can discover them sensibly. We never got around to
   doing this from the libvirt side, and AFAIK systemd hasn't done anything
   on its side either. So is this still a suitable way forward ?
  
  Yeah, I am pretty sure that's what we should do. I figure I should hack
  that up. I'll work on it now.
 
 Committed. systemd-getty-generator will now look for $container_ttys
 set as an environment variable for PID 1. If that is set it will split
 the string up on whitespaces and start a getty on all ptys
 referenced. Note that this only supports ptys, not any other ttys. 
 
 Example:
 
 container_ttys=pts/5 pts/8 pts/15
 
 when passed to PID 1 will spawn three additional gettys on ptys 5, 8 and
 15.
 
 Note that this *really* only supports ptys, not any other kinds of ttys,
 since for those we require proper device enumeration and notification
 and we don't have those in containers... I still chose to name this
 $container_ttys rather than $container_ptys, so that maybe one day we
 can extend it should devices like this ever get virtualized.
 
 This will be in systemd 209.

I've tested this with libvirt and it worked except for one small edge
case.

Say libvirt creates 3 consoles: /dev/pts/0, /dev/pts/1 and /dev/pts/2.
Now we set container_ttys="pts/0 pts/1 pts/2", and systemd starts up 3
agetty processes, one for each of these.

The /dev/console device, however, is also a link to /dev/pts/0,
and so systemd starts up an agetty process for that too.

Now we have 2 agetty processes fighting over /dev/pts/0, which ends
in tears.

Is this something that systemd should detect and cope with, or should we
document that the 'container_ttys' env *must exclude* any tty associated
with the /dev/console device?

Regards,
Daniel


Re: [systemd-devel] getty : how to run getty on every ttyX

2013-12-16 Thread Daniel P. Berrange
On Mon, Dec 16, 2013 at 05:33:12PM +0100, Lennart Poettering wrote:
 On Mon, 16.12.13 12:03, Daniel P. Berrange (berra...@redhat.com) wrote:
 
   Note that this *really* only supports ptys, not any other kinds of ttys,
   since for those we require proper device enumeration and notification
   and we don't have those in containers... I still chose to name this
   $container_ttys rather than $container_ptys, so that maybe one day we
   can extend it should devices like this ever get virtualized.
   
   This will be in systemd 209.
  
  I've tested this with libvirt and it worked except for one small edge
  case.
  
  Say libvirt creates 3 consoles: /dev/pts/0, /dev/pts/1 and /dev/pts/2.
  Now we set container_ttys="pts/0 pts/1 pts/2", and systemd starts up 3
  agetty processes, one for each of these.
  
  The /dev/console device, however, is also a link to /dev/pts/0,
  and so systemd starts up an agetty process for that too.
  
  Now we have 2 agetty processes fighting over /dev/pts/0, which ends
  in tears.
  
  Is this something that systemd should detect and cope with, or should we
  document that the 'container_ttys' env *must exclude* any tty associated
  with the /dev/console device?
 
 I am tempted to say that we should do the latter, it's quite difficult
 to figure out when they point to the same device (for example, because people
 use a bind mount rather than a symlink), and the roles of the console
 and the other $container_ttys are quite different during boot if we want
 to avoid printing logs over the getty and so on...
 
 I added this to the wiki text now.

Ok, sounds good. I'll update libvirt to take account of this.
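
As a rough sketch of what that exclusion could look like on the container
manager's side: the helper below joins the pty names with single spaces
and skips the one that backs /dev/console. The function name and its
signature are hypothetical, chosen for this example only; it is not the
actual libvirt change.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not libvirt code: build the value for the
 * container_ttys environment variable from a list of pty names,
 * leaving out the pty that backs /dev/console.
 * Returns 0 on success, -1 if the output buffer is too small. */
static int build_container_ttys(const char *const *ptys, size_t nptys,
                                const char *console_pty,
                                char *out, size_t outlen)
{
    size_t used = 0;

    assert(out != NULL && outlen > 0);
    out[0] = '\0';

    for (size_t i = 0; i < nptys; i++) {
        int n;

        /* The console already gets its own getty, so skip it */
        if (console_pty && strcmp(ptys[i], console_pty) == 0)
            continue;

        n = snprintf(out + used, outlen - used, "%s%s",
                     used ? " " : "", ptys[i]);
        if (n < 0 || (size_t) n >= outlen - used)
            return -1;
        used += (size_t) n;
    }
    return 0;
}
```

With the three consoles from the example above and pts/0 backing
/dev/console, this yields "pts/1 pts/2". It compares names only, so it
assumes the manager already knows which pty it wired up as the console.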

Daniel


Re: [systemd-devel] getty : how to run getty on every ttyX

2013-12-13 Thread Daniel P. Berrange
On Fri, Dec 13, 2013 at 04:06:56PM +0100, Lennart Poettering wrote:
 On Fri, 13.12.13 16:34, Gao feng (gaof...@cn.fujitsu.com) wrote:
 
   As we know, systemd only forks getty on ttyX when we press ctrl + alt + FX.
   I would like to let systemd fork several gettys on all of the tty devices
   by default.
   This is very useful in a container environment, since we can't use
   ctrl+alt+FX to trigger getty in a container.
   
   ...do containers even have such devices?
   
  
  pts device ;)
 
 This will not work. Unlike VT ttys, which exist continuously and
 perpetually on a system, ptys only exist when an application allocates
 them, like for example xterm or ssh. However, for them it's xterm's or
 ssh's job to spawn a shell as backend.
 
   Anyway, just enable more instances of getty@.service for all devices you 
   need, just like getty@tty1.service is started by default.
   
   The autostart that you mention is part of logind and all it does is just 
   start the same services via systemd, no magic.
   
  
  getty@tty1.service under /etc/systemd/system/getty.target.wants/ is linked 
  to /usr/lib/systemd/system/getty@.service,
  so I create getty@tty2.service which links to 
  /usr/lib/systemd/system/getty@.service too. is this right?
  
  In libvirt lxc, the ttyX are actually pts devices.
  
  [root@localhost getty.target.wants]# ll
  total 0
  lrwxrwxrwx 1 root root 38 Dec 13 02:49 getty@tty1.service -> /usr/lib/systemd/system/getty@.service
  lrwxrwxrwx 1 root root 38 Dec 13 03:22 getty@tty2.service -> /usr/lib/systemd/system/getty@.service
  
  seems like in my container, agetty listens on /dev/console, not tty1 or tty2
  /sbin/agetty --noclear --keep-baud console 115200 38400 9600
  
  it seems getty-generator does the extra job.
 
 /dev/tty1, /dev/tty2, ... make no sense in containers as there is no
 virtual console. 

For each console device that you list in the container configuration
with libvirt, it will allocate a /dev/pts/NNN device, and add a symlink
from /dev/ttyN to the /dev/pts/NNN slave. What we'd like is for the guest
OS to be able to set up agetty processes automatically on any console
device libvirt has configured for the container.

We had discussed this back at Linux Plumbers last year, and at the time
you had suggested that rather than create /dev/ttyN symlinks we should
do something like /dev/containerttyN instead, and set a
'container_tty' variable containing a list of all those device names
so that systemd can discover them sensibly. We never got around to
doing this from the libvirt side, and AFAIK systemd hasn't done anything
on its side either. So is this still a suitable way forward?

Regards,
Daniel


Re: [systemd-devel] getty : how to run getty on every ttyX

2013-12-13 Thread Daniel P. Berrange
On Fri, Dec 13, 2013 at 05:20:19PM +0100, Lennart Poettering wrote:
 On Fri, 13.12.13 16:15, Lennart Poettering (lenn...@poettering.net) wrote:
 
   We had discussed this back at Linux Plumbers last year, and at the time
   you had suggested that rather than create /dev/ttyN symlinks we should
   instead do something like  /dev/containerttyN instead, and set a
   'container_tty' variable containing a list of all those device names
   so that systemd can discover them sensibly. We never got around to
   doing this from the libvirt side, and AFAIK systemd hasn't done anything
   on its side either. So is this still a suitable way forward ?
  
  Yeah, I am pretty sure that's what we should do. I figure I should hack
  that up. I'll work on it now.
 
 Committed. systemd-getty-generator will now look for $container_ttys
 set as an environment variable for PID 1. If that is set it will split
 the string up on whitespaces and start a getty on all ptys
 referenced. Note that this only supports ptys, not any other ttys. 
 
 Example:
 
 container_ttys=pts/5 pts/8 pts/15
 
 when passed to PID 1 will spawn three additional gettys on ptys 5, 8 and
 15.
 
 Note that this *really* only supports ptys, not any other kinds of ttys,
 since for those we require proper device enumeration and notification
 and we don't have those in containers... I still chose to name this
 $container_ttys rather than $container_ptys, so that maybe one day we
 can extend it should devices like this ever get virtualized.
 
 This will be in systemd 209.

Great, that all sounds good to me.

Daniel


Re: [systemd-devel] machines get killed when scopes are destroyed

2013-11-18 Thread Daniel P. Berrange
On Mon, Nov 18, 2013 at 03:03:18AM +0100, Zbigniew Jędrzejewski-Szmek wrote:
 v0lZy reported on IRC that his qemu machines get killed when shutting
 down the host. libvirt-guests.service is designed to suspend them
 during shutdown, but when it was run, the guests were all already dead.
 
 And indeed, each qemu is running inside a scope, which is not
 connected by any dependencies to either systemd-machine.service, or
 libvirt-guests.service. libvirt-guests.service does not depend on
 systemd-machine.service either. This means that when shutdown is
 ordered, the scopes will be stopped in parallel with
 libvirt-guests.service, and depending on timing, qemus will be just
 killed with SIGTERM.
 
 For this whole thing to work correctly, we need to ensure that
 scopes are not terminated prematurely. If we introduced a target
 like libvirt-ready.target, and made libvirt-guests.service be
 After=libvirt-ready.target, and made all the scopes be
 Before=libvirt-ready.target, I think the vms would have a chance
 to shutdown properly. But that's pretty complicated.
 And I'm not even sure how to do that properly. Any better
 ideas?

I don't have an answer for you, but just want to ask that you file a
bug against libvirt for this problem. This is an unintended regression
in libvirt functionality with the switch to systemd scopes.

http://libvirt.org/bugs.html

Regards,
Daniel


Re: [systemd-devel] [PATCH] netns: unix: only allow to find out unix socket in same net namespace

2013-08-21 Thread Daniel P. Berrange
On Wed, Aug 21, 2013 at 11:51:53AM +0200, Kay Sievers wrote:
 On Wed, Aug 21, 2013 at 9:22 AM, Gao feng gaof...@cn.fujitsu.com wrote:
  On 08/21/2013 03:06 PM, Eric W. Biederman wrote:
 
  I suspect libvirt should simply not share /run or any other normally
  writable directory with the host.  Sharing /run /var/run or even /tmp
  seems extremely dubious if you want some kind of containment, and
  without strange things spilling through.
 
 Right, /run or /var cannot be shared. It's not only about sockets,
 many other things will also go really wrong that way.

Libvirt already allows the app defining the container config to
set private mounts for any directory including /run and /var.

If an admin or app wants to run systemd inside a container, it is
their responsibility to ensure they set up the filesystem in a
suitable manner. Libvirt is not going to enforce use of a private
/run or /var, since that's a policy decision for a specific
use case.


Daniel


Re: [systemd-devel] udev within a container

2013-07-26 Thread Daniel P. Berrange
On Fri, Jul 26, 2013 at 05:16:16PM +0200, Kay Sievers wrote:
 On Fri, Jul 26, 2013 at 5:09 PM, Rob Spanton rspan...@zepler.net wrote:
  I would like to run some processes inside a container that interact with
  udev.  Ideally udev would be within the same container as those
  processes, as then I could also have udev rules that started other
  things within that container too...
 
  However, as far as I can tell, it's not possible to run udev within a
  container -- is this correct?  Is there a magical workaround that I
  haven't found!?
 
 There is no real support to run udev inside a container. It might be
 possible to hack around that, but it's nothing that seems convincing
 so far. We just run containers without any udev setup, but with a
 minimal pre-setup /dev.

Furthermore, in the absence of any devices namespace in the kernel,
it would be a security flaw to allow a process inside the container
(whether udevd or something else) the permission to mknod. So you must
always pre-populate /dev and remove CAP_MKNOD instead of running udev,
if you want any security.

Daniel


[systemd-devel] Error handling problems with systemd-machined

2013-07-24 Thread Daniel P. Berrange
I'm working on integrating libvirt with systemd-machined for cgroups
setup and hitting a number of problems.

The first was that v205 ignores all parameters passed through as scope
properties in the DBus CreateMachine call. So I upgraded to v206 which
seems to have fixed that.

When something goes wrong with the CreateMachine DBus call, though, all I
ever seem to get back is "Input/output error".

After strace'ing systemd-machined I find the real error

recvmsg(5, {msg_name(0)=NULL, 
msg_iov(1)=[{l\1\0\1\334\0\0\0\2\0\0\0\277\0\0\0\1\1o\0\31\0\0\0/org/freedesktop/machine1\0\0\0\0\0\0\0\6\1s\0\30\0\0\0org.freedesktop.machine1\0\0\0\0\0\0\0\0\2\1s\0
 
\0\0\0org.freedesktop.machine1.Manager\0\0\0\0\0\0\0\0\3\1s\0\r\0\0\0CreateMachine\0\0\0\10\1g\0\fsayssusa(sv)\0\0\0\0\0\0\0\7\1s\0\6\0\0\0:1.130\0\0\t\0\0\0lxc-busy2\0\0\0\20\0\0\0\335\247\271G\10F\27Y(s\0177]\367\327\353\v\0\0\0libvirt-lxc\0\t\0\0\0container\0\0\0\210:\0\0\0\0\0\0\0\0\0\0\204\0\0\0\0\0\0\0\5\0\0\0Slice\0\1s\0\0\0\0\16\0\0\0/machine.slice\0\0\0\0\0\0\r\0\0\0CPUAccounting\0\1b\0\0\0\0\1\0\0\0\0\0\0\0\21\0\0\0BlockIOAccounting\0\1b\0\0\0\0\1\0\0\0\20\0\0\0MemoryAccounting\0\1b\0\1\0\0\0,
 2048}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 428
sendmsg(5, {msg_name(0)=NULL, 
msg_iov(2)=[{l\1\0\1D\1\0\0\n\0\0\0\255\0\0\0\1\1o\0\31\0\0\0/org/freedesktop/systemd1\0\0\0\0\0\0\0\6\1s\0\30\0\0\0org.freedesktop.systemd1\0\0\0\0\0\0\0\0\2\1s\0
 
\0\0\0org.freedesktop.systemd1.Manager\0\0\0\0\0\0\0\0\3\1s\0\22\0\0\0StartTransientUnit\0\0\0\0\0\0\10\1g\0\7ssa(sv)\0\0\0\0,
 192}, 
{\32\0\0\0machine-lxc\\x2dbusy2.scope\0\0\4\0\0\0fail\0\0\0\0\24\1\0\0\5\0\0\0Slice\0\1s\0\0\0\0\r\0\0\0machine.slice\0\0\0\0\0\0\0\v\0\0\0Description\0\1s\0\0\23\0\0\0Container
 lxc-busy2\0\0\0\0\0\17\0\0\0TimeoutStopUSec\0\1t\0\0 
\241\7\0\0\0\0\0\4\0\0\0PIDs\0\2au\0\0\0\0\4\0\0\0\210:\0\0\5\0\0\0Slice\0\1s\0\0\0\0\16\0\0\0/machine.slice\0\0\0\0\0\0\r\0\0\0CPUAccounting\0\1b\0\0\0\0\1\0\0\0\0\0\0\0\21\0\0\0BlockIOAccounting\0\1b\0\0\0\0\1\0\0\0\20\0\0\0MemoryAccounting\0\1b\0\1\0\0\0,
 324}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 516
recvmsg(5, {msg_name(0)=NULL, 
msg_iov(1)=[{l\3\1\1+\0\0\0\265\1\0\0]\0\0\0\6\1s\0\6\0\0\0:1.126\0\0\4\1s\0$\0\0\0org.freedesktop.systemd1.InvalidName\0\0\0\0\5\1u\0\n\0\0\0\10\1g\0\1s\0\0\7\1s\0\4\0\0\0:1.1\0\0\0\0\0\0\0Unit
 name /machine.slice is not valid.\0, 2048}], msg_controllen=0, 
msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 155
sendmsg(3, {msg_name(0)=NULL, 
msg_iov(4)=[{PRIORITY=3\nSYSLOG_FACILITY=4\nCODE_FILE=src/machine/machine.c\nCODE_LINE=246\nCODE_FUNCTION=machine_start_scope\nSYSLOG_IDENTIFIER=systemd-machined\n,
 144}, {MESSAGE=, 8}, {Failed to start machine scope: Unit name 
/machine.slice is not valid., 69}, {\n, 1}], msg_controllen=0, msg_flags=0}, 
MSG_NOSIGNAL) = 222
sendmsg(5, {msg_name(0)=NULL, 
msg_iov(2)=[{l\3\1\1\27\0\0\0\v\0\0\0O\0\0\0\6\1s\0\6\0\0\0:1.130\0\0\4\1s\0\\0\0\0org.freedesktop.DBus.Error.IOError\0\0\0\0\0\0\5\1u\0\2\0\0\0\10\1g\0\1s\0\0,
 96}, {\22\0\0\0Input/output error\0, 23}], msg_controllen=0, msg_flags=0}, 
MSG_NOSIGNAL) = 119


So machined is getting a useful error back from systemd

  Unit name /machine.slice is not valid.

and syslog'ing that error, and then sending the dbus client a useless
"Input/output error" message :-(

Once I fixed the unit name by removing the leading '/', I hit a second
error

recvmsg(5, {msg_name(0)=NULL, 
msg_iov(1)=[{l\3\1\0014\0\0\0\301\1\0\0]\0\0\0\6\1s\0\6\0\0\0:1.126\0\0\4\1s\0#\0\0\0org.freedesktop.systemd1.UnitExists\0\0\0\0\0\5\1u\0\f\0\0\0\10\1g\0\1s\0\0\7\1s\0\4\0\0\0:1.1\0\0\0\0/\0\0\0Unit
 machine-lxc\\x2dbusy2.scope already exists.\0, 2048}], msg_controllen=0, 
msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 164

  Unit machine-lxc\\x2dbusy2.scope already exists

But neither machinectl list nor systemctl --full shows any such machine
or unit existing. It seems like when it reported the bogus slice name,
it did not fully clean up the transient scope unit it created. This is
then blocking further attempts to create the same transient scope.

Regards,
Daniel


Re: [systemd-devel] systemd shutdown vs ostree

2013-07-24 Thread Daniel P. Berrange
On Sat, Jul 20, 2013 at 06:50:13PM -0400, Colin Walters wrote:
 So OSTree sets up systemd inside a chroot - /usr is a read-only bind
 mount, and /var is a bind mount outside the root to a shared location.
 Furthermore, /sysroot points to the real root.
 
 Since last time we discussed this:
 http://lists.freedesktop.org/archives/systemd-devel/2012-September/006668.html
 I now use this service inside dracut:
 https://git.gnome.org/browse/ostree/tree/src/dracut/ostree-prepare-root.service
 Which executes:
 https://git.gnome.org/browse/ostree/tree/src/switchroot/ostree-prepare-root.c
 
 Then finally we do dracut's normal systemctl switch-root, and everything
 continues as normal.  I haven't had to patch the systemd codebase at all
 for this.
 
 The problem is that on shutdown, systemd will synthesize usr.mount and
 var.mount from /proc/self/mountinfo, but it can't really unmount them
 until the same point as the rootfs.  Because these units fail to
 unmount, the normal shutdown process wedges.
 
 I can shutdown fine with systemctl --force poweroff, but then I don't
 get plymouth integration etc.
 
 One way to fix this might be to somehow tell systemd to just ignore
 these mount points during shutdown.  Or possibly, switch back to the
 initramfs and unmount them from there.
 
 The ugly thing about switching back to the initramfs is that it requires
 unpacking it from the cpio blob again, which requires /boot to be
 mounted, only to run a few unmount syscalls, and then finally power off.
 
 But if there was a way to tell systemd to just ignore the mounts, then
 we'd drop into the final poweroff SIGTERM/SIGKILL/umount spree like
 sysvinit did, and things would work.
 
 Anyone else doing bind mount tricks like this?

A while back I had a similar-ish kind of problem with LXC, when the original
FS had something mounted at, say, /foo/bar/wizz, and then libvirt bind mounted
something at /foo, making /foo/bar/wizz inaccessible. systemd would
still see these over-mounted mounts and fail to unmount them at shutdown.
I fixed libvirt LXC to remove all sub-mounts before bind mounting the
new thing at /foo, so I'm not sure whether the problems I saw with systemd
would still exist.

There is also a change proposed for the kernel namespaces yesterday to
make it possible to stop a process inside a container from unmounting
things that weren't originally mounted inside the namespace. So if that
is merged, systemd inside a container wouldn't be able to assume it
has the privileges to unmount all filesystems it can see.

Daniel


Re: [systemd-devel] Error handling problems with systemd-machined

2013-07-24 Thread Daniel P. Berrange
On Wed, Jul 24, 2013 at 02:13:30PM +0100, Daniel P. Berrange wrote:
 I'm working on integrating libvirt with systemd-machined for cgroups
 setup and hitting a number of problems

A further discovery - if I pass MemoryAccounting=yes as a scope
property, then the process gets immediately killed by the OOM
killer

Jul 24 14:30:26 localhost systemd[1]: Starting Container lxc-busy3.
Jul 24 14:30:26 localhost systemd[1]: Started Container lxc-busy3.
Jul 24 14:30:26 localhost systemd-machined[14756]: New machine lxc-busy3.
Jul 24 14:30:26 localhost kernel: [ 4326.760834] libvirt_lxc invoked 
oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
Jul 24 14:30:26 localhost kernel: [ 4326.760839] libvirt_lxc cpuset=/ 
mems_allowed=0
Jul 24 14:30:26 localhost kernel: [ 4326.760841] Pid: 26196, comm: libvirt_lxc 
Not tainted 3.9.0-0.rc1.git0.1.fc19.x86_64 #1
Jul 24 14:30:26 localhost kernel: [ 4326.760843] Call Trace:
Jul 24 14:30:26 localhost kernel: [ 4326.760852]  [810d2da6] ? 
cpuset_print_task_mems_allowed+0x96/0xc0
Jul 24 14:30:26 localhost kernel: [ 4326.760857]  [8163cc20] 
dump_header+0x7a/0x1b3
Jul 24 14:30:26 localhost kernel: [ 4326.760860]  [8113242e] 
oom_kill_process+0x1be/0x310
Jul 24 14:30:26 localhost kernel: [ 4326.760864]  [811913d5] 
__mem_cgroup_try_charge+0xad5/0xb20
Jul 24 14:30:26 localhost kernel: [ 4326.760866]  [81191c80] ? 
mem_cgroup_charge_common+0x120/0x120
Jul 24 14:30:26 localhost kernel: [ 4326.760869]  [81191be6] 
mem_cgroup_charge_common+0x86/0x120
Jul 24 14:30:26 localhost kernel: [ 4326.760871]  [8119349b] 
mem_cgroup_newpage_charge+0x4b/0xb0
Jul 24 14:30:26 localhost kernel: [ 4326.760874]  [8115954c] 
handle_pte_fault+0x71c/0xa30
Jul 24 14:30:26 localhost kernel: [ 4326.760877]  [81217039] ? 
ext4_file_write+0x99/0x3f0
Jul 24 14:30:26 localhost kernel: [ 4326.760880]  [815223e2] ? 
__sys_recvmsg+0x112/0x290
Jul 24 14:30:26 localhost kernel: [ 4326.760882]  [8115a671] 
handle_mm_fault+0x291/0x660
Jul 24 14:30:26 localhost kernel: [ 4326.760887]  [816498e1] 
__do_page_fault+0x171/0x4f0
Jul 24 14:30:26 localhost kernel: [ 4326.760890]  [811d7bd1] ? 
fsnotify+0x241/0x320
Jul 24 14:30:26 localhost kernel: [ 4326.760892]  [81649c6e] 
do_page_fault+0xe/0x10
Jul 24 14:30:26 localhost kernel: [ 4326.760894]  [816493aa] 
do_async_page_fault+0x2a/0xa0
Jul 24 14:30:26 localhost kernel: [ 4326.760896]  [81646388] 
async_page_fault+0x28/0x30
Jul 24 14:30:26 localhost kernel: [ 4326.760899] Task in 
/machine.slice/machine-lxc\x2dbusy3.scope killed as a result of limit of 
/machine.slice/machine-lxc\x2dbusy3.scope
Jul 24 14:30:26 localhost kernel: [ 4326.760901] memory: usage 0kB, limit 0kB, 
failcnt 7
Jul 24 14:30:26 localhost kernel: [ 4326.760920] memory+swap: usage 0kB, limit 
9007199254740991kB, failcnt 0
Jul 24 14:30:26 localhost kernel: [ 4326.760921] kmem: usage 0kB, limit 
9007199254740991kB, failcnt 0
Jul 24 14:30:26 localhost kernel: [ 4326.760923] Memory cgroup stats for 
/machine.slice/machine-lxc\x2dbusy3.scope: cache:0KB rss:0KB mapped_file:0KB 
inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB 
unevictable:0KB
Jul 24 14:30:26 localhost kernel: [ 4326.760931] [ pid ]   uid  tgid total_vm   
   rss nr_ptes swapents oom_score_adj name
Jul 24 14:30:26 localhost kernel: [ 4326.760956] [26196] 0 2619629225   
  1732  560 0 libvirt_lxc
Jul 24 14:30:26 localhost kernel: [ 4326.760958] Memory cgroup out of memory: 
Kill process 26196 (libvirt_lxc) score 0 or sacrifice child
Jul 24 14:30:26 localhost kernel: [ 4326.761064] Killed process 26196 
(libvirt_lxc) total-vm:116900kB, anon-rss:3256kB, file-rss:3672kB
Jul 24 14:30:26 localhost kernel: [ 4326.776462] virbr0: port 2(veth0) entered 
disabled state
Jul 24 14:30:26 localhost kernel: [ 4326.777526] device veth0 left promiscuous 
mode
Jul 24 14:30:26 localhost kernel: [ 4326.777548] virbr0: port 2(veth0) entered 
disabled state
Jul 24 14:30:26 localhost avahi-daemon[431]: Withdrawing workstation service 
for veth1.
Jul 24 14:30:26 localhost avahi-daemon[431]: Withdrawing workstation service 
for veth0.
Jul 24 14:30:26 localhost systemd-machined[14756]: Machine lxc-busy3 terminated.

It looks like when passing MemoryAccounting=yes, systemd is accidentally
initializing the cgroup memory limit to 0 kB, with obvious results.

Regards,
Daniel


Re: [systemd-devel] Error handling problems with systemd-machined

2013-07-24 Thread Daniel P. Berrange
On Wed, Jul 24, 2013 at 02:36:44PM +0100, Daniel P. Berrange wrote:
 On Wed, Jul 24, 2013 at 02:13:30PM +0100, Daniel P. Berrange wrote:
  I'm working on integrating libvirt with systemd-machined for cgroups
  setup and hitting a number of problems
 
 A further discovery - if I pass MemoryAccounting=yes as a scope
 property, then the process gets immediately killed by the OOM
 killer

[snip]

 It looks like when passing MemoryAccounting=yes, systemd is accidentally
 initializing the cgroup memory limit to 0 kB, with obvious results.

It can be reproduced with simple '.slice' units defined outside of
systemd-machined too

# cat /etc/systemd/system/machine-demo.slice
[Unit]
Description=Demo slice

[Slice]
CPUAccounting=yes
MemoryAccounting=yes


# systemctl start machine-demo.slice

# cat /sys/fs/cgroup/memory/machine.slice/machine-demo.slice/memory.limit_in_bytes
0
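
For anyone wanting to script the same check, a minimal sketch follows. It assumes the cgroup v1 memory controller layout used above; the helper names are mine and purely illustrative.

```python
import os


def read_memory_limit(cgroup_dir):
    """Return the cgroup v1 memory limit, in bytes, for a cgroup directory."""
    with open(os.path.join(cgroup_dir, "memory.limit_in_bytes")) as f:
        return int(f.read().strip())


def limit_is_zero(cgroup_dir):
    """True if the limit is 0, i.e. any process placed there would be OOM-killed."""
    return read_memory_limit(cgroup_dir) == 0
```

Calling limit_is_zero() on the machine-demo.slice directory shown above would flag the problem without needing to start a container in it.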

Regards,
Daniel


Re: [systemd-devel] Error handling problems with systemd-machined

2013-07-24 Thread Daniel P. Berrange
On Wed, Jul 24, 2013 at 05:59:48PM +0200, Lennart Poettering wrote:
  When something goes wrong with the CreateMachine DBus call though all I
  ever seem to get back is  Input/output error.
  
  After strace'ing systemd-machined I find the real error
  
  recvmsg(5, {msg_name(0)=NULL, 
  msg_iov(1)=[{l\1\0\1\334\0\0\0\2\0\0\0\277\0\0\0\1\1o\0\31\0\0\0/org/freedesktop/machine1\0\0\0\0\0\0\0\6\1s\0\30\0\0\0org.freedesktop.machine1\0\0\0\0\0\0\0\0\2\1s\0
   
  \0\0\0org.freedesktop.machine1.Manager\0\0\0\0\0\0\0\0\3\1s\0\r\0\0\0CreateMachine\0\0\0\10\1g\0\fsayssusa(sv)\0\0\0\0\0\0\0\7\1s\0\6\0\0\0:1.130\0\0\t\0\0\0lxc-busy2\0\0\0\20\0\0\0\335\247\271G\10F\27Y(s\0177]\367\327\353\v\0\0\0libvirt-lxc\0\t\0\0\0container\0\0\0\210:\0\0\0\0\0\0\0\0\0\0\204\0\0\0\0\0\0\0\5\0\0\0Slice\0\1s\0\0\0\0\16\0\0\0/machine.slice\0\0\0\0\0\0\r\0\0\0CPUAccounting\0\1b\0\0\0\0\1\0\0\0\0\0\0\0\21\0\0\0BlockIOAccounting\0\1b\0\0\0\0\1\0\0\0\20\0\0\0MemoryAccounting\0\1b\0\1\0\0\0,
   2048}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 
  428
  sendmsg(5, {msg_name(0)=NULL, 
  msg_iov(2)=[{l\1\0\1D\1\0\0\n\0\0\0\255\0\0\0\1\1o\0\31\0\0\0/org/freedesktop/systemd1\0\0\0\0\0\0\0\6\1s\0\30\0\0\0org.freedesktop.systemd1\0\0\0\0\0\0\0\0\2\1s\0
   
  \0\0\0org.freedesktop.systemd1.Manager\0\0\0\0\0\0\0\0\3\1s\0\22\0\0\0StartTransientUnit\0\0\0\0\0\0\10\1g\0\7ssa(sv)\0\0\0\0,
   192}, 
  {\32\0\0\0machine-lxc\\x2dbusy2.scope\0\0\4\0\0\0fail\0\0\0\0\24\1\0\0\5\0\0\0Slice\0\1s\0\0\0\0\r\0\0\0machine.slice\0\0\0\0\0\0\0\v\0\0\0Description\0\1s\0\0\23\0\0\0Container
   lxc-busy2\0\0\0\0\0\17\0\0\0TimeoutStopUSec\0\1t\0\0 
  \241\7\0\0\0\0\0\4\0\0\0PIDs\0\2au\0\0\0\0\4\0\0\0\210:\0\0\5\0\0\0Slice\0\1s\0\0\0\0\16\0\0\0/machine.slice\0\0\0\0\0\0\r\0\0\0CPUAccounting\0\1b\0\0\0\0\1\0\0\0\0\0\0\0\21\0\0\0BlockIOAccounting\0\1b\0\0\0\0\1\0\0\0\20\0\0\0MemoryAccounting\0\1b\0\1\0\0\0,
   324}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 516
  recvmsg(5, {msg_name(0)=NULL, 
  msg_iov(1)=[{l\3\1\1+\0\0\0\265\1\0\0]\0\0\0\6\1s\0\6\0\0\0:1.126\0\0\4\1s\0$\0\0\0org.freedesktop.systemd1.InvalidName\0\0\0\0\5\1u\0\n\0\0\0\10\1g\0\1s\0\0\7\1s\0\4\0\0\0:1.1\0\0\0\0\0\0\0Unit
   name /machine.slice is not valid.\0, 2048}], msg_controllen=0, 
  msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 155
  sendmsg(3, {msg_name(0)=NULL, 
  msg_iov(4)=[{PRIORITY=3\nSYSLOG_FACILITY=4\nCODE_FILE=src/machine/machine.c\nCODE_LINE=246\nCODE_FUNCTION=machine_start_scope\nSYSLOG_IDENTIFIER=systemd-machined\n,
   144}, {MESSAGE=, 8}, {Failed to start machine scope: Unit name 
  /machine.slice is not valid., 69}, {\n, 1}], msg_controllen=0, 
  msg_flags=0}, MSG_NOSIGNAL) = 222
  sendmsg(5, {msg_name(0)=NULL, 
  msg_iov(2)=[{l\3\1\1\27\0\0\0\v\0\0\0O\0\0\0\6\1s\0\6\0\0\0:1.130\0\0\4\1s\0\\0\0\0org.freedesktop.DBus.Error.IOError\0\0\0\0\0\0\5\1u\0\2\0\0\0\10\1g\0\1s\0\0,
   96}, {\22\0\0\0Input/output error\0, 23}], msg_controllen=0, 
  msg_flags=0}, MSG_NOSIGNAL) = 119
  
  
  So machined is getting a useful error back from systemd
  
Unit name /machine.slice is not valid.
  
  and syslog'ing that error, and then sending back the dbus client a useless
  Input/output error message :-(
 
 Yeah, we really suck at handing out good errors. But usually you
 should have gotten an (equally useless) EINVAL in most cases.
 
  Once I fixed the unit name by removing the leading '/', I hit a second
  error
  
  recvmsg(5, {msg_name(0)=NULL, 
  msg_iov(1)=[{l\3\1\0014\0\0\0\301\1\0\0]\0\0\0\6\1s\0\6\0\0\0:1.126\0\0\4\1s\0#\0\0\0org.freedesktop.systemd1.UnitExists\0\0\0\0\0\5\1u\0\f\0\0\0\10\1g\0\1s\0\0\7\1s\0\4\0\0\0:1.1\0\0\0\0/\0\0\0Unit
   machine-lxc\\x2dbusy2.scope already exists.\0, 2048}], msg_controllen=0, 
  msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 164
  
Unit machine-lxc\\x2dbusy2.scope already exists
  
  But neither machinectl list nor systemctl --full shows any such machine
  or unit existing. It seems like when it reported the bogus slice name,
  it did not fully clean up the transient scope unit it created. This is
  then blocking further attempts to create the same transient scope.
 
 Hmm, that's interesting. What does systemctl status say for the unit
 in question when this happens? Could you paste?

# systemctl status 'machine-lxc\x2dbusy4.scope'
machine-lxc\x2dbusy4.scope
   Loaded: stub (/run/systemd/system/machine-lxc\x2dbusy4.scope; static)
   Active: inactive (dead)
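
As an aside, the \x2d in that unit name is just systemd's escaping of the '-' in the machine name. A rough sketch of the escaping, as I understand it from the unit naming rules (this is my own approximation, not code from systemd, and the function names are made up):

```python
import string

# Characters copied through unchanged; everything else becomes \xNN.
_SAFE = set(string.ascii_letters + string.digits + ":_.")


def systemd_escape(name):
    """Approximate systemd's unit name escaping for an arbitrary string."""
    out = []
    for i, byte in enumerate(name.encode("utf-8")):
        ch = chr(byte)
        if ch == "/":
            out.append("-")  # path separators map to '-'
        elif ch in _SAFE and not (i == 0 and ch == "."):
            out.append(ch)
        else:
            out.append("\\x%02x" % byte)  # includes '-' and '\' themselves
    return "".join(out)


def machine_scope_name(machine):
    """Build the scope unit name that machined appears to use for a machine."""
    return "machine-" + systemd_escape(machine) + ".scope"
```

So a machine called lxc-busy4 ends up as machine-lxc\x2dbusy4.scope, which is why the name needs quoting on the shell command line.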

 Kay had some issues where the kernel's release_agent wouldn't be called
 on recent kernels, but I never had issues with that...

If systemd is complaining about the bogus slice name /machine.slice,
is it possible that it has returned this error before it ever placed
the init PID into the cgroup? The kernel release_agent would never
trigger if there was no process in any cgroup to exit, and thus the
slice may not get cleaned up.

Daniel

Re: [systemd-devel] Error handling problems with systemd-machined

2013-07-24 Thread Daniel P. Berrange
On Wed, Jul 24, 2013 at 05:10:59PM +0100, Daniel P. Berrange wrote:
 On Wed, Jul 24, 2013 at 05:59:48PM +0200, Lennart Poettering wrote:
   Once I fixed the unit name by removing the leading '/', I hit a second
   error
   
   recvmsg(5, {msg_name(0)=NULL, 
   msg_iov(1)=[{l\3\1\0014\0\0\0\301\1\0\0]\0\0\0\6\1s\0\6\0\0\0:1.126\0\0\4\1s\0#\0\0\0org.freedesktop.systemd1.UnitExists\0\0\0\0\0\5\1u\0\f\0\0\0\10\1g\0\1s\0\0\7\1s\0\4\0\0\0:1.1\0\0\0\0/\0\0\0Unit
machine-lxc\\x2dbusy2.scope already exists.\0, 2048}], 
   msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 164
   
 Unit machine-lxc\\x2dbusy2.scope already exists
   
   But neither machinectl list nor systemctl --full shows any such machine
   or unit existing. It seems like when it reported the bogus slice name,
   it did not fully clean up the transient scope unit it created. This is
   then blocking further attempts to create the same transient scope.
  
  Hmm, that's interesting. What does systemctl status say for the unit
  in question when this happens? Could you paste?
 
 # systemctl status 'machine-lxc\x2dbusy4.scope'
 machine-lxc\x2dbusy4.scope
Loaded: stub (/run/systemd/system/machine-lxc\x2dbusy4.scope; static)
Active: inactive (dead)
 
  Kay had some issues where the kernel's release_agent wouldn't be called
  on recent kernels, but I never had issues with that...
 
 If systemd is complaining about the bogus slice name /machine.slice,
 is it possible that it has returned this error before it ever placed
 the init PID into the cgroup? The kernel release_agent would never
 trigger if there was no process in any cgroup to exit, and thus the
 slice may not get cleaned up.

FYI, I can reproduce this with systemd-nspawn too


# systemd-nspawn --slice /machine.slice -D /mnt/demo/ -M foo  /bin/sh
Spawning namespace container on /mnt/demo (console is /dev/pts/5).
Init process in the container running as PID 32057.
Failed to register machine: Input/output error
Container failed with error code 251.

Run that multiple times and you'll see that the 2nd time machined gets the
error about the pre-existing unit.
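
Until the error reporting improves, a client can at least catch the bad slice name itself before making the call. A minimal sketch, assuming a simplified view of which characters systemd accepts in unit names (the character set and the function name here are my approximation, not taken from systemd):

```python
import string

# Simplified approximation of the characters allowed in a unit name.
_UNIT_CHARS = set(string.ascii_letters + string.digits + ":-_.\\")


def check_slice_name(name):
    """Return `name` if it looks usable, else raise ValueError saying why."""
    if "/" in name:
        raise ValueError("slice name %r must not contain '/'" % name)
    if not name.endswith(".slice") or name == ".slice":
        raise ValueError("slice name %r must end in '.slice'" % name)
    bad = sorted(set(name) - _UNIT_CHARS)
    if bad:
        raise ValueError("slice name %r has invalid characters %r" % (name, bad))
    return name
```

That way the caller gets a specific message for /machine.slice instead of the generic Input/output error.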


Daniel


Re: [systemd-devel] udevadm settle hangs due to veths in seperate network namespaces

2013-07-12 Thread Daniel P. Berrange
On Fri, Jul 12, 2013 at 06:00:42PM +0200, Kay Sievers wrote:
 On Fri, Jul 12, 2013 at 5:00 PM, Daniel P. Berrange berra...@redhat.com 
 wrote:
  On Fri, Jul 12, 2013 at 02:51:10PM +0100, Daniel P. Berrange wrote:
  We're hitting a problem in libvirt where 'udevadm settle' will get stuck
  in a loop until it eventually times out. Eventually we realized this
  happens when we have any LXC containers active with veth devices in a
  separate network namespace.
 
  Incidentally, I recall reading something by (iirc) Lennart saying that
  apps really should use 'udevadm settle' at all.\
 
 You mean *not*, I guess.

Oops, yes.

 There are still valid uses of settle for command line tools, and that
 will be likely valid in the future too. There is no simple replacement
 for this barrier to be implemented by simple command line tools.
 Letting them subscribe to hotplug would ask for too much in quite a
 few cases.
 
 No advanced subsystem or service, though, should rely on or be modelled
 around settle and assume that everything is there now; tools
 should subscribe to udev events and after that enumerate the current
 devices.
 
 Things that pull in settle at bootup are kind of broken; that is the
 aspect of settle you heard Lennart rightfully complaining about, I
 guess.
 
  Libvirt uses it in a
  couple of places, all related to code which obtains lists of storage
  devices
 
 Which makes sense according to the current state of affairs. Storage
 tools are only slowly catching up with the reality of devices coming
 and going all the time on today's systems. They get fixed, and things
 look at least better today than they have been, but settle is still
 needed for some operations.
 
   - After adding a disk partition in parted, we use it to wait for
 the /dev/sdXXNNN device nodes to all show up
 
 Primary device node creation (not symlinks) has been synchronous for a
 couple of years. devtmpfs does that for us. The ioctl to add a partition
 table entry or re-read the partition table will not return until devtmpfs
 has created the device nodes.
 
 The udev symlinks though might only be available after a settle call.
 
   - After logging into an iSCSI target with iscsiadm, we use it to
     wait for all the /dev/sdXXX device nodes associated with the
     iSCSI target to appear.
 
   - After triggering a SCSI HBA rescan via sysfs, we use it to wait
     for all the /dev/sdXXX device nodes associated with the SCSI HBA
     to appear
 
   - After creating an NPIV virtual HBA via sysfs, we use it to wait
     for all the /dev/sdXXX device nodes associated with the vHBA
     to appear
 
 As said, this should all be covered on more recent systems.
 
   - After activating an LVM volume group, we use it to wait for all
 the /dev/VGNAME/ device nodes to appear
 
   - After deleting an LVM  volume we use it to wait for the device
 node to be removed
 
   - After adding an LVM  volume we use it to wait for the device
 node to be added
 
 LVM is a story on its own, it's pretty complex, and it slowly gets
 fixed over time. With the very recent changes it might integrate nicer
 now. I guess there are still situations though where settle is needed
 and the simplest solution.
 
 All of that applies only to the command line tools again, not for
 bootup related services, or full-blown storage management services. It
 is not ok for them to rely on settle.
 
  You can see a pattern there - after doing some action related to
  storage, we need to synchronize wrt the creation/deletion of device
  nodes in /dev, otherwise we miss out LUNs when we scan for the list
  of device nodes associated with a HBA/VolGroup/etc. Any suggestions
  for alternative techniques / approaches here ?
 
 I think it's fine and is needed for libvirt to use settle. At least as
 long as it calls the command line tools. There is no generally
 available storage interface on Linux which would solve all these
 problems for libvirt, and I don't think you should declare these
 problems as libvirt problems. Using settle to get a barrier for the
 tools you need to use which themselves cannot handle async setup and
 hotplug sounds fine to me.
 
 Many of the issues though might already be history with devtmpfs, at
 least when the primary nodes (and not the symlinks) are used.

Unfortunately we do make use of the /dev/disk/by- paths in order
to get paths which are stable across hosts and/or reboots, but not
always. So perhaps I'll look at avoiding use of 'settle' in cases
where we don't need the symlinks and the commands are synchronous.
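
Roughly what I have in mind, as a sketch only: decide per path whether a settle is needed at all, and only shell out for the udev-managed symlinks. The prefix list and function names are illustrative assumptions, not libvirt code.

```python
import subprocess


def needs_settle(path, symlink_prefixes=("/dev/disk/",)):
    """True if `path` is a udev-created symlink rather than a primary node."""
    return any(path.startswith(prefix) for prefix in symlink_prefixes)


def settle_if_needed(path, timeout=30):
    """Run 'udevadm settle' only when the caller depends on a udev symlink."""
    if not needs_settle(path):
        return False  # primary node: devtmpfs has already created it
    subprocess.run(["udevadm", "settle", "--timeout=%d" % timeout], check=False)
    return True
```

Primary nodes such as /dev/sda1 skip the barrier entirely, while anything under /dev/disk/ still waits for udev to finish processing its queue.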

Daniel

Re: [systemd-devel] [HEADSUP] cgroup changes

2013-06-24 Thread Daniel P. Berrange
On Mon, Jun 24, 2013 at 03:27:15PM +0200, Lennart Poettering wrote:
 On Sat, 22.06.13 15:19, Andy Lutomirski (l...@amacapital.net) wrote:
 
  1. I put the entire world into a separate, highly constrained
  cgroup.  My real-time code runs outside that cgroup.  This seems to be
  exactly what slices are for, but I need kernel threads to go in to
  the constrained cgroup.  Will systemd support this?
 
 I am not sure whether the ability to move kernel threads into cgroups
 will stay around at all, from the kernel side. Tejun, can you comment
 on this?

KVM uses the vhost_net device for accelerating guest network I/O
paths. This device creates a new kernel thread on each open(),
and that kernel thread is attached to the cgroup associated
with the process that open()d the device.

If systemd allows for a process to be moved between cgroups, then
it must also be capable of moving any associated kernel threads to
the new cgroup at the same time. This co-placement of vhost-net
threads with the KVM process is critical for I/O performance
of KVM networking.
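
To illustrate what moving the associated kernel threads involves, here is a rough sketch. It assumes the vhost worker threads are named "vhost-<pid of the opener>" and that the cgroup v1 "tasks" file is used for placement; both are assumptions on my part, and the function names are made up.

```python
import os


def find_vhost_threads(owner_pid, proc_root="/proc"):
    """Return PIDs whose comm is 'vhost-<owner_pid>' (assumed naming scheme)."""
    wanted = "vhost-%d" % owner_pid
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                if f.read().strip() == wanted:
                    found.append(int(entry))
        except OSError:
            continue  # the task exited while we were scanning
    return sorted(found)


def move_to_cgroup(pids, cgroup_dir):
    """Write each PID to the cgroup v1 'tasks' file, one write per PID."""
    for pid in pids:
        with open(os.path.join(cgroup_dir, "tasks"), "w") as f:
            f.write("%d\n" % pid)
```

Whatever moves the KVM process would need to do the equivalent of move_to_cgroup(find_vhost_threads(pid), new_dir) in the same step, otherwise the threads are left behind in the old group.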

Regards,
Daniel


[systemd-devel] systemd-nspawn/LXC containers pam login failure

2013-05-09 Thread Daniel P. Berrange
Following the suggestion in the systemd-nspawn manpage I populated
a mini Fedora 19 chroot, on a Fedora 19 host

  # yum -y --releasever=19 --nogpg --installroot=/srv/mycontainer \
--disablerepo='*' --enablerepo=fedora \
install systemd passwd yum fedora-release vim-minimal
  # chroot /srv/mycontainer passwd
  # systemd-nspawn -bD /srv/mycontainer

Systemd boots up nicely and presents a login prompt, but it is impossible
to actually log in, with PAM always denying the attempts.

Debugging this, there seem to be two issues

 1. pam_loginuid.so tries to write to /proc/self/loginuid but is denied
by the kernel.

My kernel has CONFIG_AUDIT_LOGINUID_IMMUTABLE=y which means once a
loginuid is set (in this case from my ssh session into the host),
it can't be changed (eg by the 'login' process inside the container).
From the KConfig comment, this appears to have been a new feature
built explicitly for systemd based hosts.

The loginuid appears to be inherited across fork/exec so, AFAICT,
the only way to avoid this is to spawn the container from something
which does not already have a loginuid set, eg systemd itself or
some other process not associated with a login session.

Not being able to spawn containers from a login session on the host
is kind of a PITA for development / debugging :-(

Seems we need to find a way to have systemd-nspawn ensure that the
'init' process inside the container does not have a 'loginuid' set,
even if the thing starting the container does. On the flipside, it
seems this would violate the kernel security design for this feature ?

If that were the case, then the pam_loginuid module might need to
be made a no-op inside containers.

 2. The audit_log_acct_message() method which is called by pretty
much any PAM module returns EPERM

There is no actual syscall returning EPERM here. The EPERM
appears to be coming back inside the netlink reply message
from the kernel audit subsystem. Since pretty much every PAM
module sends audit messages, this causes them all to return
fatal errors, failing the login attempt

The _pam_audit_writelog() method does have code to ignore
EPERM, but it only does so if  'getuid() != 0'. The container
login process has uid == 0, so EPERM is treated as fatal. The
easy (but not necessarily correct) fix is to change

diff -rup Linux-PAM-1.1.6.orig/libpam/pam_audit.c Linux-PAM-1.1.6.new/libpam/pam_audit.c
--- Linux-PAM-1.1.6.orig/libpam/pam_audit.c 2012-08-15 12:08:43.0 +0100
+++ Linux-PAM-1.1.6.new/libpam/pam_audit.c  2013-05-09 10:17:48.679403471 +0100
@@ -46,7 +46,7 @@ _pam_audit_writelog(pam_handle_t *pamh,
   pamh->audit_state |= PAMAUDIT_LOGGED;
 
   if (rc < 0) {
-      if (rc == -EPERM && getuid() != 0)
+      if (rc == -EPERM)
           return 0;
       if (errno != old_errno) {
           old_errno = errno;

but I'd rather like to understand why the kernel audit netlink
layer is replying with EPERM in the first place. The container
has CAP_AUDIT_WRITE capability.

Instead of removing the 'getuid() != 0' check, another option
would be to augment it to also check /proc/1/environ for any
'container' env variable.
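
The /proc/1/environ check could look something like the following sketch. It assumes the container manager sets a 'container' variable for PID 1 (for example container=systemd-nspawn); the helper names are made up for illustration and this is not a proposed PAM patch.

```python
def parse_environ(data):
    """Parse the NUL-separated KEY=VALUE bytes found in /proc/<pid>/environ."""
    env = {}
    for item in data.split(b"\0"):
        if b"=" in item:
            key, _, value = item.partition(b"=")
            env[key.decode("utf-8", "replace")] = value.decode("utf-8", "replace")
    return env


def container_type(data):
    """Return the value of the 'container' variable, or None if it is unset."""
    return parse_environ(data).get("container")


def running_in_container(path="/proc/1/environ"):
    """True if PID 1 advertises a container manager via its environment."""
    with open(path, "rb") as f:
        return container_type(f.read()) is not None
```

Note that reading /proc/1/environ needs sufficient privileges, which the login process running as uid 0 inside the container should have.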


If I remove the pam_loginuid module and also apply the above audit
patch to PAM, then I can successfully log in to a container launched
by systemd-nspawn. It would obviously be preferable to figure out
what needs to be done to make this work out of the box though.
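
For reference, whether a process already carries a loginuid can be checked with a few lines. This sketch assumes the kernel reports an unset loginuid as 4294967295, i.e. (uid_t)-1; the function names are illustrative.

```python
UNSET_LOGINUID = 4294967295  # (uid_t)-1, meaning no login session assigned


def loginuid_is_set(raw):
    """True if the text read from a loginuid file names a real login UID."""
    return int(raw.strip()) != UNSET_LOGINUID


def read_loginuid(path="/proc/self/loginuid"):
    """Return the raw contents of a loginuid file."""
    with open(path) as f:
        return f.read()
```

If loginuid_is_set(read_loginuid()) is true in the shell that launches systemd-nspawn, the container's init inherits that value and pam_loginuid will fail as described in point 1.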

Regards,
Daniel


Re: [systemd-devel] systemd-nspawn/LXC containers pam login failure

2013-05-09 Thread Daniel P. Berrange
On Thu, May 09, 2013 at 03:32:09PM +0200, Lennart Poettering wrote:
 On Thu, 09.05.13 11:38, Daniel P. Berrange (berra...@redhat.com) wrote:
 
  Following the suggestion in the systemd-nspawn manpage I populated
  a mini Fedora 19 chroot, on a Fedora 19 host
  
# yum -y --releasever=19 --nogpg --installroot=/srv/mycontainer \
  --disablerepo='*' --enablerepo=fedora \
  install systemd passwd yum fedora-release vim-minimal
# chroot /srv/mycontainer passwd
# systemd-nspawn -bD /srv/mycontainer
  
  Systemd boots up nicely and presents a login prompt, but it is impossible
  to actually log in, with PAM always denying the attempts.
 
 Yeah, this is a known problem. We generally suggest to turn off audit
 by booting with audit=0 on the kernel cmdline for now:
 
 https://fedoraproject.org/wiki/Features/SystemdLightweightContainers
 
 I guess I should add a comment about this to nspawn's man page too.
 
 The audit folks are working on adding container awareness to the audit
 subsystem in the kernel (which basically means that audit messages carry
 the outside PID of PID1 of the container, so that auditd can track this
 properly). Currently audit is completely confused by PID
 namespacing. Also, we want them to fix for us that opening a PID
 namespace resets loginuid in the container to -1. We have discussed this
 several times with them, and they wanted to do something about it, but so
 far nothing happened. But we'll have another meeting about this next
 week, so I can put some pressure on this.

Did you file any BZs against the kernel for this ?  If not I'll sort
out some BZs to track these problems.

   2. The audit_log_acct_message() method which is called by pretty
  much any PAM module returns EPERM
  
  There is no actual syscall returning EPERM here. The EPERM
  appears to be coming back inside the netlink reply message
  from the kernel audit subsystem. Since pretty much every PAM
  module sends audit messages, this causes them all to return
  fatal errors, failing the login attempt
  
  The _pam_audit_writelog() method does have code to ignore
  EPERM, but it only does so if  'getuid() != 0'. The container
  login process has uid == 0, so EPERM is treated as fatal. The
  easy (but not necessarily correct) fix is to change
  
  diff -rup Linux-PAM-1.1.6.orig/libpam/pam_audit.c Linux-PAM-1.1.6.new/libpam/pam_audit.c
  --- Linux-PAM-1.1.6.orig/libpam/pam_audit.c 2012-08-15 12:08:43.0 +0100
  +++ Linux-PAM-1.1.6.new/libpam/pam_audit.c  2013-05-09 10:17:48.679403471 +0100
  @@ -46,7 +46,7 @@ _pam_audit_writelog(pam_handle_t *pamh,
     pamh->audit_state |= PAMAUDIT_LOGGED;
  
     if (rc < 0) {
  -      if (rc == -EPERM && getuid() != 0)
  +      if (rc == -EPERM)
             return 0;
         if (errno != old_errno) {
             old_errno = errno;
 
 I tried to get a patch like this into PAM actually, but Steve (of
 course) said nononono! He's really married to the idea that audit breaks
 everything on any kind of error... This is kinda sad though, as
 otherwise this would have allowed us to turn off auditing in the
 container completely by removing CAP_AUDIT_CONTROL and CAP_AUDIT_WRITE
 of the container...

I feared that might be the response from PAM maintainers :-(

 I guess libvirt-lxc is in a slightly better situation here regarding
 audit, since it never tries to spawn a container as child of a login
 session, hence loginuid will not be sealed off yet...

If libvirtd has been started from systemd then yes. Of course during
development I just run libvirtd from my source tree directly, so
still hit the problem :-)

Daniel


Re: [systemd-devel] systemd-nspawn/LXC containers pam login failure

2013-05-09 Thread Daniel P. Berrange
On Thu, May 09, 2013 at 03:32:09PM +0200, Lennart Poettering wrote:
 On Thu, 09.05.13 11:38, Daniel P. Berrange (berra...@redhat.com) wrote:
 
  Following the suggestion in the systemd-nspawn manpage I populated
  a mini Fedora 19 chroot, on a Fedora 19 host
  
# yum -y --releasever=19 --nogpg --installroot=/srv/mycontainer \
  --disablerepo='*' --enablerepo=fedora \
  install systemd passwd yum fedora-release vim-minimal
# chroot /srv/mycontainer passwd
# systemd-nspawn -bD /srv/mycontainer
  
  Systemd boots up nicely and presents a login prompt, but it is impossible
  to actually log in, with PAM always denying the attempts.
 
 Yeah, this is a known problem. We generally suggest to turn off audit
 by booting with audit=0 on the kernel cmdline for now:
 
 https://fedoraproject.org/wiki/Features/SystemdLightweightContainers
 
 I guess I should add a comment about this to nspawn's man page too.
 
 The audit folks are working on adding container awareness to the audit
 subsystem in the kernel (which basically means that audit messages carry
 the outside PID of PID1 of the container, so that auditd can track this
 properly). Currently audit is completely confused by PID
 namespacing. Also, we want them to fix for us that opening a PID
 namespace resets loginuid in the container to -1. We have discussed this
 several times with them, and they wanted to do something about it, but so
 far nothing happened. But we'll have another meeting about this next
 week, so I can put some pressure on this.

Quite by accident I discovered that if you tell systemd-nspawn to
create a new network namespace, you no longer hit the EPERM issues
with sending audit messages. This is because the kernel only listens
for audit messages in the initial network namespace. libaudit catches
ECONNREFUSED and turns it into a no-op returning success, meaning that
PAM now works.

So if you use systemd-nspawn --private-network, and make sure it
is launched by systemd itself and not from your shell, then the standard
PAM config will 'just work'.
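
To spell out why the two error codes lead to such different outcomes, here is a small model of the behaviour described in this thread. It is not the libaudit or PAM source; the function and parameter names are made up, and it folds the libaudit ECONNREFUSED handling and the PAM EPERM check into one function purely for illustration.

```python
import errno


def audit_result(send, uid):
    """Model the result PAM ends up with after trying to send an audit message.

    `send` is any callable that raises OSError on failure.
    Returns 0 for success or a negative errno for a fatal error.
    """
    try:
        send()
    except OSError as exc:
        if exc.errno == errno.ECONNREFUSED:
            return 0  # nobody listening for audit messages: treated as a no-op
        if exc.errno == errno.EPERM and uid != 0:
            return 0  # unprivileged callers are allowed to fail silently
        return -exc.errno  # anything else, including EPERM as root, is fatal
    return 0
```

With a private network namespace the send fails with ECONNREFUSED and the login proceeds; in the shared namespace root gets EPERM, which is fatal.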

Daniel


Re: [systemd-devel] Systemd and cgroups

2013-04-10 Thread Daniel P. Berrange
On Wed, Apr 10, 2013 at 12:43:56PM +0300, Kevin Wilson wrote:
 Hello,
 I have a question about systemd and cgroups:
 mount | grep cgroups shows that only one entry has name=systemd
 and is mounted on /sys/fs/cgroup/systemd (see below the full output
 of mount | grep cgroups).
 
 Is it true that all other cgroup entries shown by mount | grep cgroups
 were not mounted by systemd (and may be unmounted without directly
 causing problems in systemd)?

If some 3rd party application has mounted cgroup controllers before
systemd starts, systemd will honour that setup. If they were not already
mounted, then systemd itself will mount all the resource controllers
that are compiled into the kernel.

Systemd will only actually create sub-dirs in those controllers
that are listed in the 'DefaultControllers' setting of systemd.conf,
which defaults to 'cpu'.

Daniel


Re: [systemd-devel] ExecRestart

2012-12-20 Thread Daniel P. Berrange
On Wed, Dec 19, 2012 at 11:46:13PM +0100, Lennart Poettering wrote:
 On Wed, 28.11.12 22:41, Brandon Black (blbl...@gmail.com) wrote:
 
  The daemon's fast restart code does all of the expensive startup
  operations in the new daemon first (e.g. parsing large data input), then
  signals the existing daemon to shut itself down, waits for it to release
  its critical resources (e.g. sockets, pidfile), and finally takes over
  those resources and finishes starting itself.  Basically it's using the
  overlap to avoid long service downtimes during that initial parsing phase
  (and if that parsing fails, it leaves the old daemon running to boot).

[snip]

 Or in other words:
 
 I am pretty sure that we should not alter the current restart logic, and
 should not introduce ExecRestart=. However, we really should think about
 either introducing ExecReexec= or somehow making ExecReload= useful for
 reexec-style reloading, too. But I haven't made my mind up on this, how
 this could look like.

FWIW, as previously mentioned, I'd love to see an explicitly supported
way to trigger a re-exec of a daemon. Currently I'm just relying on the
ability to send a custom signal to libvirt's virtlockd daemon. The problem
is that sysadmins would need to learn a different signal number for each
project's daemon. So I think there's value to admins in having a standard
way to trigger this via systemctl.  Personally I think this should also be
separate from ExecReload which is merely used to refresh configuration
files.

Daniel


Re: [systemd-devel] [PATCH] licence: remove references to old FSF address

2012-12-17 Thread Daniel P. Berrange
On Sun, Dec 16, 2012 at 10:23:23PM +, Sami Kerola wrote:
 Bug: https://bugs.freedesktop.org/show_bug.cgi?id=57206

 diff --git a/src/gudev/gudevclient.h b/src/gudev/gudevclient.h
 index b425d03..23bfce6 100644
 --- a/src/gudev/gudevclient.h
 +++ b/src/gudev/gudevclient.h
 @@ -3,19 +3,18 @@
 - * You should have received a copy of the GNU Lesser General Public
 - * License along with this library; if not, write to the
 - * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
 - * Boston, MA 02111-1307, USA.
 + * You should have received a copy of the GNU Library General Public
 + * License along with this library; if not, write to the Free Software
 + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301 USA

FWIW, in libvirt we decided that chasing the FSF's office relocations
was wasting everyone's time, so switched the last paragraph to link
to a web URL instead which will hopefully be more permanent...

   * You should have received a copy of the GNU Lesser General Public
   * License along with this library.  If not, see
   * http://www.gnu.org/licenses/.

Regards,
Daniel


Re: [systemd-devel] I have switched libvirt-sandbox containers to use multi-user.target

2012-11-20 Thread Daniel P. Berrange
On Tue, Nov 20, 2012 at 09:50:39AM -0500, Daniel J Walsh wrote:
 On 11/20/2012 09:36 AM, Daniel P. Berrange wrote:
  On Tue, Nov 20, 2012 at 08:52:51AM -0500, Daniel J Walsh wrote:
  
  On 11/19/2012 07:41 PM, Lennart Poettering wrote:
  On Fri, 16.11.12 15:06, Daniel J Walsh (dwa...@redhat.com) wrote:
  
  Isn't there a way to shut off systemV init scripts altogether, it
  just so happens that we hit one on my machine.  But in the field a
  customer could have an init script and then setup containers and
  systemd will attempt to start it. I want a way to say don't run SysV
  Init scripts altogether.
  
  Hmm, there is currently no option for that.
  
  A semi-dirty trick might be to over-bind-mount /etc/rc.d with something
   empty?
  
  Lennart
  
  What run levels would get executed?  I would prefer to mount over the
  empty run levels and allow an admin to be able to turn on a SysV init
  script.
  
  I'm not convinced we need to support that explicitly. If an admin wants to
  support execution of some ad-hoc script they can easily make a system unit
  that uses the various ExecXXX directives to invoke their arbitrary shell
  scripts.
  
  Daniel
  
 
 
 I was thinking more that if they wanted to execute
 
 chkconfig within the container, the right thing would happen, which I get by
 mounting empty dirs over /etc/rc.d/rc.[0-6]d
 
 Similar to us allowing the admin to execute
 
 systemctl enable foobar.service
 
 within the container.

IMHO supporting legacy commands like chkconfig is a non-goal for
libvirt-sandbox. It is brand new functionality designed around
closely integrating with systemd, and I don't think we should
pollute it with code for legacy / dying init systems.

Daniel


[systemd-devel] Client logging to journald without libsystemd-journal.so

2012-11-08 Thread Daniel P. Berrange
I recently introduced support for libvirt logging to journald. Initially I
had intended to use libsystemd-journal.so for the logging, however, in the
end I made libvirt communicate with journald directly using sendmsg().

First, I wanted to confirm two interface stability issues.

 - Is the logging protocol between the client app and journald considered
   to be ABI stable?
 - Is the /run/systemd/journal/socket path considered to be stable?

Second, I wanted to mention why we couldn't use libsystemd-journal.so
ourselves.

The first problem is that there is no sd_journal_open/close API call
to set up the file descriptor. The library uses a one-time atomic
global initializer to open its file descriptor, which is then cached
until exit() or execve() (it has SOCK_CLOEXEC set).

The problem is that when libvirt does fork() to create client processes,
one of the things it does is to iterate from 0 - sysconf(_SC_OPEN_MAX),
closing every file descriptor, except those in its whitelist.

Now I know there is the school of thought that says this is a bad idea,
and that all code should correctly set O_CLOEXEC for all file descriptors.
While a nice idea in theory, unfortunately this is not really practical
for us in reality because there are too many 3rd party libraries we use
which don't do this correctly. Not least because traditional UNIX APIs
don't allow for atomically creating an FD with O_CLOEXEC set. So we're
stuck with closing all FDs after fork() for a good long time yet.
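
As a rough illustration of the fd-closing step described above (a
sketch only, not libvirt's actual code: the names fd_is_kept() and
close_fds_except() are made up for this example, and the fallback
limit of 1024 is an assumption), the child side could look like this:

```c
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

/* Return true if fd appears in the caller-supplied keep list. */
static bool fd_is_kept(int fd, const int *keep, size_t nkeep) {
    for (size_t i = 0; i < nkeep; i++)
        if (keep[i] == fd)
            return true;
    return false;
}

/* Intended to run in the child between fork() and execve(): close every
 * descriptor in [0, _SC_OPEN_MAX) that is not explicitly whitelisted. */
void close_fds_except(const int *keep, size_t nkeep) {
    long max = sysconf(_SC_OPEN_MAX);

    if (max < 0)
        max = 1024; /* assumed fallback when the limit is indeterminate */

    for (int fd = 0; fd < max; fd++)
        if (!fd_is_kept(fd, keep, nkeep))
            close(fd); /* EBADF on unused slots is expected and ignored */
}
```

Any descriptor a library opened and cached behind the application's
back, such as a journal socket, is closed by this loop unless the
application can find out its number and add it to the keep list.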

There are two things libsystemd-journal could do to help apps in this
scenario. Either provide a way for apps to query the cached journal
logging file descriptor, allowing them to explicitly leave it open.
Alternatively provide explicit API to call to re-open the FD, which
they could call after fork(). Possibly other solutions too, like
requiring an explicit close/open like syslog though that has its own
set of problems.

The second blocker problem was figuring out a way to send log messages
using only APIs declared async-signal safe. Again this is so that we
can safely send log messages in between fork() and execve(), which only
permits async-signal-safe APIs. The sd_journal_send() API can't be
used since it relies on vasprintf() which can allocate using malloc.
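
To illustrate the kind of restriction this imposes (a sketch under
assumptions, not code taken from libvirt; format_ulong() is a
hypothetical helper), numeric fields such as a PID have to be
formatted without the printf family. One way is to build the digits
by hand into a caller-provided buffer, so nothing is allocated:

```c
#include <stddef.h>

/* Write the decimal form of v into buf, NUL-terminated, without calling
 * any libc function. Returns the number of digits written, or 0 if buf
 * is too small to hold them plus the terminator. */
static size_t format_ulong(char *buf, size_t len, unsigned long v) {
    char tmp[3 * sizeof(v)]; /* more than enough digits for any value */
    size_t n = 0;

    /* Collect digits least-significant first. */
    do {
        tmp[n++] = (char) ('0' + v % 10);
        v /= 10;
    } while (v > 0);

    if (n + 1 > len)
        return 0;

    /* Reverse into the output buffer. */
    for (size_t i = 0; i < n; i++)
        buf[i] = tmp[n - 1 - i];
    buf[n] = '\0';
    return n;
}
```

The resulting string can then be used directly as the value of a log
field.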

The sd_journal_sendv() API is pretty close to what we'd want, but
the way you have to format the iovec doesn't quite work. IIUC, it
requires that each iovec contains a single formatted log item
string KEY=VALUE. Populating data in such a way is inconvenient
for libvirt. For libvirt it was easier for us to use two iovec
elements for each log item, KEY= and VALUE, so that we can
avoid doing the data copy implied by filling a single string with
KEY=VALUE.

The upshot is that we ended up filling an iovec[] ourselves, taking
care of escaping '\n', and then directly sending it to journald.
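
A minimal sketch of that approach, for illustration only: it is not
the libvirt implementation, and journal_iov_add() and
journal_send_iov() are names invented here. It assumes the simple
newline-terminated "KEY=VALUE" form of the wire format and simply
rejects values containing '\n', which need a different,
length-prefixed encoding that is not shown. Each field takes three
iovec entries (key, value, newline), so the value is never copied:

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <sys/un.h>
#include <unistd.h>

/* Append one field as three iovec entries: "KEY=", the value, and "\n".
 * key_eq must already end in '='. Returns the new entry count, or -1 if
 * the array is full or the value contains a newline. */
static int journal_iov_add(struct iovec *iov, int n, int max,
                           const char *key_eq, const char *value) {
    if (n + 3 > max || strchr(value, '\n'))
        return -1;
    iov[n].iov_base = (void *) key_eq;
    iov[n].iov_len = strlen(key_eq);
    iov[n + 1].iov_base = (void *) value;
    iov[n + 1].iov_len = strlen(value);
    iov[n + 2].iov_base = (void *) "\n";
    iov[n + 2].iov_len = 1;
    return n + 3;
}

/* Send the assembled entry as a single datagram to the journald socket.
 * Opening a socket per message keeps the sketch short; a real logger
 * would open it once and reuse it. */
int journal_send_iov(const struct iovec *iov, int n) {
    struct sockaddr_un sa = { .sun_family = AF_UNIX };
    struct msghdr mh = { 0 };
    int fd, r;

    strcpy(sa.sun_path, "/run/systemd/journal/socket");
    fd = socket(AF_UNIX, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    mh.msg_name = &sa;
    mh.msg_namelen = sizeof(sa);
    mh.msg_iov = (struct iovec *) iov;
    mh.msg_iovlen = n;
    r = sendmsg(fd, &mh, MSG_NOSIGNAL) < 0 ? -1 : 0;
    close(fd);
    return r;
}
```

For example, adding "MESSAGE=" with value "hello" and then "PRIORITY="
with value "6" yields six iovec entries that go out in one sendmsg()
call.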

As long as the wire format and UNIX socket path are considered ABI
stable by systemd devs, I'm fairly happy with the libvirt code as
it is. I just mention these issues in case you think it is desirable
to add further libsystemd-journal.so APIs to make life easier for
other applications doing logging in the future.

Regards,
Daniel


Re: [systemd-devel] Client logging to journald without libsystemd-journal.so

2012-11-08 Thread Daniel P. Berrange
On Thu, Nov 08, 2012 at 04:56:03PM -0500, Colin Walters wrote:
 Sorry about the tone in the last message, it was unnecessary.  There's
 just some history here dating from the libxml2 days...

No worries, no offence taken :-)

 On Thu, 2012-11-08 at 17:38 +0100, Daniel P. Berrange wrote:
 
  Yeah, we've looked at / borrowed code from GLib in a few cases
  now, notably threads and atomic ops. I've previously looked at
  GLib's process spawning code, but didn't notice this particular
  item. Originally we did have an API fairly similar to the
  g_spawn_async_with_pipes API, but it proved fairly cumbersome
  to use, so we've put together a much more flexible API now [1].
 
 Yeah, I've been working on a new one:
 https://bugzilla.gnome.org/show_bug.cgi?id=672102
 
  Possible, though I feel it is a little nasty, not least because
  when journald then uses SCM_CREDS to find out the sender identity it
  will be getting the wrong pid and potentially wrong uid/gid too.
 
 This is an interesting case...conceptually it's true that it's a new
 pid, but I think it's a lot more useful usually what *code* it's
 running; ordinarily, that'd be an executable.  But here we're running
 code from the parent just before executing a new child. 
 
 Pretty much any error in fork-before-exec should be fatal, right?  So in
 the case where you're logging an error (e.g. setuid()
 failed, prctl() failed), the pid is going to be irrelevant anyways
 since the process will soon exit.

Hmm, good point.

 The uid/gid - yes, but on the other hand, the uid associated with
 the message will be the one that's conceptually in control at the
 moment.

It is a tricky question really. If the code failed because it did
not have permission to open the file, and the log contains the uid
of the parent process, this could mislead the person analysing. At
the same time I see your point that the uid/gid/pid should refer to
the process in control which is the parent.

 Regardless though of the approach taken (log from parent, log from
 forked-before-exec'd child), it'd probably be good to include some
 standard structured field saying that the code is being run in a child
 setup.  PREEXEC=1? 

If the log is sent from the child, you'd really want to also include
the PID of the parent process, to allow the log messages to be directly
correlated. A shame SCM_CREDS doesn't directly provide the parent-PID
too.

Daniel


Re: [systemd-devel] [PATCH] shutdown: do reboot() for openvz container

2012-09-13 Thread Daniel P. Berrange
On Thu, Sep 13, 2012 at 12:30:00AM +0200, Lennart Poettering wrote:
 On Thu, 13.09.12 00:25, Kay Sievers (k...@vrfy.org) wrote:
 
  
  On Wed, Sep 12, 2012 at 11:54 PM, Lennart Poettering
  lenn...@poettering.net wrote:
   On Wed, 12.09.12 11:51, Daniel P. Berrange (berra...@redhat.com) wrote:
  
   NB when libvirt starts an LXC container, it first checks to see whether
   the kernel has the container aware reboot() support. If it does not,
   then it removes CAP_SYS_REBOOT from the container, to prevent any
   accidental whole system reboot. The sf.net LXC tools do the same thing.
  
   How do you check that? A version check or can you actually detect this
   feature explicitly?
  
  Returning EINVAL is also an easy way to check if this feature is supported
  by the kernel when invoking another 'reboot' option like CAD.
  
  http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=cf3f89214ef6a33fad60856bc5ffd7bb2fc4709b
 
 But that's from inside the container. But LXC would need that from
 outside the container?

Oh you just need a quick clone() + reboot() pair to figure that out. See
the lxcContainerHasReboot() and lxcContainerRebootChild() methods in
the libvirt lxc_container.c file:

  
http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/lxc/lxc_container.c;hb=HEAD#l107

Regards,
Daniel


Re: [systemd-devel] [PATCH] shutdown: do reboot() for openvz container

2012-09-12 Thread Daniel P. Berrange
On Wed, Sep 12, 2012 at 02:47:48PM +0400, Kir Kolyshkin wrote:
 On 09/11/2012 05:24 AM, Lennart Poettering wrote:
 On Fri, 24.08.12 16:22, Kir Kolyshkin (k...@openvz.org) wrote:
 
 Proper handling of reboot() syscall issued from the inside of a container
 was always supported by OpenVZ kernels. More to say, OpenVZ relies on the fact
 that container calls reboot in order to distinguish between shutdown and
 reboot-- in the latter case container is being restarted.
 
 This patch brings the reboot() back for OpenVZ container.
 Turns out the normal Linux containers understand reboot() just fine
 too.
 
 Please note though that the problem with reboot() wrt upstream containers
 was really nasty -- calling reboot inside container resulted in
 rebooting the whole system, not just the container.

NB when libvirt starts an LXC container, it first checks to see whether
the kernel has the container aware reboot() support. If it does not,
then it removes CAP_SYS_REBOOT from the container, to prevent any
accidental whole system reboot. The sf.net LXC tools do the same thing.

Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] systemd coverity

2012-08-23 Thread Daniel P. Berrange
On Thu, Aug 23, 2012 at 03:04:23PM +0200, Zbigniew Jędrzejewski-Szmek wrote:
 On 08/23/2012 02:36 PM, Lennart Poettering wrote:
  maybe we should add macros like:
  
  #define _cleanup_free_ __attribute__((cleanup(freep)))
  #define _cleanup_fclose_ __attribute__((cleanup(fclosep)))
  
  What do you think?
 I personally think that this would be a welcome change like the
 #pragma once cleanup.
 
 __attribute__(cleanup) goes all the way back to gcc 3.3, so there's
 little reason not to use it.
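
For readers unfamiliar with the attribute, here is a small sketch of
how such a macro behaves with GCC or Clang. It is illustrative only:
freep() below is written for this example, freep_calls is a demo-only
counter added so the effect can be observed, and copy_len() is a
made-up caller.

```c
#include <stdlib.h>
#include <string.h>

static int freep_calls; /* demo-only: counts cleanup invocations */

/* Cleanup handler: receives a pointer to the annotated variable. */
static void freep(void *p) {
    free(*(void **) p);
    freep_calls++;
}

#define _cleanup_free_ __attribute__((cleanup(freep)))

/* The copy is released automatically on every return path. */
static size_t copy_len(const char *s) {
    _cleanup_free_ char *copy = malloc(strlen(s) + 1);

    if (!copy)
        return 0; /* freep() still runs; free(NULL) is a no-op */

    strcpy(copy, s);
    return strlen(copy); /* evaluated before the cleanup runs */
}
```

The handler runs whenever the variable goes out of scope, so early
returns need no explicit free() calls.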
 
 
 On a related topic:
 maybe gotos should be used more often for structured cleanup.
 This might make the code slightly shorter, and also help avoid
 mistakes.
 
 diff --git src/journal/sd-journal.c src/journal/sd-journal.c
 index 0f7c02c..5e73a94 100644
 --- src/journal/sd-journal.c
 +++ src/journal/sd-journal.c
 @@ -1418,37 +1418,33 @@ static sd_journal *journal_new(int flags, const char *path) {
 
          if (path) {
                  j->path = strdup(path);
 -                if (!j->path) {
 -                        free(j);
 -                        return NULL;
 -                }
 +                if (!j->path)
 +                        goto free_1;
          }
 
          j->files = hashmap_new(string_hash_func, string_compare_func);
 -        if (!j->files) {
 -                free(j->path);
 -                free(j);
 -                return NULL;
 -        }
 +        if (!j->files)
 +                goto free_2;
 
          j->directories_by_path = hashmap_new(string_hash_func, string_compare_func);
 -        if (!j->directories_by_path) {
 -                hashmap_free(j->files);
 -                free(j->path);
 -                free(j);
 -                return NULL;
 -        }
 +        if (!j->directories_by_path)
 +                goto free_3;
 
          j->mmap = mmap_cache_new();
 -        if (!j->mmap) {
 -                hashmap_free(j->files);
 -                hashmap_free(j->directories_by_path);
 -                free(j->path);
 -                free(j);
 -                return NULL;
 -        }
 +        if (!j->mmap)
 +                goto free_4;
 
          return j;
 +
 +free_4:
 +        hashmap_free(j->files);
 +free_3:
 +        hashmap_free(j->directories_by_path);
 +free_2:
 +        free(j->path);
 +free_1:
 +        free(j);

If you make sure that your pointer vars are all initialized to NULL,
and that all free functions accept NULL, then you can collapse all
those separate labels into one. This is much nicer, because then
you don't need to go about re-numbering if you need to insert another
goto in the middle of the function.
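
A minimal sketch of the single-label pattern, using a made-up struct
rather than sd_journal (struct buffers, buffers_new() and
buffers_free() are hypothetical names, and the lengths are assumed to
be non-zero):

```c
#include <stdlib.h>

struct buffers {
    char *in;
    char *out;
};

/* Safe to call on NULL and on a partially initialised object. */
void buffers_free(struct buffers *b) {
    if (!b)
        return;
    free(b->out); /* free(NULL) is a no-op */
    free(b->in);
    free(b);
}

struct buffers *buffers_new(size_t in_len, size_t out_len) {
    struct buffers *b = malloc(sizeof(*b));

    if (!b)
        return NULL;

    /* Initialise every pointer before the first possible failure. */
    b->in = NULL;
    b->out = NULL;

    b->in = malloc(in_len);
    if (!b->in)
        goto error;

    b->out = malloc(out_len);
    if (!b->out)
        goto error;

    return b;

error:
    /* One label covers every failure point. */
    buffers_free(b);
    return NULL;
}
```

Besides avoiding renumbering, this removes the need to keep a stack of
labels in exactly the reverse order of acquisition, which is easy to
get wrong.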


Daniel


Re: [systemd-devel] Re-exec()ing services for 'systemctl restart' ?

2012-08-09 Thread Daniel P. Berrange
On Wed, Aug 08, 2012 at 07:07:38PM +0200, Lennart Poettering wrote:
 On Mon, 06.08.12 16:52, Daniel P. Berrange (berra...@redhat.com) wrote:
 
  For libvirt, we (will soon) have a daemon (virtlockd) which maintains
  exclusive fcntl() based locks on disk images/devices, on behalf of both
  libvirtd and any running QEMU or LXC instances. This is a safety critical
  daemon (hence separate from libvirtd), to the extent that if the daemon
  stops / crashes, the entire host should be immediately fenced using a
  kernel watchdog and/or hardware power control device.
  
  We still want to be able to restart this daemon during RPM upgrades to
  newer versions, but we can't use a normal stop+start sequence, because
  that will lose locks for any active VMs. Thus the daemon has the ability
  to re-exec() itself when triggered by SIGUSR1, preserving its critical state.
  I've read the manpages for .service, .exec, etc., but I've not seen any
  reference to changing config such that
  
# systemctl restart virtlockd.service
  
  will simply send SIGUSR1 to the process, instead of stopping it and then
  starting it again. Obviously I could make the RPM %post send SIGUSR1
  directly and ignore systemctl, but that doesn't help admins who just
  expect to use systemctl. So I want to know if there is a recommended
  way to handle this kind of use case ?
 
 This is fundamentally difficult to implement, simply because restarting
 a service also means that the services binding to it need restarting
 too. And the ordering of that gets impossible if the stop/start sequence
 is atomic because it is done internally in the service, and cannot be
 split into two steps that we can order freely against each other.
 
 So, I fear we cannot really add this for you. As Kay suggested
 systemctl kill is probably your best choice here, or maybe systemctl
 reload.

Ok, thanks for explaining the issues - I think I'll just use systemctl kill
for now.

Regards,
Daniel


[systemd-devel] Re-exec()ing services for 'systemctl restart' ?

2012-08-06 Thread Daniel P. Berrange
For libvirt, we (will soon) have a daemon (virtlockd) which maintains
exclusive fcntl() based locks on disk images/devices, on behalf of both
libvirtd and any running QEMU or LXC instances. This is a safety critical
daemon (hence separate from libvirtd), to the extent that if the daemon
stops / crashes, the entire host should be immediately fenced using a
kernel watchdog and/or hardware power control device.

We still want to be able to restart this daemon during RPM upgrades to
newer versions, but we can't use a normal stop+start sequence, because
that will lose locks for any active VMs. Thus the daemon has the ability
to re-exec() itself when triggered by SIGUSR1, preserving its critical state.
I've read the manpages for .service, .exec, etc., but I've not seen any
reference to changing config such that

  # systemctl restart virtlockd.service

will simply send SIGUSR1 to the process, instead of stopping it and then
starting it again. Obviously I could make the RPM %post send SIGUSR1
directly and ignore systemctl, but that doesn't help admins who just
expect to use systemctl. So I want to know if there is a recommended
way to handle this kind of use case ?

Regards,
Daniel


Re: [systemd-devel] Re-exec()ing services for 'systemctl restart' ?

2012-08-06 Thread Daniel P. Berrange
On Mon, Aug 06, 2012 at 06:04:11PM +0200, Kay Sievers wrote:
 On Mon, Aug 6, 2012 at 5:52 PM, Daniel P. Berrange berra...@redhat.com wrote:
  For libvirt, we (will soon) have a daemon (virtlockd) which maintains
  exclusive fcntl() based locks on disk images/devices, on behalf of both
  libvirtd and any running QEMU or LXC instances. This is a safety critical
  daemon (hence separate from libvirtd), to the extent that if the daemon
  stops / crashes, the entire host should be immediately fenced using a
  kernel watchdog and/or hardware power control device.
 
  We still want to be able to restart this daemon during RPM upgrades to
  newer versions, but we can't use a normal stop+start sequence, because
  that will lose locks for any active VMs. Thus the daemon has the ability
  to re-exec() itself when triggered by SIGUSR1, preserving its critical state.
  I've read the manpages for .service, .exec, etc., but I've not seen any
  reference to changing config such that
 
# systemctl restart virtlockd.service
 
  will simply send SIGUSR1 to the process, instead of stopping it and then
  starting it again. Obviously I could make the RPM %post send SIGUSR1
  directly and ignore systemctl, but that doesn't help admins who just
  expect to use systemctl. So I want to know if there is a recommended
  way to handle this kind of use case ?
 
 $ systemctl reload ... ?

I thought about reload, but using that to re-exec the daemon seemed
a little evil to me, since it was really for just reloading config
files.

 or with the signal specified:
 
 $ systemctl kill ...

True, I guess I could use that.

I'm inferring from your response that there's no way to customize what
'restart' does, as you can do with stop/start/reload/etc.

Daniel


[systemd-devel] Systemd usage wrt libvirt-sandbox

2012-03-01 Thread Daniel P. Berrange
The libvirt-sandbox project[1] is providing an API and command line tools for
constructing application sandboxes. It uses either LXC or KVM virtualization
via libvirt, to confine execution of an application binary, giving it a
read-only view of the host root filesystem, with custom writable areas
grafted onto selected paths. eg if running httpd inside a sandbox, we give
it a private /etc/httpd and /var/www, etc.

The idea is to get the security isolation benefits of virtualization
technology, without the administrative burden of extra OS installs
that it normally entails. As such the only processes running inside
each sandbox are the application being confined, and a minimal custom
init binary provided by libvirt-sandbox itself.

As we expand our use cases though, particularly to cover the secure
containers feature[2] in Fedora 17, it is clear that if we're not
careful, our minimal libvirt-sandbox-init-common binary is going
to turn into a poor man's copy of systemd. We want to avoid that, and
instead actually make use of systemd directly.

Since the sandbox shares the same root filesystem as the host, we
can't simply exec 'systemd' as is. We'll need to setup a few custom
writable mounts, where we write out custom units / targets, and
let systemd keep any state.

So I'm trying to figure out just what is the absolute minimal setup we
can configure for systemd. Our primary target for development is to
sandbox apache. So I'd like to figure out what minimal config / directory
structure I need to create to run systemd and have it only run apache,
and a login shell (for debug inside the sandbox).

I'm guessing that I can perhaps get away with setting up an override
of the host's /etc/systemd, and writing out custom basic.target
and default.target unit files, which merely run httpd.unit and
a shell?

Regards,
Daniel

[1] http://berrange.com/tags/libvirt-sandbox/
http://libvirt.org/git/?p=libvirt-sandbox.git;a=summary
https://fedoraproject.org/wiki/Features/VirtSandbox

[2] https://fedoraproject.org/wiki/Features/SecureContainers


Re: [systemd-devel] RFC: Cooperating in the cgroup tree

2011-08-18 Thread Daniel P. Berrange
On Fri, Aug 19, 2011 at 01:25:16AM +0200, Lennart Poettering wrote:
 Heya,
 
 I put together a short recommendations document explaining how
 applications making use of cgroups should try to behave in the cgroupfs
 trees. Since the trees are shared resources everybody should behave
 nicely and not muck with everybody else's cgroups.
 
 http://www.freedesktop.org/wiki/Software/systemd/PaxControlGroups
 
 I hope these rules are something everybody involved with cgroup client
 side management can agree to.
 
 For now, I'd just like to ask for comments on this on this ML. I'll
 publish this on a wider scale later on.
 
 So, is there anything I forgot? Anything you are missing on this list?
 Happy to hear your thoughts and ideas!

I think this looks like a good set of rules for app developers
to follow.

Regards,
Daniel