Re: [systemd-devel] udev virtio by-path naming
On Wed, Mar 01, 2017 at 07:28:46PM +0100, Viktor Mihajlovski wrote:
> On 01.03.2017 16:58, Daniel P. Berrange wrote:
> > given a basic Fedora 25 guest, with a virtio-mmio disk added as per the
> > guide above...
> >
> > looking at device
> > '/devices/platform/a003e00.virtio_mmio/virtio3/block/vda':
> > KERNEL=="vda"
> > SUBSYSTEM=="block"
> > DRIVER==""
> > ATTR{alignment_offset}=="0"
> > ATTR{badblocks}==""
> > ATTR{cache_type}=="write back"
> > ATTR{capability}=="50"
> > ATTR{discard_alignment}=="0"
> > ATTR{ext_range}=="256"
> > ATTR{inflight}==" 00"
> > ATTR{range}=="16"
> > ATTR{removable}=="0"
> > ATTR{ro}=="0"
> > ATTR{serial}==""
> > ATTR{size}=="2097152"
> > ATTR{stat}==" 940 4208 28500
> > 0
> >00 100 280"
> >
> > looking at parent device '/devices/platform/a003e00.virtio_mmio/virtio3':
> > KERNELS=="virtio3"
> > SUBSYSTEMS=="virtio"
> > DRIVERS=="virtio_blk"
> > ATTRS{device}=="0x0002"
> >
> > ATTRS{features}=="00101011011111
> > 00"
> > ATTRS{status}=="0x0007"
> > ATTRS{vendor}=="0x554d4551"
> >
> > looking at parent device '/devices/platform/a003e00.virtio_mmio':
> > KERNELS=="a003e00.virtio_mmio"
> > SUBSYSTEMS=="platform"
> > DRIVERS=="virtio-mmio"
> > ATTRS{driver_override}=="(null)"
> Since I can't do that on my box, would you be so kind to run
> ls -l /dev/disk/by-path
> If it returns ids like
> virtio-pci-a003e00.virtio_mmio[-partn]
> my suggested patch should be OK for ARM in that it will produce ids in
> the format
> platform-a003e00.virtio_mmio[-partn]

Ok, my guest has 4 disks

 - sda - virtio-scsi, over virtio-pci transport
 - sdb - virtio-scsi, over virtio-mmio transport
 - vda - virtio-blk, over virtio-pci transport
 - vdb - virtio-blk, over virtio-mmio transport

With systemd 231 I get these links:

  platform-3f00.pcie-pci-:00:01.1-virtio-pci-:02:00.0-scsi-0:0:0:0 -> ../../sda
  platform-3f00.pcie-pci-:00:01.3-virtio-pci-:04:00.0 -> ../../vda
  virtio-pci-a003c00.virtio_mmio -> ../../vdb
  virtio-pci-a003e00.virtio_mmio-scsi-0:0:0:0 -> ../../sdb

After applying your patch I get these links:

  platform-3f00.pcie-pci-:00:01.1-virtio-pci-:02:00.0-scsi-0:0:0:0 -> ../../sda
  platform-3f00.pcie-pci-:00:01.3-virtio-pci-:04:00.0 -> ../../vda
  platform-3f00.pcie-pci-:02:00.0-scsi-0:0:0:0 -> ../../sda
  platform-3f00.pcie-pci-:04:00.0 -> ../../vda
  platform-a003c00.virtio_mmio -> ../../vdb
  platform-a003e00.virtio_mmio-scsi-0:0:0:0 -> ../../sdb
  virtio-pci-a003c00.virtio_mmio -> ../../vdb
  virtio-pci-a003e00.virtio_mmio-scsi-0:0:0:0 -> ../../sdb

So that appears to be working as designed - the 4 backcompat symlinks
are still there, and the new symlinks all live under the platform-
prefix and don't have a bogus 'pci' in the name for mmio links.

Regards,
Daniel
-- 
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://entangle-photo.org -o- http://search.cpan.org/~danberr/ :|
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel
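The relationship between the sysfs device path and the two link names for virtio-mmio disks can be sketched as follows. This is only an illustration of the naming seen in the listings above, not udev's actual path_id code; the helper name `mmio_by_path_names` and its `suffix` parameter are hypothetical, and only the virtio-mmio layout from this thread is handled.

```python
import re

# Hypothetical helper, not udev's path_id implementation: it only understands
# the /devices/platform/<addr>.virtio_mmio/... layout shown in this thread.
_MMIO = re.compile(r"/devices/platform/(?P<dev>[0-9a-f]+\.virtio_mmio)/")


def mmio_by_path_names(devpath, suffix=""):
    """Return (legacy, new) by-path link names for a virtio-mmio devpath.

    `suffix` carries anything appended after the transport part,
    for example "-scsi-0:0:0:0" for a virtio-scsi LUN.
    """
    match = _MMIO.search(devpath)
    if match is None:
        raise ValueError("not a virtio-mmio device: %s" % devpath)
    dev = match.group("dev")
    # Legacy name carries the misleading "virtio-pci-" prefix; the new name
    # uses the real parent subsystem ("platform").
    return "virtio-pci-" + dev + suffix, "platform-" + dev + suffix


legacy, new = mmio_by_path_names(
    "/devices/platform/a003c00.virtio_mmio/virtio4/block/vdb")
print(legacy)
print(new)
```

Both names are returned because the patched rules keep the old link for backwards compatibility alongside the new one.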
Re: [systemd-devel] udev virtio by-path naming
On Wed, Mar 01, 2017 at 03:58:12PM +0000, Daniel P. Berrange wrote:
> On Wed, Mar 01, 2017 at 04:02:53PM +0100, Viktor Mihajlovski wrote:
> > If wanted, I can take a stab at virtio-mmio, but would need the output
> > of udevadm -a /dev/vda from a virtio-mmio system.
>
> Presumably you mean 'udevadm info -a /dev/vda' ? That reports the following,
> given a basic Fedora 25 guest, with a virtio-mmio disk added as per the
> guide above...
>
> looking at device '/devices/platform/a003e00.virtio_mmio/virtio3/block/vda':

BTW, the hex digits in here are the virtio mmio address, which changes
per device, eg if I have 3 virtio-mmio backed disks, I get

  looking at device '/devices/platform/a003a00.virtio_mmio/virtio3/block/vda':
  looking at device '/devices/platform/a003c00.virtio_mmio/virtio4/block/vdb':
  looking at device '/devices/platform/a003e00.virtio_mmio/virtio5/block/vdc':

Regards,
Daniel
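To make the point about the hex digits concrete, here is a small sketch that pulls the MMIO base address out of each of the three device paths quoted above. The function name `mmio_base` is made up for this example; it just decodes the hex prefix of the platform device name.

```python
import re


def mmio_base(devpath):
    """Return the MMIO base address encoded in a virtio-mmio devpath, or None."""
    match = re.search(r"/([0-9a-f]+)\.virtio_mmio/", devpath)
    return int(match.group(1), 16) if match else None


paths = [
    "/devices/platform/a003a00.virtio_mmio/virtio3/block/vda",
    "/devices/platform/a003c00.virtio_mmio/virtio4/block/vdb",
    "/devices/platform/a003e00.virtio_mmio/virtio5/block/vdc",
]
for path in paths:
    # Print the decoded address next to the block device name.
    print(hex(mmio_base(path)), path.rsplit("/", 1)[-1])
```

Each disk decodes to a different address, which is why the platform device name is usable as a per-device identifier.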
Re: [systemd-devel] udev virtio by-path naming
On Wed, Mar 01, 2017 at 04:02:53PM +0100, Viktor Mihajlovski wrote:
> On 01.03.2017 04:30, Zbigniew Jędrzejewski-Szmek wrote:
> > On Tue, Feb 28, 2017 at 09:47:42AM +0100, Viktor Mihajlovski wrote:
> >> One could argue about back-level compatibility, but virtio by-path
> >> naming has changed multiple times. We have seen virtio-pci-virtio
> >> (not predictable), pci- and virtio-pci- already. It
> >> might be a good time now to settle on a common approach for all
> >> virtio types.
> >>
> >> For the reasons above, I'd vote for -, which
> >> would work for PCI and CCW, not sure about ARM MMIO though.
> >
> > It seems that there's agreement that - is the right
> > approach.
> >
> > Ideally we would keep the virtio-pci- links as they appear
> > right now, for backwards compatibility, just for the pci devices, and
> > mark them as deprecated (dunno where, maybe just in NEWS), and add the
> > code to make the links.
> >
> > I haven't looked at the code, maybe we just do this with the right
> > udev rule, and also stick the deprecation comment there?
> >
> > Zbyszek
> >
> I've posted a github pull request [1], and would appreciate review
> feedback. As I am lacking an ARM setup, it would also be nice if someone
> with ARM skills could have a look as well.

FYI you can install ARMv7 guests on an x86_64 host, using pre-built
Fedora images

  https://fedoraproject.org/wiki/QA:Testcase_Virt_ARM_on_x86

NB, this will install the guest using virtio-pci. So if you want to see
virtio-mmio in action, you'll need to edit the libvirt XML config
afterwards to add another disk, eg

> If wanted, I can take a stab at virtio-mmio, but would need the output
> of udevadm -a /dev/vda from a virtio-mmio system.

Presumably you mean 'udevadm info -a /dev/vda' ? That reports the
following, given a basic Fedora 25 guest, with a virtio-mmio disk added
as per the guide above...

  looking at device '/devices/platform/a003e00.virtio_mmio/virtio3/block/vda':
    KERNEL=="vda"
    SUBSYSTEM=="block"
    DRIVER==""
    ATTR{alignment_offset}=="0"
    ATTR{badblocks}==""
    ATTR{cache_type}=="write back"
    ATTR{capability}=="50"
    ATTR{discard_alignment}=="0"
    ATTR{ext_range}=="256"
    ATTR{inflight}==" 00"
    ATTR{range}=="16"
    ATTR{removable}=="0"
    ATTR{ro}=="0"
    ATTR{serial}==""
    ATTR{size}=="2097152"
    ATTR{stat}==" 940 4208 285000 00 100 280"

  looking at parent device '/devices/platform/a003e00.virtio_mmio/virtio3':
    KERNELS=="virtio3"
    SUBSYSTEMS=="virtio"
    DRIVERS=="virtio_blk"
    ATTRS{device}=="0x0002"
    ATTRS{features}=="00101011011111 00"
    ATTRS{status}=="0x0007"
    ATTRS{vendor}=="0x554d4551"

  looking at parent device '/devices/platform/a003e00.virtio_mmio':
    KERNELS=="a003e00.virtio_mmio"
    SUBSYSTEMS=="platform"
    DRIVERS=="virtio-mmio"
    ATTRS{driver_override}=="(null)"

  looking at parent device '/devices/platform':
    KERNELS=="platform"
    SUBSYSTEMS==""
    DRIVERS==""

Regards,
Daniel
Re: [systemd-devel] udev virtio by-path naming
On Mon, Feb 20, 2017 at 04:14:32PM +0100, Lennart Poettering wrote:
> On Mon, 20.02.17 15:34, Viktor Mihajlovski (mihaj...@linux.vnet.ibm.com)
> wrote:
>
> > But then, I find this naming scheme somewhat weird.
> > A virtio disk shows up as a regular PCI function on the PCI
> > bus side by side with other (non-virtio) devices. The naming otoh
> > suggests that virtio-pci is a subsystem of its own, which is simply
> > incorrect from a by-path perspective.
> >
> > Using just the plain PCI path id is actually sufficient to identify
> > a virtio disk by its path. This would be in line with virtio
> > network interface path names which use the plain PCI naming.
> >
> > One could argue about back-level compatibility, but virtio by-path
> > naming has changed multiple times. We have seen virtio-pci-virtio
> > (not predictable), pci- and virtio-pci- already. It
> > might be a good time now to settle on a common approach for all
> > virtio types.
> >
> > For the reasons above, I'd vote for -, which
> > would work for PCI and CCW, not sure about ARM MMIO though.
> > Opinions?

Virtio MMIO devices are identified by a unique control register base
address, eg 0x3000. So I think - would work fine for all cases: PCI,
CCW & MMIO. Certainly it is more correct than hardcoding virtio-pci as
a prefix - that's just plain broken for non-PCI transports.

> So, to make this clear, we in systemd are kinda interested in
> splitting out these virtio helpers into some external project
> maintained by virtio people. We as systemd/udev maintainers have very
> little understanding of the underlying technology, so we can't really
> be any good maintainers of this, and we can't really comment on this
> stuff, in particular when it gets more exotic, like the CCW stuff.
>
> Even better would be if the kernel would do the naming on its own, and
> maybe just provide us with a sysattr on the relevant devices that we
> can read to determine the path from, so that we don't have to maintain
> this at all in userspace. That way, the driver folks on the kernel
> side can use any naming they like without ever having to patch this
> into systemd or udev.
>
> This is similar to SCSI stuff and all things like that: the more
> exotic it gets the less place this really has in systemd, we are not
> the right maintainers for this. And given that this is all nicely
> pluggable (you can ship your own udev extensions externally very
> easily), there's really no reason for this to be in systemd/udev.

The other post about ptp-kvm rules reminded me that I wanted to respond
to this mail too.

The problem with splitting these rules out into a separate project is
that there's no other existing place where they would live. The "virtio
people" as a group merely write specifications. The actual
implementation of those specs is done by multiple other independent
groups - QEMU (for host side, though other host side impls exist too)
and Linux (for guest side). The udev rules are Linux guest support
pieces, but of course Linux itself doesn't distribute udev rules - it
delegated that job to the udev package, hence why they are here
currently. So I don't see that pushing the rules out of the udev repo
would be beneficial to people building VMs.

> Anyway, I fear you're going to have a hard time involving us in a
> technical discussions about the issue you are raising, since quite
> frankly we have no clue about virtio...

Could it be as simple as having a couple of people nominated as the
technical point of contact for the virtio rules, who can be CC'd to
answer any questions that come up ?

I don't have time to actively monitor systemd pull requests for changes
affecting virtio, but I'd be ok with being pinged if issues come up
that need assistance & can pull in other virt experts where needed.

Regards,
Daniel
Re: [systemd-devel] [PATCH] udev rules: add udev rule to create /dev/ptp_kvm
On Mon, Feb 27, 2017 at 11:50:59PM -0300, Marcelo Tosatti wrote:
> On Sun, Feb 26, 2017 at 09:52:18PM +0100, Lennart Poettering wrote:
> > On Thu, 23.02.17 22:20, Marcelo Tosatti (mtosa...@redhat.com) wrote:
> >
> > >
> > > It's necessary to specify the KVM PTP device name in userspace.
> > >
> > > In case a network card with PTP device is assigned to the guest,
> > > it might be the case that KVM PTP gets /dev/ptp0 instead of /dev/ptp1.
> > >
> > > Fix a device name for the KVM PTP device.
> >
> > What's the symlink precisely good for, can you elaborate?
>
> You want to configure Chrony to use PTP in the guest to sync with the
> host.
>
> You need to add an entry to /etc/chrony.conf pointing to "/dev/ptp0",
> the ptp_kvm device.
>
> However, it might be the case that a PCI assigned device has a PTP
> clock, and it can be registered as "/dev/ptp0" and ptp_kvm as
> "/dev/ptp1".
>
> > Also, what's the benefit of shipping this upstream? Why not ship that
> > rule with kvm?
>
> qemu-kvm package? Sure I can do that, but then all distributions
> have to do the same with their own packages.

qemu-kvm is installed in the host OS only, but this rule needs to be
set in the guest OS. You could bundle it in with the qemu-guest-agent
RPM, but that's not really a directly related package, so we'd likely
have to create a new package for this and try to get distros to ensure
it is installed in all guest OS. We've had qemu-guest-agent for years
now and we've still not got all distros installing it.

So not shipping this kind of rule with udev means that it'll almost
certainly end up being missing in the majority of guest installs for
many years to come.

Regards,
Daniel
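The ordering problem described in the quoted text (ptp_kvm may be /dev/ptp0 or /dev/ptp1 depending on what else registers a PTP clock) can be illustrated without a udev rule by looking the device up by clock name instead of by index. This is a minimal sketch, not the proposed rule: it assumes the ptp_kvm driver reports the clock name "KVM virtual PTP" in the sysfs `clock_name` attribute, and the helper names and the sample data are made up for illustration.

```python
import glob
import os

# Assumption: the clock name reported by the ptp_kvm driver.
KVM_CLOCK_NAME = "KVM virtual PTP"


def read_clock_names(sys_class_ptp="/sys/class/ptp"):
    """Map each ptpN device name to the contents of its clock_name attribute."""
    names = {}
    for path in sorted(glob.glob(os.path.join(sys_class_ptp, "ptp*"))):
        try:
            with open(os.path.join(path, "clock_name")) as handle:
                names[os.path.basename(path)] = handle.read().strip()
        except OSError:
            continue  # device vanished or attribute unreadable
    return names


def find_kvm_ptp(names, wanted=KVM_CLOCK_NAME):
    """Return the /dev path of the first clock whose name matches, or None."""
    for dev in sorted(names):
        if names[dev] == wanted:
            return "/dev/" + dev
    return None


# Made-up sample: an assigned NIC grabbed ptp0, so ptp_kvm ended up as ptp1.
sample = {"ptp0": "some NIC clock", "ptp1": KVM_CLOCK_NAME}
print(find_kvm_ptp(sample))
# On a real guest: print(find_kvm_ptp(read_clock_names()))
```

A stable /dev/ptp_kvm symlink created by udev achieves the same thing declaratively, so that chrony.conf does not need a lookup step like this.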
Re: [systemd-devel] [PATCH] systemd: add RDTCacheReservation= option to support CAT (Cache Allocation Technology)
On Fri, Jan 06, 2017 at 01:51:17PM -0200, Marcelo Tosatti wrote:
> On Fri, Jan 06, 2017 at 05:26:36PM +0200, Mantas Mikulėnas wrote:
> > On Fri, Jan 6, 2017 at 3:59 PM, Marcelo Tosatti wrote:
> >
> > >
> > > Cache Allocation Technology is a feature on selected recent Intel Xeon
> > > processors which allows control over L3 cache allocation.
> > >
> > > Kernel support has been merged to the upstream kernel, via a filesystem
> > > resctrlfs.
> > >
> > > On top of that, a userspace utility, resctrltool has been written
> > > to facilitate writing applications and using the filesystem
> > > interface (see the rationale at
> > > http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1300792.html).
> > >
> > > This patch adds a new option to systemd, RDTCacheReservation,
> > > to allow configuration of CAT via resctrltool.
> > >
> > > See the first hunk of the patch for a description of the option
> >
> >
> > This really doesn't look pretty, neither the approach nor the
> > implementation...
>
> Suggestions to improve the code or the approach are welcome.
>
> > Is the option actually so complex that calling resctrltool is the only way
> > to adjust it? What about writing to the resctrlfs directly?
>
> You'll have to deal with the issues that resctrltool deals with,
> namely:
>
> 1) Filesystem locking.
> 2) Reading in every directory and the default
> directory.
> 3) Converting the reservation request to proper sizes.
> 4) Converting:
> type=both --> type=data/type=code
>
> type=data/type=code --> type=both
>
> 4) Finding free space for the reservation.
> 5) Adjusting the default group reservation.
>
> Since these steps must be performed by every user of
> CAT (including libvirt which plans to execute resctrltool
> as well), it was decided it's better to maintain this logic
> in a centralized place.

Errr, no, that's not correct - Libvirt is certainly not going to spawn
some python program to do this. If there's no C library API for this,
libvirt will simply implement all the logic itself.

Regards,
Daniel
Re: [systemd-devel] deny access to GPU devices
On Mon, Nov 14, 2016 at 12:35:17PM +0100, Lennart Poettering wrote:
> On Sat, 12.11.16 07:43, Topi Miettinen (toiwo...@gmail.com) wrote:
>
> > On 11/11/16 20:09, Lennart Poettering wrote:
> > > I have no idea what "slurm" is, but do note that the "devices" cgroup
> > > controller has no future, it is unlikely to ever become available in
> > > cgroupsv2.
> >
> > This is unwelcome news, I think it is a simple and well contained MAC
> > that has been available in systems without a full blown MAC like SELinux
> > and with systemd support it has been very easy to set up. What will
> > happen to DevicePolicy, DeviceAllow etc. directives? Or will systemd
> > stick to cgroupsv1 forever?
>
> No, our plan is to switch to cgroupsv2 as default as quickly as we
> can. Where "quickly as we can" means mostly: the "cpu" controller is
> ported to cgroupsv2 in vanilla kernels.
>
> The thing with the "devices" cgroup controller is that it is not about
> resource control, but about access control, and hence should not live
> in "cgroups" at all, but in some other framework. "cgroups" is all
> about dynamic resource control and accounting, but "devices" doesn't
> fit that at all, hence it should move elsewhere.
>
> We'll keep DeviceAllow/DevicePolicy around for now, and there's a TODO
> list item to implement at least the "m" part of it via seccomp, as a
> second level of protection that will still work even if cgroupsv2 is
> used. I think in the long run it might make sense to also do the "rw"
> part of it somehow in the kernel, via some new kernel subsystem, but
> we'll have to see if and how this will be implemented.

Since there is support for stackable LSMs now, I could see the cgroup
devices ACL feature being replaced with a new LSM. I imagine if
stackable LSMs had been supported back in cgroup v1 days, it probably
would have been done that way in the first place instead of adding MAC
to cgroups.

Regards,
Daniel
Re: [systemd-devel] [libvirt] How to make udev not touch my device?
On Fri, Nov 11, 2016 at 05:01:40PM +0100, Michal Sekletar wrote:
> On Fri, Nov 11, 2016 at 2:20 PM, Daniel P. Berrange <berra...@redhat.com>
> wrote:
>
> > What kind of issues ?
>
> General problem with manually created device nodes is that udev and
> systemd do not know about them. Device units do not exist for these
> device nodes. Hence these device units can not be a dependency of some
> other unit. Typical example is manually created device node referenced
> from /etc/fstab. Then corresponding mount unit is bound to a device
> that never shows up and hence it always fails to mount even though
> device node is there.

Ok, that sounds irrelevant to libvirt's usage wrt QEMU, so I don't see
any problem for us here.

Regards,
Daniel
Re: [systemd-devel] [libvirt] How to make udev not touch my device?
On Fri, Nov 11, 2016 at 02:15:38PM +0100, Michal Sekletar wrote:
> On Mon, Nov 7, 2016 at 1:20 PM, Daniel P. Berrange <berra...@redhat.com>
> wrote:
>
> > So if libvirt creates a private mount namespace for each QEMU and mounts
> > a custom /dev there, this is invisible to udev, and thus udev won't/can't
> > mess with permissions we set in our private /dev.
> >
> > For hotplug, the libvirt QEMU would do the same as the libvirt LXC driver
> > currently does. It would fork and setns() into the QEMU mount namespace
> > and run mknod()+chmod() there, before doing the rest of its normal hotplug
> > logic. See lxcDomainAttachDeviceMknodHelper() for what LXC does.
>
> We try to migrate people away from using mknod and messing with /dev/
> from user-space. For example, we had to deal with non-trivial problems
> wrt. mknod and Veritas storage stack in the past (most of these issues

What kind of issues ?

> remain unsolved to date). I don't like to hear that you plan to get
> into /dev management business in libvirt too. I am judging based on
> past experiences, nevertheless, I don't like this plan.

Libvirt is already doing this for its LXC driver, populating a private
/dev with only the devices permitted for the container in question.

> Also, managing separate mount namespace for each qemu process and
> forking helper that joins the namespace to do some work seems quite
> complex too.

Again, libvirt is already doing this for LXC so it's not any great
burden.

Regards,
Daniel
Re: [systemd-devel] [libvirt] How to make udev not touch my device?
On Mon, Nov 07, 2016 at 01:11:14PM +0100, Michal Privoznik wrote:
> On 07.11.2016 10:17, Daniel P. Berrange wrote:
> > On Fri, Nov 04, 2016 at 08:47:34AM +0100, Michal Privoznik wrote:
> >> Hey udev developers,
> >>
> >> I'm a libvirt developer and I've been facing an interesting issue
> >> recently. Libvirt is a library for managing virtual machines and as such
> >> allows basically any device to be exposed to a virtual machine. For
> >> instance, a virtual machine can use /dev/sdX as its own disk. Because of
> >> security reasons we allow users to configure their VMs to run under
> >> different UID/GID and also SELinux context. That means that whenever a
> >> VM is being started up, libvirtd (our daemon we have) relabels all the
> >> necessary paths that QEMU process (representing VM) can touch.
> >> However, I'm facing an issue that I don't know how to fix. In some cases
> >> QEMU can close & reopen a block device. However, closing a block device
> >> triggers an event and hence if there is a rule that sets a security
> >> label on a device the QEMU process is unable to reopen the device again.
> >>
> >> My question is, what we can do to prevent udev from mangling with our
> >> security labels that we've set on the devices?
> >>
> >> One of the ideas our lead developer had was for libvirt to set some kind
> >> of udev label on devices managed by libvirt (when setting up security
> >> labels) and then whenever udev sees such labelled device it won't touch
> >> it at all (this could be achieved by a rule perhaps?). Later, when
> >> domain is shutting down libvirt removes that label. But I don't think
> >> setting an arbitrary label on devices is supported, is it?
> >
> > Having thought about this over the weekend, I'm strongly inclined to
> > just take udev out of the equation by starting a new mount namespace
> > for each QEMU we launch and setting up a custom /dev containing just
> > the devices we need. This will be both a security improvement and
> > avoid the udev races, with no complex code required in libvirt and
> > will work for libvirt all the way back to RHEL6
>
> How would this work with device hotplug, i.e. I start a domain with some
> set of devices. Then I bring up an iSCSI target (which appears under
> /dev) and how does one 'transfer' the device into the new namespace?
> BTW: can you elaborate more on udev-namespace relations? Doesn't udev
> run in the namespaces too?

A single process can only ever be in a single namespace at any point in
time and udev only ever runs in the initial namespaces. When running
containers you never have udev inside them, and udev certainly doesn't
interact with arbitrary namespaces created by other applications for
their own purposes.

So if libvirt creates a private mount namespace for each QEMU and
mounts a custom /dev there, this is invisible to udev, and thus udev
won't/can't mess with permissions we set in our private /dev.

For hotplug, the libvirt QEMU would do the same as the libvirt LXC
driver currently does. It would fork and setns() into the QEMU mount
namespace and run mknod()+chmod() there, before doing the rest of its
normal hotplug logic. See lxcDomainAttachDeviceMknodHelper() for what
LXC does.

Regards,
Daniel
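The hotplug sequence described above (fork, setns() into the target mount namespace, then mknod()+chmod()) can be sketched roughly as below. This is not libvirt's code, which is C (see lxcDomainAttachDeviceMknodHelper() for the real thing); the function names are hypothetical, the added chown() step is an assumption about how ownership would be handed to the QEMU user, and the helper needs root plus Python 3.12 or newer for `os.setns()`.

```python
import os
import stat


def block_node_args(major, minor, perms=0o660):
    """Return the (mode, device) pair that os.mknod() expects for a block node."""
    return stat.S_IFBLK | perms, os.makedev(major, minor)


def mknod_in_mount_ns(pid, path, major, minor, uid, gid, perms=0o660):
    """Fork a helper that joins the mount namespace of `pid` and creates `path`.

    Hypothetical sketch: requires root (CAP_SYS_ADMIN, CAP_MKNOD) and
    Python 3.12+. Returns True if the helper exited successfully.
    """
    mode, dev = block_node_args(major, minor, perms)
    child = os.fork()
    if child == 0:
        status = 1
        try:
            # Join the target's mount namespace; only the forked child moves,
            # the parent stays in its original namespace.
            fd = os.open("/proc/%d/ns/mnt" % pid, os.O_RDONLY)
            os.setns(fd, os.CLONE_NEWNS)
            os.close(fd)
            os.mknod(path, mode, dev)
            os.chmod(path, perms)  # mknod's mode is filtered by the umask
            os.chown(path, uid, gid)
            status = 0
        finally:
            os._exit(status)
    _, wait_status = os.waitpid(child, 0)
    return os.waitstatus_to_exitcode(wait_status) == 0


# Only the argument computation is exercised here; it needs no privileges.
mode, dev = block_node_args(8, 16)
print(stat.S_ISBLK(mode), os.major(dev), os.minor(dev))
```

Doing the work in a short-lived child is what keeps the daemon itself out of the private namespace, which matches the "fork and setns()" wording in the message.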
Re: [systemd-devel] [libvirt] How to make udev not touch my device?
On Fri, Nov 04, 2016 at 08:47:34AM +0100, Michal Privoznik wrote:
> Hey udev developers,
>
> I'm a libvirt developer and I've been facing an interesting issue
> recently. Libvirt is a library for managing virtual machines and as such
> allows basically any device to be exposed to a virtual machine. For
> instance, a virtual machine can use /dev/sdX as its own disk. Because of
> security reasons we allow users to configure their VMs to run under
> different UID/GID and also SELinux context. That means that whenever a
> VM is being started up, libvirtd (our daemon we have) relabels all the
> necessary paths that QEMU process (representing VM) can touch.
> However, I'm facing an issue that I don't know how to fix. In some cases
> QEMU can close & reopen a block device. However, closing a block device
> triggers an event and hence if there is a rule that sets a security
> label on a device the QEMU process is unable to reopen the device again.
>
> My question is, what we can do to prevent udev from mangling with our
> security labels that we've set on the devices?
>
> One of the ideas our lead developer had was for libvirt to set some kind
> of udev label on devices managed by libvirt (when setting up security
> labels) and then whenever udev sees such labelled device it won't touch
> it at all (this could be achieved by a rule perhaps?). Later, when
> domain is shutting down libvirt removes that label. But I don't think
> setting an arbitrary label on devices is supported, is it?

Having thought about this over the weekend, I'm strongly inclined to
just take udev out of the equation by starting a new mount namespace
for each QEMU we launch and setting up a custom /dev containing just
the devices we need. This will be both a security improvement and avoid
the udev races, with no complex code required in libvirt and will work
for libvirt all the way back to RHEL6

Regards,
Daniel
Re: [systemd-devel] Fedora 25, cgroups V2 and systemd roadmap
On Tue, Oct 11, 2016 at 11:30:40PM +0300, Kevin Wilson wrote:
> Hello, Daniel,
>
> > We don't want to support out of tree kernel patches,
>
> This sounds very reasonable, I don't have anything against this policy.
>
> Still, I wonder: are you ruling out implementing "hybrid mode" (like
> Lennart uses in systemd) for libvirt? I mean a mode where you will use
> the 3 currently supported cgroup V2 controllers for libvirt (memory,
> io and pids; actually I don't know if you use the cgroups pids at all
> in libvirt, it is a new controller; BTW - do you ? ). And using other
> controllers (besides io, memory and pids) from cgroup V1

A controller can only be used in one mode at any time - so while
libvirt could potentially support using some in v1 mode and some in v2
mode, it only works if the OS distro has actually set up those
controllers in v1 mode - if they're in v2 mode, we'd be forced to use
them in v2 mode.

Regards,
Daniel
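Since each controller is bound to exactly one hierarchy type at a time, an application that wants to cope with a hybrid setup first has to find out which mode the distro picked for each controller. The sketch below is one way to do that, under stated assumptions: v1 controllers are recognised from the options of `cgroup` mounts (as listed in /proc/mounts), v2 controllers from the contents of the `cgroup.controllers` file at the v2 root, and the function name, the controller list and the sample data are all made up for illustration.

```python
# Controller names that may appear as mount options on a v1 hierarchy.
# Note the v1 block IO controller is called "blkio" while v2 calls it "io".
V1_CONTROLLERS = {
    "blkio", "cpu", "cpuacct", "cpuset", "devices", "freezer", "hugetlb",
    "memory", "net_cls", "net_prio", "perf_event", "pids", "rdma",
}


def controller_modes(mounts_text, v2_controllers_text=""):
    """Return {controller: "v1" | "v2"} from /proc/mounts-style text.

    `v2_controllers_text` is the content of cgroup.controllers at the
    root of the cgroup2 mount (space separated controller names).
    """
    modes = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] != "cgroup":
            continue  # not a v1 cgroup mount
        for option in fields[3].split(","):
            if option in V1_CONTROLLERS:
                modes[option] = "v1"
    for name in v2_controllers_text.split():
        modes[name] = "v2"
    return modes


# Made-up hybrid layout: cpu and devices on v1, the rest on v2.
SAMPLE_MOUNTS = """\
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup2 /sys/fs/cgroup/unified cgroup2 rw,nosuid,nodev,noexec,relatime 0 0
"""
for name, mode in sorted(controller_modes(SAMPLE_MOUNTS, "io memory pids").items()):
    print(name, mode)
```

The application has no say in the result: it can only read the layout and then use each controller through whichever interface it is attached to.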
Re: [systemd-devel] Fedora 25, cgroups V2 and systemd roadmap
On Mon, Oct 10, 2016 at 05:30:35PM +0000, Jóhann B. Guðmundsson wrote:
> On 10/10/2016 04:46 PM, Lennart Poettering wrote:
>
> > I still hope that Fedora can go the Facebook route, and just patch the
> > stuff in, and ignore the fight going on in the kernel community.
>
> That won't fly by the kernel sub community in Fedora in which they are doing
> whatever they can not having to carry out of tree patches and wind up in the
> same scenario they have been in with "Secure Boot" for the past what 3 - 5
> years now.
>
> I'm pretty sure that every downstream distribution has already realized that
> the longer they carry patch or patches that exist out of tree, the harder
> they get to maintain without extra support as in additional manpower in
> maintaining the kernel for that distribution and will also choose not to
> carry those patches.

Yeah, it won't really fly from libvirt POV either. We don't want to
support out of tree kernel patches, because history has shown that
causes long term pain in the (fairly likely) event that the patches get
changed before finally merging.

Regards,
Daniel
Re: [systemd-devel] Fedora 25, cgroups V2 and systemd roadmap
On Mon, Oct 10, 2016 at 02:43:33PM +0200, Lennart Poettering wrote: > On Mon, 10.10.16 14:31, Kevin Wilson (wkev...@gmail.com) wrote: > > > Hello, systemd developers, > > So we have now 3 V2 cgroups controller in the kernel (pids, memory and io). > > The CPU controller as of now is not merged in and is available only in > > an out of tree git repo (due to some debate over > > it with kernel scheduler developers). Not sure that it will be merged > > in the next 2 months. > > > > Fedora 25 is to be released in a month and a half, on 15 of November. > > https://fedoraproject.org/wiki/Releases/25/Schedule > > My questions are: > > what are the intentions regarding using cgroup v2 in systemd in F25 > > as the default instead of using cgroup V1? > > Is the absence of the CPU controller is a reason for not having > > cgroup V2 as a default in F25 ? and if so, why ? > > I'd like to switch this over sooner rather than later in Fedora, but I > figure we can't do that, unless relevant other upstreams can deal with > the new hierarchy too. I figure on Fedora, that'd be at least libvirt > and Docker that need to be updated for this. > > I figure we should start turning this on in Rawhide, and see what > breaks, and then revert before the release. > > Before we can tell Docker/libvirt to port their stuff over I figure we > also need one more addition in the systemd API for this: next to > Delegate=yes|no (which we already have) we probably need to add > DelegateController= taking a list of all controllers to > delegate. Right now we delegate all controllers, but I figure that > should be configurable, since turning on a controller might have > effects people don't expect (in particular for the cpu hierarchy). From the libvirt POV, getting the CPU controller support merged is a blocking item, otherwise we have a major feature regression. 
We have a policy of not supporting code that is out of tree, as we don't like getting burnt when changes are inevitably made after it is finally accepted. Regards, Daniel
Re: [systemd-devel] Fedora 25, cgroups V2 and systemd roadmap
On Mon, Oct 10, 2016 at 02:31:55PM +0300, Kevin Wilson wrote: > Hello, systemd developers, > So we have now 3 V2 cgroups controller in the kernel (pids, memory and io). > The CPU controller as of now is not merged in and is available only in > an out of tree git repo (due to some debate over > it with kernel scheduler developers). Not sure that it will be merged > in the next 2 months. > > Fedora 25 is to be released in a month and a half, on 15 of November. > https://fedoraproject.org/wiki/Releases/25/Schedule > My questions are: > what are the intentions regarding using cgroup v2 in systemd in F25 > as the default instead of using cgroup V1? > Is the absence of the CPU controller is a reason for not having > cgroup V2 as a default in F25 ? and if so, why ? Ignoring the question of Fedora switching, more generally, if any OS were to switch to cgroup v2 right now it would break a number of applications that use cgroups v1 today. v2 is not a plain no-op drop-in replacement for v1, as they have pretty different rules around the hierarchy management. Applications that create/manage cgroups properties/dirs need to be manually ported, not merely systemd itself. The absence of CPU controller support would also be a functional regression for some applications, effectively preventing use of cgroup v2 even if they were ported. Regards, Daniel -- |: http://berrange.com -o-http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://entangle-photo.org -o-http://search.cpan.org/~danberr/ :| ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/systemd-devel
Re: [systemd-devel] machined: after CPU offline then online, vcpupin KVM guest failed to start
On Fri, Aug 05, 2016 at 12:33:21PM +0200, Dr. Werner Fink wrote:
> On Fri, Aug 05, 2016 at 11:07:50AM +0200, Lennart Poettering wrote:
> > On Thu, 04.08.16 16:19, Cedric Bosdonnat (cbosdon...@suse.com) wrote:
> >
> > > Hi Lennart and Werner,
> > >
> > > On Wed, 2016-08-03 at 16:56 +0200, Lennart Poettering wrote:
> > > > On Wed, 03.08.16 14:46, Dr. Werner Fink (werner at suse.de) wrote:
> > > > > problem with v228 (and I guess this is also later AFAICS from logs of
> > > > > current git) that repeating CPU hotplug events (offline/online). The
> > > > > root cause is that cpuset.cpus become not restored by machined.
> > > > > Please note that libvirt can not do this as it is not allowed to do
> > > > > so.
> > > >
> > > > This is a limitation of the kernel cpuset interface, and it's one of
> > > > the reasons we do not expose cpusets at all in systemd right
> > > > now. Thankfully, there's an alternative to cpusets, which is the CPU
> > > > affinity controls exposed via CPUAffinity= in systemd, which do much
> > > > of the same, but have less borked semantics.
> > > >
> > > > We'd like to support cpusets directly in systemd, but we don't do this
> > > > as long as the kernel interfaces are as borked as they are. For
> > > > example, cpusets are flushed out entirely currently when the system
> > > > goes through a suspend/resume cycle.
> > > >
> > > > If libvirt has hook-ups with cpuset, then it bypasses systemd for
> > > > that.
> > >
> > > I guess by CPU affinity you mean sched_setaffinity and friends. If that is
> > > the case, then this is constrained by cpuset too as mentioned here:
> > >
> > > http://www.mjmwired.net/kernel/Documentation/cpusets.txt#53
> > >
> > > As long as the machine.slice cpuset isn't restored after onlining a CPU again,
> > > then libvirt won't be able to set either the affinity or the cpuset if it
> > > contains that CPU.
> > >
> > > May be the kernel's behaviour is weird and can be discussed, but libvirt can't
> > > do anything on that bug.
> >
> > Yeah, to make this clear: I do not blame libvirt for this borkedness
> > at all. I blame the kernel.
>
> Hmmm ... IMHO it is useless to pass the buck from kernel to user space
> as well do the same from user space back to kernel. I've an open bug
> from a customer and this bug requires a solution. AFAICS libvirt can
> not do this but machined could do.

It is not simply a problem wrt virtual machines, it affects any application which is using the cpuset controller. VMs are just one such user. So it would be inappropriate to do it in machined.

Fixing it in userspace is complicated by the fact that different levels or branches in the cgroup hierarchy are managed by different applications, with no single application having a single world view. Even if systemd itself did have support for the cpuset controller, it would still not have a global view of all cgroups, as applications can create further child cgroups below the groups managed by systemd, which systemd doesn't track.

Trying to restore correct CPU affinity after hotplug would thus require that multiple userspace applications all be aware of the problem and contain logic to fix their part of the hierarchy. This is further complicated by the ordering constraints that would require top levels to be fixed before child levels.

Bearing all this in mind, fixing it in userspace is an incredibly hard problem which will always be liable to race conditions between applications.

The only choices that are practical are a) not use the cpuset controller at all, or b) fix the kernel so that it maintains 2 distinct bitmaps, one for the set of online CPUs, and one for the configured affinity in the cpuset, and thus avoid throwing away data on CPU unplug/plug.

Regards, Daniel
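Option (b) above can be sketched in a few lines. This is only an illustrative model, not kernel code: the class name `CpusetModel` and its methods are hypothetical, and Python sets stand in for the kernel's CPU bitmaps. The point it tries to show is that keeping the configured mask separate from the online mask lets the effective mask be recomputed after a hotplug cycle instead of being lost.

```python
class CpusetModel:
    """Toy model of a cpuset that remembers its configured CPUs (hypothetical)."""

    def __init__(self, configured, online):
        self.configured = set(configured)  # what the admin asked for
        self.online = set(online)          # what is currently plugged in

    def effective(self):
        # CPUs that tasks may actually run on: configured AND online.
        return self.configured & self.online

    def cpu_offline(self, cpu):
        # Only the online mask changes; the configuration is kept.
        self.online.discard(cpu)

    def cpu_online(self, cpu):
        self.online.add(cpu)


cs = CpusetModel(configured={2, 3}, online=range(4))
cs.cpu_offline(3)
after_offline = cs.effective()  # CPU 3 drops out while it is offline
cs.cpu_online(3)
after_online = cs.effective()   # CPU 3 returns, nobody had to rewrite the config
```

With a single bitmap, the offline step would have overwritten the configuration itself, which is the data loss described in this thread.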
Re: [systemd-devel] Confining ALL processes to a CPUs/RAM via cpuset controller
On Wed, Jul 20, 2016 at 03:29:30PM +0200, Lennart Poettering wrote: > On Wed, 20.07.16 12:53, Daniel P. Berrange (berra...@redhat.com) wrote: > > > For virtualized hosts it is quite common to want to confine all host OS > > processes to a subset of CPUs/RAM nodes, leaving the rest available for > > exclusive use by QEMU/KVM. Historically people have used the "isolcpus" > > kernel arg todo this, but last year that had its semantics changed, so > > that any CPUs listed there also get excluded from load balancing by the > > schedular making it quite useless in general non-real-time use cases > > where you still want QEMU threads load-balanced across CPUs. > > > > So the only option is to use the cpuset cgroup controller to confine > > procosses. AFAIK, systemd does not have an explicit support for the cpuset > > controller at this time, so I'm trying to work out the "optimal" way to > > achieve this behind systemd's back while minimising the risk that future > > systemd releases will break things. > > Yes, we don't support this as of now, but we'd like to. The thing > though is that the kernel interface for it is pretty borked as it is > right now, and until that's not fixed we are unlikely going to support > this in systemd. (And as I understood Tejun the mem vs. cpu thing in > cpuset is probably not going to stay the way it is either) > > But note that the non-cgroup CPUAffinity= setting should be good > enough for many use cases. Are you sure that isn't sufficient for you? > > Also note that systemd supports setting a system-wide CPUAffinity= for > itself during early boot, thus leaving all unlisted CPUs free for > specific services where you use CPUAffinity= to change this default. Ah, interesting, I didn't notice you could set that globally. > > The key factor here is use of "Before" to ensure this gets run immediately > > after systemd switches root out of the initrd, and before /any/ long lived > > services are run. 
This lets us set cpuset placement on systemd (pid 1) > > itself and have that inherited by everything it spawns. I felt this is > > better than trying to move processes after they have already started, > > because it ensures that any memory allocations get taken from the right > > NUMA node immediately. > > > > Empirically this approach seems to work on Fedora 23 (systemd 222) and > > RHEL 7 (systemd 219), but I'm wondering if there's any pitfalls that I've > > not anticipated here. > > Yes, PID 1 was moved to the special scope unit init.scope as mentioned > above (in preparation for cgroupsv2 where inner cgroups can never > contain PIDs). This is likely going to break then. cgroupsv2 is likely to break many things once distros switch over, so I assume that wouldn't be done in a minor update - only a major new distro release so, not so concerning. > But again, I have the suspicion that CPUAffinity= might already > suffice for you? Yep, it looks like it should suffice for most people, unless they also wish to have memory node restrictions enforced from boot. Regards, Daniel -- |: http://berrange.com -o-http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :| ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/systemd-devel
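For readers unfamiliar with the non-cgroup affinity mechanism discussed above, here is a minimal sketch of the underlying call, sched_setaffinity, as exposed by Python's `os` module on Linux. The helper name `pin_to` is hypothetical, and this says nothing about how systemd applies CPUAffinity= internally; it only shows that a process can restrict itself to a CPU subset, which children then inherit across fork.

```python
import os


def pin_to(cpus):
    """Restrict the calling process to `cpus` and return the previous mask."""
    previous = os.sched_getaffinity(0)  # pid 0 means "this process"
    os.sched_setaffinity(0, cpus)
    return previous


allowed = os.sched_getaffinity(0)
first = min(allowed)

old = pin_to({first})            # confine ourselves to a single CPU
now = os.sched_getaffinity(0)

os.sched_setaffinity(0, old)     # put the original mask back
restored = os.sched_getaffinity(0)
```

Note that this only covers CPU placement. It does not restrict memory nodes, which is the gap mentioned at the end of the reply above.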
[systemd-devel] Confining ALL processes to a CPUs/RAM via cpuset controller
For virtualized hosts it is quite common to want to confine all host OS processes to a subset of CPUs/RAM nodes, leaving the rest available for exclusive use by QEMU/KVM. Historically people have used the "isolcpus" kernel arg to do this, but last year that had its semantics changed, so that any CPUs listed there also get excluded from load balancing by the scheduler, making it quite useless in general non-real-time use cases where you still want QEMU threads load-balanced across CPUs.

So the only option is to use the cpuset cgroup controller to confine processes. AFAIK, systemd does not have explicit support for the cpuset controller at this time, so I'm trying to work out the "optimal" way to achieve this behind systemd's back while minimising the risk that future systemd releases will break things.

As an example I have a host with 3 NUMA nodes, 12 CPUs and want to have all non-QEMU processes running on CPUs 0-2, leaving 3-11 available for QEMU machines.

So far my best solution looks like this:

  $ cat /etc/systemd/system/cpuset.service
  [Unit]
  Description=Restrict CPU placement
  DefaultDependencies=no
  Before=sysinit.target slices.target basic.target lvm2-lvmetad.service systemd-journald.service systemd-udevd.service

  [Service]
  Type=oneshot
  KillMode=none
  RemainAfterExit=yes
  ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/cpuset/system.slice
  ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/cpuset/machine.slice
  ExecStartPre=/bin/bash -c '/usr/bin/echo "0-2" > /sys/fs/cgroup/cpuset/system.slice/cpuset.cpus'
  ExecStartPre=/bin/bash -c '/usr/bin/echo "0" > /sys/fs/cgroup/cpuset/system.slice/cpuset.mems'
  ExecStartPre=/bin/bash -c '/usr/bin/echo "3-11" > /sys/fs/cgroup/cpuset/machine.slice/cpuset.cpus'
  ExecStartPre=/bin/bash -c '/usr/bin/echo "0-2" > /sys/fs/cgroup/cpuset/machine.slice/cpuset.mems'
  ExecStartPost=/bin/bash -c '/usr/bin/echo 1 > /sys/fs/cgroup/cpuset/system.slice/tasks'
  ExecStopPost=/usr/bin/rmdir /sys/fs/cgroup/cpuset/system.slice
  ExecStart=/bin/true

  [Install]
  WantedBy=multi-user.target

The key factor here is use of "Before" to ensure this gets run immediately after systemd switches root out of the initrd, and before /any/ long lived services are run. This lets us set cpuset placement on systemd (pid 1) itself and have that inherited by everything it spawns. I felt this is better than trying to move processes after they have already started, because it ensures that any memory allocations get taken from the right NUMA node immediately.

Empirically this approach seems to work on Fedora 23 (systemd 222) and RHEL 7 (systemd 219), but I'm wondering if there are any pitfalls that I've not anticipated here.

Conceptually I'm aiming for "Before=*" to say it should run before everything, but explicitly listing this set of units appears to be the best I can do.

Any thoughts / feedback / suggestions welcome on how to improve this.

Regards, Daniel
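The strings written to cpuset.cpus in the unit above ("0-2", "3-11") are CPU lists made of comma-separated numbers and ranges. As a small sanity check, the sketch below expands such strings and confirms that the host and guest sets do not overlap. The helper `parse_cpu_list` is hypothetical and handles only plain numbers and `a-b` ranges, not any other syntax the kernel may accept.

```python
def parse_cpu_list(text):
    """Expand a CPU list such as "0-2,5" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))  # ranges are inclusive
        else:
            cpus.add(int(part))
    return cpus


host = parse_cpu_list("0-2")     # value used for system.slice above
guests = parse_cpu_list("3-11")  # value used for machine.slice above
overlap = host & guests          # empty if the split is clean
```

The same helper could be pointed at the contents of a cpuset.cpus file after boot to confirm that the placement was applied.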
Re: [systemd-devel] Utility for persistent alternative driver binding
On Tue, Dec 08, 2015 at 09:14:17AM -0500, Charles (Chas) Williams wrote: > On Tue, 2015-12-08 at 11:34 +0200, Panu Matilainen wrote: > > Hmm, got a pointer? I dont think PCI slots change between reboots > > without physically swapping hardware, the "ethX-problem" comes from the > > order of device discovery being unstable across boots, which is a > > different issue and not relevant for this case. > > With virtual machines the ordering of PCI devices isn't always the same. > This is especially true with OpenStack which regenerates the > configurations on the fly. The only thing that seems to be consistent > is the MAC address. The MAC address is not guaranteed to be unique of course - you can have multiple NICs with the same MAC provided they're connected to different subnets. For OpenStack we're working on a feature that allows the user booting an OpenStack instance to associate an arbitrary tag with each device they have on their VM. The info about the tags and the currently assigned device addresses is then exposed to the guest OS via the metadata service and/or config drive and/or firmware. There will be a utility that can read this metadata and then register tags against the corresponding devices in the udev database. The idea is that people building OpenStack compatible cloud images can configure their image to look for the device in udev with the desired user "tag" string. That way they don't need to care about specific device addresses directly. 
More info on OpenStack plans is here: http://specs.openstack.org/openstack/nova-specs/specs/mitaka/approved/virt-device-role-tagging.html Regards, Daniel
Re: [systemd-devel] Utility for persistent alternative driver binding
On Tue, Dec 08, 2015 at 09:46:13AM -0500, Charles (Chas) Williams wrote: > On Tue, 2015-12-08 at 14:21 +0000, Daniel P. Berrange wrote: > > On Tue, Dec 08, 2015 at 09:14:17AM -0500, Charles (Chas) Williams wrote: > > > On Tue, 2015-12-08 at 11:34 +0200, Panu Matilainen wrote: > > > > Hmm, got a pointer? I dont think PCI slots change between reboots > > > > without physically swapping hardware, the "ethX-problem" comes from the > > > > order of device discovery being unstable across boots, which is a > > > > different issue and not relevant for this case. > > > > > > With virtual machines the ordering of PCI devices isn't always the same. > > > This is especially true with OpenStack which regenerates the > > > configurations on the fly. The only thing that seems to be consistent > > > is the MAC address. > > > > The MAC address is not guaranteed to be unique of course - you can have > > multiple NICs with the same MAC provided they're connected to different > > subnets. > > Typically not done but yes it can happen. I was referring specifically > to the OpenStack case though where it causes the most trouble for me. > > > For OpenStack we're working on a feature that allows the user booting > > an OpenStack instance to associate an arbitrary tag with each device > > they have on their VM. The info about the tags and the currently assigned > > device addresses is then exposed to the guest OS via the metadata service > > and/or config drive and/or firmware. There will be a utility that can read > > this metadata and then register tags against the corresponding devices in > > the udev database. > > > > The idea is that people building OpenStack compatible cloud images can > > configure their image to look for the device in udev with the desired > > user "tag" string. That way they don't need to care about specific > > device addresses directly. 
More info on OpenStack plans is here:
> >
> > http://specs.openstack.org/openstack/nova-specs/specs/mitaka/approved/virt-device-role-tagging.html
>
> That would be a huge step forward but I need this to work when someone
> hotplugs an interface to a running image. The config drive doesn't work
> here since it wouldn't have the tag information. The metadata service
> might work (although I think cloud-init blackholes the metadata service
> after boot). Exposing this via the PCI device with some ACPI mechanism
> would be nice.

Yep, config drive is obviously useless for hotplug, but the metadata service is usable in the short term. There is QEMU work to let us expose data via the firmware, but I'm not sure that helps hotplug really. The ideal would really be a virtio based filesystem like virtio-9p except without the suckiness of 9p. A future virtio-nfs might be best.

Regards, Daniel
Re: [systemd-devel] RFC: removing initctl support
On Thu, Sep 24, 2015 at 03:51:16PM +0200, Tomasz Torcz wrote: > On Thu, Sep 24, 2015 at 03:01:21PM +0200, Lennart Poettering wrote: > > That stackexchange link lists a pile of garbage. We have an official > > API to check whether the system is booted with systemd: > > sd_booted(). It's documented here: > > > > http://www.freedesktop.org/software/systemd/man/sd_booted.html > > > > And we even document on that man page what precisely it does > > internally (which is equivalent to access("/run/systemd/system/", > > F_OK) >= 0) and suggest people to reimplement that simple check in the > > language of their choice, even in shell... That way, they don't even > > have to link against libsystemd. > > And then there is this sabotage: > > „This check is already broken, because uselessd creates this directory too” > > uselessd% git grep 'mkdir.*/run/systemd/system' > src/core/main-no-init.c:mkdir_label("/run/systemd/system", 0755); > src/core/mount-setup.c:mkdir_label("/run/systemd/system", 0755); > src/core/unit.c:mkdir_p("/run/systemd/system", 0755); Well if it wants to claim to be systemd, then it is responsible for providing the same API/ABIs as systemd. IOW it must respond to SIGRTMIN+4 in the same way, etc. Regards, Daniel -- |: http://berrange.com -o-http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :| ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel
Re: [systemd-devel] RFC: removing initctl support
On Tue, Sep 22, 2015 at 02:31:25AM +0200, Lennart Poettering wrote: > Heya! > > Since a long time systemd has been shipping with two-way compat > support for /dev/initctl, and I am tempted to remove it. Before I do > so, I'd like some input on the relevance of this interface: > > a) there's support in systemctl to reboot the system by sending the >right bytes to /dev/initctl as fallback, so that you can reboot a >sysvinit system with "systemctl reboot". > > b) There's a mini-daemon "systemd-initctl.service" that is >fifo-activated on /dev/initctl, and forwards reboot requests from >old sysvinit clients to systemd. > > Both of this was supposed to help transition between sysvinit and > systemd systems: if you mix sysvinit clients with a systemd init > system and vice versa, you can still use the the tools to reboot the > other system. > > I'd claim the interface is borderline useless: the only operation you > can actually readlly properly dispatch with it is rebooting, and > reloading PID1. And that's pretty much it. > > We never even really used this stuff on Fedora properly (since we > actually transitioned from Upstart, not sysvinit, and we never had the > same level of compat for that...). > > This code has been bitrotting for a while, and nobody really cared. > > And most importantly: the entire protocol use by sysvinit via > /dev/initctl is deeply flawed, since it sends messages over > /dev/initctl that are not a divisor of PIPE_SIZE in length. Thus, if > PID 1 didn't read messages quick enough the messages queued could be > half-written and be partially interleaved with another client's > messages, and there is no way the system can ever recover from that. > > Thus, I'd really like to kill this. Does anybody care about it, and > can give me a strong enough reason to keep this anyway? 
The libvirt virDomainShutdown|Reboot APIs for triggering controlled shutdown/reboots of guest OS have support for using /dev/initctl with containers, as it was the lowest common denominator that easily worked across systemd, sysvinit & upstart. We could add further code to use a systemd specific interface if needed, so it wouldn't be the end of the world if /dev/initctl was removed, but it'd be nice to not have to do that. Regards, Daniel
Re: [systemd-devel] RFC: removing initctl support
On Tue, Sep 22, 2015 at 12:48:21PM +0200, Lennart Poettering wrote: > On Tue, 22.09.15 11:41, Daniel P. Berrange (berra...@redhat.com) wrote: > > > > One more addendum to the original mail: > > > > > > We already declared the interface "obsolete" in the docs, which makes > > > me particularly keen on dropping it... > > > > I guess one thing is that even if support for /dev/intctl in systemd, > > it is an optional unit file, so libvirt probably needs to deal with > > the SIGRTMIN+4 stuff anyway, for case where the person building > > the container has that unit file disabled. So from that POV, deleting > > it won't make current situation that much worse. > > Also, we support builds with all legacy cruft disabled, which is > something where inictl currently is not disabled, but if we kept it > should really be disable under... So I figure even if we keep the > general support in this is not an interface you can rely on... > > To make the SIGRTMIN+4 reliable all you need to do is check first if > /run/systemd/system/ exists. Ok, that's easy enough. So no objection from libvirt if you remove /dev/initctl in future releases Regards, Daniel -- |: http://berrange.com -o-http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :| ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel
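The fallback agreed on above (check for /run/systemd/system/, then send SIGRTMIN+4 to PID 1) can be sketched as follows. This is not libvirt's code: the function names and the `root` parameter are hypothetical, with `root` standing in for wherever the container's filesystem is visible from the host.

```python
import os
import signal


def booted_with_systemd(root="/"):
    """Approximate the documented sd_booted() check under `root`."""
    return os.path.isdir(os.path.join(root, "run/systemd/system"))


def request_poweroff(init_pid, root="/"):
    """Send SIGRTMIN+4 to init_pid, but only if `root` looks like systemd.

    Returns True if the signal was sent, False if the check failed and
    nothing was sent.
    """
    if not booted_with_systemd(root):
        return False
    os.kill(init_pid, signal.SIGRTMIN + 4)
    return True
```

Doing the check first matters for the reason given earlier in the thread: a container may run an arbitrary non-init program as PID 1, and that program should not be sent a realtime signal it does not expect.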
Re: [systemd-devel] RFC: removing initctl support
On Tue, Sep 22, 2015 at 12:32:25PM +0200, Lennart Poettering wrote: > On Tue, 22.09.15 10:11, Daniel P. Berrange (berra...@redhat.com) wrote: > > > > And most importantly: the entire protocol use by sysvinit via > > > /dev/initctl is deeply flawed, since it sends messages over > > > /dev/initctl that are not a divisor of PIPE_SIZE in length. Thus, if > > > PID 1 didn't read messages quick enough the messages queued could be > > > half-written and be partially interleaved with another client's > > > messages, and there is no way the system can ever recover from that. > > > > > > Thus, I'd really like to kill this. Does anybody care about it, and > > > can give me a strong enough reason to keep this anyway? > > > > The libvirt virDomainShutdown|Reboot APIs for triggering controlled > > shutdown/reboots of guest OS have support for using /dev/initctl with > > containers, as it was the lowest common denominator that easily worked > > across systemd, sysvinit & upstart. > > Ah, I see... But I wasn't aware Upstart even implemented that... Maybe it wasn't actually upstart, but one of the other init systems. I just recall getting a patch from Debian folks to support it via the /run/initctl path, rather than /dev, and assumed that was upstart related. > > We could add further code to use a systemd specific interface if > > needed, so it wouldn't be the end of the world of /dev/initctl was > > removed, but it'd be nice to not have todo that. > > A simple fall back could be to send SIGRTMIN+4 to PID 1, if > /dev/initctl is not around. Yep, though we'd have to actually check that PID 1 is systemd, since if you run a container with a non-init program as PID 1, we don't want to be sending it SIGRTMIN+4 :-) > One more addendum to the original mail: > > We already declared the interface "obsolete" in the docs, which makes > me particularly keen on dropping it... 
I guess one thing is that even if support for /dev/initctl stays in systemd, it is an optional unit file, so libvirt probably needs to deal with the SIGRTMIN+4 stuff anyway, for the case where the person building the container has that unit file disabled. So from that POV, deleting it won't make the current situation that much worse. Regards, Daniel
Re: [systemd-devel] How to set time from Perl
On Mon, Sep 07, 2015 at 04:23:42PM +0200, Manuel Reimer wrote:
> Hello,
>
> if I run the following code on an intel based platform, then I don't have
> any problems:
>
> use Net::DBus;
> my $bus = Net::DBus->system();
> my $logind = $bus->get_service('org.freedesktop.timedate1');
> my $manager = $logind->get_object('/org/freedesktop/timedate1',
>                                   'org.freedesktop.timedate1');
> $manager->SetTime($time * 1000000, 0, 0);
>
> The variable "$time" is in seconds.
>
> If I run this on an ARM based system, then I get the following time:
>
> # date
> Thu Jan 1 01:00:02 CET 1970
>
> Does someone have an idea why this doesn't work?

By "ARM system" do you mean 32-bit ARMv7, or 64-bit AArch64?

Based on the behaviour you describe, I'm thinking you are most likely on 32-bit ARMv7. Perl integers on 32-bit are only 32-bit in length, and the SetTime() method needs a 64-bit integer, since it is representing the time in microseconds. So when you do $time * 1000000 you are probably getting integer truncation.

Net::DBus can deal with 64-bit integers, but you need to provide them as the Perl string type, not integer type, so the Net::DBus XS module can do a safe conversion to 64-bit without truncation.

So instead of doing

    $manager->SetTime($time * 1000000, 0, 0);

try doing

    $manager->SetTime($time . "000000", 0, 0);

which will convert $time to string type, and then append 6 zeros.

Regards, Daniel
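The arithmetic behind this advice is easy to check. The sketch below is written in Python only because its integers never overflow, so it does not reproduce Perl's behaviour; it just shows that a present-day timestamp in microseconds cannot fit in 32 bits, and that appending six zeros to the decimal string gives the same value as multiplying by one million. The sample timestamp and the helper name `usec_as_string` are illustrative assumptions.

```python
UINT32_MAX = 2**32 - 1


def usec_as_string(seconds):
    """Build the microsecond value by appending six zeros to the seconds."""
    return str(int(seconds)) + "000000"


seconds = 1441635822             # roughly the date of this thread
usec = seconds * 1_000_000       # what SetTime() expects, as a 64-bit value
fits_in_32_bits = usec <= UINT32_MAX
as_string = usec_as_string(seconds)
```

Because the string form never passes through a native 32-bit integer, it sidesteps the suspected truncation entirely.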
Re: [systemd-devel] [ANNOUNCE] Git development moved to github
On Tue, Jun 02, 2015 at 04:34:03PM +0200, Martin Pitt wrote: David Herrmann [2015-06-02 13:06 +0200]: Our preferred way to send future patches is the github way. This means sending pull-requests to the github repo. Furthermore, all feature patches should go through pull-requests and should get reviewed pre-commit. This applies to everyone. Exceptions are non-controversial patches like typos and obvious bug-fixes. Makes sense. On the operational level, should we use the automatically merge feature of git hub once approving? On the plus side it's very convenient, but you'll get one Merge commit for every PR (which is often just one commit), so we'd almost double the entries in git log. Or can github be told to not do that? Merging manually is quite a bit of work, as you have to add a new remote every time, fetch that, and pull from it. But it does keep a cleaner git log history. FWIW, 'git log --no-merges' displays the clean history when merges are present. Regards, Daniel -- |: http://berrange.com -o-http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :| ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel
Re: [systemd-devel] dynamic uid allocation (was: [PATCH] loopback setup in unprivileged containers)
On Tue, Feb 03, 2015 at 06:05:00PM +0100, Lennart Poettering wrote:
> On Tue, 03.02.15 16:34, Serge Hallyn (serge.hal...@ubuntu.com) wrote:
> > > the UID/GID on entire filesystem sub-trees given to containers with userns is a real unpleasant thing to have to deal with. I'd not want
> >
> > Of course you would *not* want to take a stock rootfs where uid == 0 and shift that into the container, as that would give root in the container a chance to write root-owned files on the host to leverage later in a convoluted attack :)
>
> Is this really a problem? I mean, the only way this could be exploitable is if people make the container hierarchy accessible to other users, but that should be easy to prohibit by making the container's parent dir 0700, which we already do for nspawn's containers in /var/lib/machines...
>
> The only other risk I can see here is that if people use traditional ext4 quota, then the container's disk usage will be added to the host's usage. But that's easy to avoid, by simply never placing container images and the host on the same quota device...
>
> Also, in the case of systemd-nspawn we strongly emphasize usage with loopback devices. In that case there's no vulnerability at all, since the device is completely separate from the host fs, and it will only be mounted in the container, but not in the host...

NB, the container filesystem is visible via /proc/$PID/root, but I agree with you in general. I don't see a reason to avoid the scenario Serge mentioned. Indeed I think it is important that we explicitly support it, because ultimately I think we need to be able to take any arbitrary disk image and safely boot it in either a container or a virtual machine. ie we should not have to build custom images just for containers - any such need should be considered a failure of the technology / impl IMHO.

> > We might want to come up with a containers consensus that container rootfs's are always shipped with uid range 0-65535 -> 100000-165535. That still leaves a chance for container A (mapped to 200000-265535) to write a valid setuid-root binary for container B (mapped to 300000-365535), which isn't possible otherwise. But that's better than doing so for host-root.
>
> Well, ultimately I'd recommend an automatism like this for container managers:
>
> a) if not otherwise configured, let's give each container their own 16bit of uids. This would mean each 32bit uid could be neatly split into the upper 16bit that would become a container id, plus the lower 16bit for the actual virtual UID.
>
> b) we will never set up UID ranges orthogonal from GID ranges.
>
> c) when a container image is started, the container manager first checks the UID/GID owner of the root of the root file system. It masks the lower 16bit away, and only looks at the upper 16bit.
>
> d) It will then look for an unused container id (which means, an unused range of 64K UIDs), and then shift the offset it identified following c) to this new container id.
>
> With that in place it doesn't really matter which base people use in their containers, the container manager would do the right thing, and shift everything into the right place. Paranoid people could ship their container images shifted to some ID of their choice, and lazy folks could just ship their container images with base 0, but then must make sure they don't give anybody else access to the hierarchy, and don't confuse quota...

Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel
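Steps a) to d) in the message above can be sketched roughly as follows. This is only an illustration in Python, not code from any container manager: the function names are made up, container id 0 is assumed to be reserved for the host, and the same shift is assumed to apply to GIDs (step b).

```python
UID_BLOCK = 1 << 16  # step a: 65536 UIDs per container


def image_base(root_owner_uid):
    """Step c: mask away the lower 16 bits of the rootfs owner's UID."""
    return root_owner_uid & ~(UID_BLOCK - 1)


def pick_container_id(used_ids):
    """Step d: lowest unused container id (id 0 is assumed to be the host)."""
    container_id = 1
    while container_id in used_ids:
        container_id += 1
    return container_id


def uid_shift(root_owner_uid, used_ids):
    """Return (container_id, shift); shift is added to every on-disk UID/GID."""
    container_id = pick_container_id(used_ids)
    target_base = container_id * UID_BLOCK
    return container_id, target_base - image_base(root_owner_uid)
```

With this, an image shipped with base 0 and an image shipped pre-shifted to some other 64K-aligned base both end up in whatever block happens to be free. Note that a base such as 100000 is not 64K-aligned, so the masking in step c) only works cleanly for images shifted by a multiple of 65536.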
Re: [systemd-devel] dynamic uid allocation (was: [PATCH] loopback setup in unprivileged containers)
On Tue, Feb 03, 2015 at 03:41:22PM +0100, Lennart Poettering wrote:
> On Tue, 30.12.14 06:49, Simon Peeters (peeters.si...@gmail.com) wrote:
> > 2014-12-29 14:14 GMT+00:00 Tom Gundersen t...@jklm.no:
> > > On Mon, Dec 29, 2014 at 2:34 PM, Lennart Poettering lenn...@poettering.net wrote:
> > > <snip>
> > > > I am open to adding support for this, but I think the allocation of the UID ranges should really happen automatically, and not be something the admin has to manually assign. Which means we'd enter dynamic UID allocation territory, and that opens a huge can of worms...
> > >
> > > Would we not also need to support explicit assignment, in case someone has a preexisting image they want to match in a specific way? In that case we could start off without the dynamic allocation and add that later. It certainly would make testing a lot simpler if we had userns support sooner rather than later (at least in the case of netlink it appears to be quite a mess).
> >
> > Inspired by this topic I wrote a quick'n'dirty uid allocator[1]. This allocator manages the upper 2G uids, which, using Matthias Urlichs' example of 2048 uids per container, still allows for 1M containers. It currently can't persist these allocations, but that is on my 0.0.1 todo list.
>
> Hmm, so, I thought a lot about this in the past weeks. I think the way I'd really like to see this work in the end is that we never have to persist the UID mappings. This could work if the kernel would provide us with the ability to bind mount a file system into the container applying a fixed UID shift. That way, the shifted UIDs would never hit the actual disk, and hence we wouldn't have to persist their mappings. Instead on each container startup we'd look for a new UID range, and release it entirely when the container shuts down. The bind mount with UID shift would then shift the UIDs up, the userns stuff would shift it down from inside the container again.
>
> Of course, this all depends on whether the kernel will get an extension to apply uid shifts to bind mounts. I hear they want to provide this, but let's see.

I would dearly love to see that happen. Having to recursively change the UID/GID on entire filesystem sub-trees given to containers with userns is a really unpleasant thing to have to deal with. I'd not want the filesystem UID shift to only apply to bind mounts though. It is not uncommon to use a disk image[1] for a container's filesystem, so being able to request a UID shift on *any* filesystem mount is pretty desirable, rather than having to mount the image and then bind mount it onto itself just to apply the UID shift.

Regards,
Daniel

[1] Using a separate disk image per container means a container can't DOS other containers by exhausting inodes, for example with $millions of small files.
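The "shift up at mount time, shift down in the user namespace" idea can be checked with a little arithmetic. The sketch below is purely illustrative: the shifting mount is the hypothetical kernel feature discussed above, and the function names and the 64K range size are assumptions.

```python
RANGE = 1 << 16  # assumed size of the per-container UID range


def mount_view(disk_uid, shift):
    """UID seen on the host through the (hypothetical) shifting mount."""
    return disk_uid + shift


def userns_view(host_uid, shift, count=RANGE):
    """UID seen inside the container; None if the host UID is not mapped."""
    if shift <= host_uid < shift + count:
        return host_uid - shift
    return None


def container_view(disk_uid, shift):
    """Compose both steps: what the container sees for an on-disk UID."""
    return userns_view(mount_view(disk_uid, shift), shift)
```

Because the two shifts cancel, the container sees the on-disk UIDs unchanged whatever range was picked for this particular boot, which is why nothing would need to be persisted. The capacity figure quoted in the thread also checks out: 2^31 UIDs divided into blocks of 2048 gives 2^20, roughly 1M containers.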
Re: [systemd-devel] [PATCH] perl-Net-DBus + new interactive authorization
On Mon, Jan 12, 2015 at 11:37:12AM +0000, Colin Guthrie wrote:
> Angelo Naselli wrote on 12/01/15 10:30:
> > On 12/01/2015 10:16, Colin Guthrie wrote:
> > > Angelo Naselli wrote on 11/01/15 17:15:
> > > > FWIW I rebuilt it on Mageia 4 with libdbus1_3-1.6.18-1.8.mga4 and I haven't had any issues, of course. Using it as a user for StartUnit/StopUnit, for instance, I got only a different exception, org.freedesktop.DBus.Error.AccessDenied: Rejected send message, while using it as root worked as before even though I used the new API.
> > >
> > > This is expected unless you have also backported cauldron systemd to MGA4. I think we discussed before that the interactive authorisation stuff was only added in more recent versions of systemd, so this is entirely what I'd expect here.
> >
> > eh eh eh, mine was only to say that even though I haven't disabled anything and compiled everything against the old library, I didn't see any crash or regression, just a different exception for the same -not working- thing. But I haven't tested anything, of course :)
>
> Oh, right! Gotcha, so this is just about the comment regarding the conditional compilation against older libdbus. Sorry for misinterpreting and thanks for testing that.
>
> OK, I'll have a further look at it to see if anything special is needed, perhaps the perl binding stuff just works without too much faff here for non-present APIs (I'm certainly not an expert with this stuff!).

I think you'll just need some #ifdef magic in the DBus.xs file to deal with the new APIs being missing. Perhaps just write a stub function in DBus.xs that raises a suitable perl error (see the _croak_error source in DBus.xs for an example of raising errors).

Regards,
Daniel
Re: [systemd-devel] [PATCH] perl-Net-DBus + new interactive authorization
On Mon, Jan 12, 2015 at 12:04:42PM +0000, Colin Guthrie wrote:
> Daniel P. Berrange wrote on 12/01/15 11:40:
> > On Mon, Jan 12, 2015 at 11:37:12AM +0000, Colin Guthrie wrote:
> > > Angelo Naselli wrote on 12/01/15 10:30:
> > > > On 12/01/2015 10:16, Colin Guthrie wrote:
> > > > > Angelo Naselli wrote on 11/01/15 17:15:
> > > > > > FWIW I rebuilt it on Mageia 4 with libdbus1_3-1.6.18-1.8.mga4 and I haven't had any issues, of course. Using it as a user for StartUnit/StopUnit, for instance, I got only a different exception, org.freedesktop.DBus.Error.AccessDenied: Rejected send message, while using it as root worked as before even though I used the new API.
> > > > >
> > > > > This is expected unless you have also backported cauldron systemd to MGA4. I think we discussed before that the interactive authorisation stuff was only added in more recent versions of systemd, so this is entirely what I'd expect here.
> > > >
> > > > eh eh eh, mine was only to say that even though I haven't disabled anything and compiled everything against the old library, I didn't see any crash or regression, just a different exception for the same -not working- thing. But I haven't tested anything, of course :)
> > >
> > > Oh, right! Gotcha, so this is just about the comment regarding the conditional compilation against older libdbus. Sorry for misinterpreting and thanks for testing that.
> > >
> > > OK, I'll have a further look at it to see if anything special is needed, perhaps the perl binding stuff just works without too much faff here for non-present APIs (I'm certainly not an expert with this stuff!).
> >
> > I think you'll just need some #ifdef magic in the DBus.xs file to deal with the new APIs being missing. Perhaps just write a stub function in DBus.xs that raises a suitable perl error (see the _croak_error source in DBus.xs for an example of raising errors).
>
> Perhaps, but I think in this case it would be better to simply silently ignore the error, as this is more of a nice additional feature rather than a core part. I think if someone wrote some perl code that took advantage of this, they would prefer that it just work as expected rather than have any need to push conditional checks up into the calling perl code.

Sure, if it is semantically reasonable from an app's POV for it to be a no-op on old DBus, that's fine too.

Daniel
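The two policies discussed in this sub-thread (a stub that raises an error versus a silent no-op when the underlying library lacks the call) can be contrasted in a few lines. The real change would be #ifdef code in DBus.xs; the Python below is only a language-neutral illustration, and every name in it is hypothetical.

```python
class UnsupportedError(RuntimeError):
    """Raised by the strict stub when the underlying API is missing."""


def make_binding(native_call, strict):
    """Wrap an optional native function; native_call is None when absent."""
    if native_call is not None:
        return native_call
    if strict:
        def stub(*args, **kwargs):
            # Policy 1: fail loudly so callers notice the missing feature.
            raise UnsupportedError("not supported by this libdbus version")
        return stub
    # Policy 2: silently do nothing, so callers need no conditional checks.
    return lambda *args, **kwargs: None
```

The no-op variant only makes sense when, as noted above, skipping the call is semantically harmless from the application's point of view.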
Re: [systemd-devel] [PATCH] journalctl: allow customizable output formats
On Wed, Oct 08, 2014 at 11:53:38PM +0200, Lennart Poettering wrote:
> On Mon, 22.09.14 16:33, Daniel P. Berrange (berra...@redhat.com) wrote:
> > The current '--output FORMAT' argument defines a number of common output formats, but there are some useful cases it does not cover. In particular when reading application logs it is often desirable to display the code file name, line number and function name. Rather than defining yet more fixed output formats, this patch introduces user defined output formats.
> >
> > The format string is an arbitrary string which contains a mixture of literal text and variable substitutions. Each variable name corresponds to a journal field name. A variable name can be optionally followed by a data type, and in the case of string types, a length limit.
>
> Hmm, hmm, hmm. I am quite afraid about inventing a new template language for this. I can see the usecase though, and I sympathize with it.
>
> I am particularly afraid of the entire type thing. The fact that the journal is more or less typeless is after all by design: I really didn't want to invent a new type system. Adding this to the formatter now kinda feels like adding it after all, but through the backdoor...
>
> So, I am not against this in general, but I'd really be careful with the language we define here, and try to make this as similar to an existing language (like the python/java one Zbigniew mentioned) as we can. Or even better, we already have a very limited formatting language in place, which is the specifier logic that can replace %i, %f and such things in unit files; maybe we can build on this, and allow specifiers to take a field name as parameter. Then, if we really need formatters for different field types, we could just give them high-level characters or so?
>
> Hmm, also, we already have a really bad formatter in place for the journal catalog files (which only replaces @foo@ by the value of field foo). We should probably use the same code for this new journalctl formatter and the catalog code. In fact the catalog formatter could really use some improvement...

Ok, I didn't know about the catalog files until now, so I'll investigate that and see what I can do about unifying the code for these two options. Do you consider the catalog file format to be part of the stable ABI? ie, do we need to preserve support for @foo@ if we took the %s{FOO} approach?

> Maybe something like this:
>
>   journalctl -O %t %s{CODE_FILE}:%s{CODE_LINE} %d{_SOURCE_REALTIME_TIMESTAMP}
>
> or something like that, where %t would simply map to the timestamp, and %s{} maps to a field name, and %d{} the same, but reformats the field as timestamp, assuming it is a UNIX timestamp formatted as number... Or something like that...

Yep, that would work for me. I'll cook up another patch to demonstrate that approach along with catalog support. I'm about to be travelling for KVM Forum / LinuxCon so probably won't get a chance to send an updated patch for a week or two.

Regards,
Daniel
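To make the proposed specifier syntax concrete, here is a rough sketch of an expander for %t, %s{FIELD} and %d{FIELD}. It is not the journalctl implementation (no such option was settled in this thread). It assumes that entries are plain dicts of strings, that %t maps to the __REALTIME_TIMESTAMP field, that timestamps are microseconds since the epoch, and that a fixed UTC date layout is acceptable.

```python
import re
from datetime import datetime, timezone

# %t, or %s{FIELD} / %d{FIELD}; anything else is left untouched.
_SPEC = re.compile(r"%(?:(t)|([sd])\{([A-Z0-9_]+)\})")


def _stamp(usec):
    """Format microseconds since the epoch as a UTC date string."""
    moment = datetime.fromtimestamp(int(usec) / 1_000_000, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


def expand(fmt, entry):
    """Expand the specifiers in fmt using the fields of one journal entry."""
    def replace(match):
        if match.group(1):  # %t: the entry's own timestamp
            return _stamp(entry["__REALTIME_TIMESTAMP"])
        kind, name = match.group(2), match.group(3)
        value = entry.get(name, "")
        if kind == "d" and value:  # %d{}: reformat the field as a date
            return _stamp(value)
        return value  # %s{}: the field as stored
    return _SPEC.sub(replace, fmt)
```

For example, `expand("%t %s{CODE_FILE}:%s{CODE_LINE}", entry)` would yield a date followed by the source location; missing fields expand to an empty string in this sketch.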
Re: [systemd-devel] Should user mode linux register with machined?
On Fri, Oct 10, 2014 at 06:44:03PM +0200, Lennart Poettering wrote:
> On Wed, 17.09.14 10:24, Richard Weinberger (richard.weinber...@gmail.com) wrote:
> > On Wed, Sep 17, 2014 at 1:09 AM, Zbigniew Jędrzejewski-Szmek zbys...@in.waw.pl wrote:
> > > On Tue, Sep 16, 2014 at 05:31:05PM +0200, Thomas Meyer wrote:
> > > > Hi, I wrote a small patch for user-mode linux to register with machined by calling CreateMachine. Is this a good idea to do so?
> > >
> > > Yes, this sounds useful. After all it is just another mechanism of virtualization, and in this case can be treated similarly to containers and vms.
> >
> > I still want a sane reason and a usecase for that. Can someone please educate me? :-) Please note that qemu does not register itself with systemd either. libvirt does. I think going down this path also makes sense for UML as libvirt has a UML driver too. qemu and the UML ELF image are the low level building blocks. Managers like libvirt should register the virtual machines created by LXC, UML, qemu, etc. with systemd.
>
> It's a bit more complex. While UML, qemu, kvm, currently don't, LXC, systemd-nspawn and libvirt-lxc all do talk directly to machined. (Note that LXC and libvirt-lxc are separate codebases, the latter is *not* a wrapper around the former).

Libvirt registers both LXC and QEMU/KVM guests with machined. We don't currently register UML guests with machined, but that is simply because UML isn't really a high priority target for people anymore and so hasn't been updated to use libvirt's cgroup/systemd integration support. From the libvirt POV I'd be happy to see patches to make it register with machined.

I'm not sure that standalone UML binaries need to directly integrate/register with systemd - I tend to view it as a job for whatever is managing UML to decide to do that.

Regards,
Daniel
Re: [systemd-devel] [PATCH] journalctl: allow customizable output formats
On Fri, Oct 03, 2014 at 02:13:51AM +0200, Zbigniew Jędrzejewski-Szmek wrote:
> On Mon, Sep 22, 2014 at 04:33:28PM +0100, Daniel P. Berrange wrote:
> > The current '--output FORMAT' argument defines a number of common output formats, but there are some useful cases it does not cover. In particular when reading application logs it is often desirable to display the code file name, line number and function name. Rather than defining yet more fixed output formats, this patch introduces user defined output formats.
>
> Hi, I think this makes sense. But I think that the format strings you propose are damn ugly :). Using %() for variables seems too heavy. Also, journal fields are all text, so I don't think that specifying the type is useful.

Well there are two virtual fields which are timestamps, which the existing hardcoded output modes convert into a date string in various ways. I want the format strings we define here to be able to express the semantics of the current hardcoded output modes, so this necessitates a way to ask for various date formats.

Also, although the physically stored journal fields are strings per the journal API storage backend, they can be simple string versions of other data types. eg an application defined journal field could be used to store an integer, floating point, boolean, etc. It would be natural for the app to use many decimal places if storing a floating point value in the journal, so being able to give data types in the output mode lets us alter the precision displayed when extracting it again. Of course my patch didn't try to do this, it only dealt with dates.

> Maybe we could adopt the {} format from Java and Python, as implemented in Python [1]. It has a fairly rich and consistent field formatting language. We would care only about the part relevant to strings, at least in the beginning.

I'll see what I can cook up along these lines, but the existing python language is focused on C data types and doesn't directly provide types for the various date formats to support, so we can't use it 100% as-is.

Regards,
Daniel
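As a rough illustration of the {} approach, Python's own format mini-language can already truncate string fields ("{MESSAGE:.20}") and, once a value has been converted to a datetime, apply a strftime layout. The sketch below shows that conversion step, which is exactly the part the plain string syntax does not cover. TIMESTAMP is a made-up derived key, and microsecond timestamps are an assumption.

```python
from datetime import datetime, timezone


def render(template, entry):
    """Render one journal entry (a dict of strings) with str.format syntax."""
    fields = dict(entry)
    usec = int(entry["__REALTIME_TIMESTAMP"])
    # Strings carry no date type, so the timestamp is converted up front.
    fields["TIMESTAMP"] = datetime.fromtimestamp(usec / 1_000_000, tz=timezone.utc)
    return template.format_map(fields)
```

A template such as `"{TIMESTAMP:%Y-%m-%d %H:%M:%S} [{CODE_FILE}:{CODE_LINE}:{CODE_FUNC}] {MESSAGE:.20}"` then gives the date, the source location and the message cut to 20 characters.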
Re: [systemd-devel] [PATCH] journalctl: allow customizable output formats
On Mon, Sep 22, 2014 at 12:43:28PM -0400, Daurnimator wrote:
> On 22 September 2014 11:33, Daniel P. Berrange berra...@redhat.com wrote:
> > The current '--output FORMAT' argument defines a number of common output formats, but there are some useful cases it does not cover. In particular when reading application logs it is often desirable to display the code file name, line number and function name. Rather than defining yet more fixed output formats, this patch introduces user defined output formats.
> >
> > The format string is an arbitrary string which contains a mixture of literal text and variable substitutions. Each variable name corresponds to a journal field name. A variable name can be optionally followed by a data type, and in the case of string types, a length limit.
>
> As an opposing point of view, I've been accomplishing this by piping output through a script that parses and displays JSON. I prefer this style of composability to passing format strings to journalctl itself.

Sure you could do that, but it is really madness to expect anyone who just wants to display a slightly different set of fields to write a script to parse JSON and re-write it. When I have end users doing troubleshooting of libvirt for bug reports, I want to be able to just tell them to run a direct journalctl command to collect the data I need, not have to write or download some extra script to parse JSON.

Regards,
Daniel
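For comparison, the JSON-piping approach mentioned above might look something like this. It is a sketch, not Daurnimator's actual script: it relies on `journalctl -o json` emitting one JSON object per line, and it assumes MESSAGE is a plain string (journalctl can emit other encodings for binary data).

```python
import json
import sys


def format_entry(line):
    """Turn one line of `journalctl -o json` output into a display line."""
    entry = json.loads(line)
    location = "{}:{}:{}".format(
        entry.get("CODE_FILE", "?"),
        entry.get("CODE_LINE", "?"),
        entry.get("CODE_FUNC", "?"),
    )
    return "[{}] {}".format(location, entry.get("MESSAGE", ""))


def main(stream=sys.stdin):
    """Filter a stream of JSON journal entries, one per line."""
    for line in stream:
        if line.strip():
            print(format_entry(line))
```

Saved as a script with a final `main()` call, it would be used as `journalctl -o json _COMM=libvirtd | python3 script.py`. That extra file is precisely the overhead the reply above objects to for end-user troubleshooting.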
Re: [systemd-devel] Cannot get Shutdown Script to Run (Libvirt Virtual Machine Shutdown)
On Sun, Sep 21, 2014 at 11:40:03PM -0400, Alexander Groleau wrote:
> Hello systemd users,
>
> I have been trying desperately for weeks to get my simple shutdown script for a Libvirt guest to run before libvirtd is shut down, without success. Essentially, I need the libvirt-windows.sh script to run before the libvirtd service is terminated (which occurs right after systemd-logind outputs its reboot message). How can I get my script into this initial section of daemon shutdowns, at the top?

Any reason you've created your own shutdown script instead of using the libvirt-guests.service script that libvirt includes?

To get the ordering right, we have a number of rules:

 - libvirtd.service contains Before=libvirt-guests.service
 - libvirt-guests.service contains After=libvirtd.service
 - The guest scope units contain After=libvirtd.service and Before=libvirt-guests.service

It was the two rules against the .scope units that we found to be the key part to making shutdown work, whereby guests get stopped gracefully before the libvirtd daemon is stopped.

The .scope units do not have any file on disk, they are generated on the fly as libvirt talks to systemd-machined, so you've no way to alter them to work with your custom shutdown script. Thus if you are not using the standard libvirt-guests.service, then you should at least use the name libvirt-guests.service for your own custom service.

Regards,
Daniel
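The effect of the three ordering rules listed above can be worked out mechanically, since systemd stops units in the inverse of their start-up order. The sketch below models the rules as a dependency graph; "machine-guest.scope" is a made-up stand-in for the generated scope unit names, and graphlib needs Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# unit -> units it is ordered After=. A Before=X line on unit U is the
# same edge as an After=U line on X, so both spellings collapse here.
AFTER = {
    "libvirtd.service": set(),
    "machine-guest.scope": {"libvirtd.service"},
    "libvirt-guests.service": {"libvirtd.service", "machine-guest.scope"},
}


def start_order(after):
    """Units in an order that satisfies every After= constraint."""
    return list(TopologicalSorter(after).static_order())


def stop_order(after):
    """Shutdown applies the inverse of the start-up order."""
    return list(reversed(start_order(after)))
```

The resulting stop order is libvirt-guests.service first (so its stop action can shut the guests down cleanly), then the guest scopes, then libvirtd.service. A custom service only slots into this chain if it carries the name the scope units already reference.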
[systemd-devel] [PATCH] journalctl: allow customizable output formats
The current '--output FORMAT' argument defines a number of common output formats, but there are some useful cases it does not cover. In particular when reading application logs it is often desirable to display the code file name, line number and function name. Rather than defining yet more fixed output formats, this patch introduces user defined output formats.

The format string is an arbitrary string which contains a mixture of literal text and variable substitutions. Each variable name corresponds to a journal field name. A variable name can be optionally followed by a data type, and in the case of string types, a length limit.

This is best illustrated with an example:

  $ journalctl -o format:%(__REALTIME_TIMESTAMP) \ [%(CODE_FILE):%(CODE_LINE):%(CODE_FUNC)] \ %(MESSAGE:string:80)\n _COMM=libvirtd
  -- Logs begin at Mon 2013-12-23 16:31:41 GMT, end at Mon 2014-09-22 16:13:00 BST. --
  Dec 23 17:19:25 [util/virlog.c:877:virLogVMessage] libvirt version: 1.1.3.1, package: 2.fc20 (Fedora Project, 2013-11-17-23:28:43, ...
  Dec 23 17:19:25 [conf/storage_conf.c:854:virStoragePoolDefParseXML] XML error: unknown storage pool type btrfs
  Dec 23 17:19:30 [conf/domain_conf.c:12671:virDomainObjParseNode] XML error: unexpected root element domain, expecting domstatus
  Dec 23 17:24:45 [qemu/qemu_monitor.c:653:qemuMonitorIO] internal error: End of file from monitor
  Dec 23 20:12:00 [qemu/qemu_monitor.c:653:qemuMonitorIO] internal error: End of file from monitor
  -- Reboot --
  Dec 23 21:06:14 [util/virlog.c:877:virLogVMessage] libvirt version: 1.1.3.1, package: 2.fc20 (Fedora Project, 2013-11-17-23:28:43, ...
  Dec 23 21:06:21 [conf/storage_conf.c:854:virStoragePoolDefParseXML] XML error: unknown storage pool type btrfs

Signed-off-by: Daniel P. Berrange <berra...@redhat.com>
---
 man/journalctl.xml                    |  76 +
 src/journal-remote/journal-gatewayd.c |  11 +-
 src/journal/journalctl.c              |  39 ++-
 src/shared/logs-show.c                | 532 ++
 src/shared/logs-show.h                |  16 +-
 src/shared/output-mode.h              |   1 +
 src/systemctl/systemctl.c             |  20 +-
 7 files changed, 615 insertions(+), 80 deletions(-)

diff --git a/man/journalctl.xml b/man/journalctl.xml
index acd75a6..bd8c2bd 100644
--- a/man/journalctl.xml
+++ b/man/journalctl.xml
@@ -375,6 +375,21 @@
 </para>
 </listitem>
 </varlistentry>
+
+<varlistentry>
+<term>
+ <option>format:FMT</option>
+</term>
+<listitem>
+<para>generates output
+according to the format
+specification given in
+the FMT string. See the
+OUTPUT FORMAT STRINGS
+section for details
+</para>
+</listitem>
+</varlistentry>
 </variablelist>
 </listitem>
 </varlistentry>
@@ -878,6 +893,64 @@
 </refsect1>
 <refsect1>
+<title>Output Format Strings</title>
+
+<para>An output format string provides precise control how journal
+data records are formatted for output. A format string consists of
+mixture of literal text and variables to be substituted with journal
+data records. A variable takes the general form</para>
+
+<programlisting>$(NAME:TYPE:LEN)</programlisting>
+
+<para>The NAME component corresponds to any journal entry field
+(eg MESSAGE, _SYSTEMD_UNIT, CODE_FUNC, etc). The TYPE component
+determines the data format to use for printing the value. If
+omitted, it defaults to a sensible format for the NAME of the
+field. The LEN component places an upper limit on the length of
+strings being printed, beyond which they will be ellipsized.
+The valid data types for TYPE are:</para>
+
+<variablelist>
+<varlistentry>
+<term>string</term>
+<listitem><para>displayed if a printable string. If the value
+contains non-printable characters
Re: [systemd-devel] Delaying (SSH) key generation until the urandom pool is initialized
On Tue, Apr 29, 2014 at 08:43:38PM +0200, Florian Weimer wrote:
> The message at https://mail.gnome.org/archives/ostree-list/2014-February/msg00010.html contains two boot traces from virtual machines which show that the SSH key is generated before the kernel pool is sufficiently seeded.

I'm wondering if the VMs that ostree is creating are being given a virtio-rng device? If not, that would probably be a good thing to enable, to allow them to get entropy. VMs are generally starved of entropy even beyond the initial boot up stage, so a virtual RNG is generally useful.

Regards,
Daniel
Re: [systemd-devel] Delaying (SSH) key generation until the urandom pool is initialized
On Wed, Apr 30, 2014 at 02:10:56PM +0200, Florian Weimer wrote:
> On 04/30/2014 01:14 PM, Daniel P. Berrange wrote:
> > On Tue, Apr 29, 2014 at 08:43:38PM +0200, Florian Weimer wrote:
> > > The message at https://mail.gnome.org/archives/ostree-list/2014-February/msg00010.html contains two boot traces from virtual machines which show that the SSH key is generated before the kernel pool is sufficiently seeded.
> >
> > I'm wondering if the VMs that ostree is creating are being given a virtio-rng device? If not, that would probably be a good thing to enable, to allow them to get entropy. VMs are generally starved of entropy even beyond the initial boot up stage, so a virtual RNG is generally useful.
>
> Interesting suggestion. I just used virt-manager to create the VM. I don't see any trace of rng or random in the domain XML file. If it is supported, I think it should be enabled by default.

I'm told that it isn't turned on by default, but you can add it to a VM post-install. Since it feeds VMs from the host's /dev/random or /dev/hwrng, there was a question mark as to whether it was right to enable it by default or not, and if so what kind of rate limiting might be wanted by default.

Regards,
Daniel
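On the guest side, the thread's underlying question (how to avoid generating keys before the pool is seeded) can be illustrated with the getrandom() system call, which blocks until the kernel's urandom pool has been initialised. Note the assumptions: getrandom() was only added to Linux in 3.17, after this exchange, and the helper name below is made up. It is a sketch of the idea, not what sshd or systemd do.

```python
import os


def seed_material(nbytes=32):
    """Return random bytes, waiting for an initialised pool where possible."""
    if hasattr(os, "getrandom"):
        # Blocks until the urandom pool has been initialised once.
        return os.getrandom(nbytes)
    # Fallback for platforms without getrandom(); no such guarantee here.
    return os.urandom(nbytes)
```

A key generator calling something like this first would simply wait on an entropy-starved VM rather than proceed with a weak seed. A virtio-rng device, as suggested above, shortens that wait.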
Re: [systemd-devel] [systemd][cgroup in container] problem with cgroup hierarchy in container
On Thu, Mar 06, 2014 at 07:54:05PM +0100, Lennart Poettering wrote:
> On Thu, 06.03.14 16:55, Dariusz Michaluk (d.micha...@samsung.com) wrote:
> > On 05.03.2014 19:16, Lennart Poettering wrote:
> > > nspawn and libvirt-lxc mostly follow the same code paths and register via machined... So it's weird that different things happen. Somehow the systemd instance inside the container must be confused about the cgroup it is running in...
> >
> > Next few cents. I noticed that when I run a libvirt-lxc container I get the warning "Failed to install release agent, ignoring: No such file or directory", which does not occur when I use nspawn.
>
> Oh! Hmm, that suggests that libvirt-lxc might not mount the naked cgroupfs tree to /sys/fs/cgroup/systemd, but only a subdirectory. This of course might cause the weird setup that the host tree is duplicated for the container!
>
> Unfortunately it is not possible to only mount a subtree of the cgroup hierarchy into the container, since then the data from /proc/self/cgroup won't match /sys/fs/cgroup/systemd anymore... Also, the root of the cgroup trees has slightly different semantics and more properties than the children.
>
> Is this the default setup of libvirt-lxc for those dirs? I figure we should talk to Daniel to get that changed...

Yeah, it was set up that way a while ago, but I forgot this would invalidate the /proc/self/cgroup information. It was a bit of a poor man's attempt at securing cgroups, but really it is just a waste of time unless user namespaces are available. Can someone file a bug against libvirt for this and we'll look at not doing this.

> Each container really needs to see the full tree. The best thing possible to make sure that the containers can't muck with anything outside of the tree is to mount the upper parts read-only with a bind mount, but other than that I don't see that we could do anything there...

User namespaces are the best bet here. Once the root UID is remapped, containers won't be able to move themselves out of their subtree.

Regards,
Daniel
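The mismatch described above can be made concrete: the path a process reads from /proc/self/cgroup is relative to the root of the hierarchy, so it only resolves under /sys/fs/cgroup/systemd if the whole tree is mounted there. The sketch below assumes the cgroup v1 line format "hierarchy-ID:controller-list:path" and uses made-up helper names.

```python
import posixpath


def systemd_cgroup_path(proc_cgroup_text):
    """Extract the name=systemd hierarchy path from /proc/<pid>/cgroup text."""
    for line in proc_cgroup_text.splitlines():
        parts = line.split(":", 2)
        if len(parts) == 3 and "name=systemd" in parts[1].split(","):
            return parts[2]
    return None


def expected_dir(proc_cgroup_text, mount_point="/sys/fs/cgroup/systemd"):
    """Directory where systemd expects to find its own cgroup."""
    path = systemd_cgroup_path(proc_cgroup_text)
    if path is None:
        return None
    return posixpath.normpath(mount_point + "/" + path)
```

If a container manager mounts only the container's own subtree at the mount point, the directory computed this way does not exist inside the container, which fits the kind of "No such file or directory" failure reported above.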
Re: [systemd-devel] Build warnings for ARM due to -Wcast-align
On Thu, Feb 20, 2014 at 05:21:22PM +0100, Lennart Poettering wrote:
> On Thu, 20.02.14 17:03, Daniel Mack (dan...@zonque.org) wrote:
> > Hi,
> >
> > When cross-compiling the current git HEAD for ARM using gcc 4.8.2, I see ~160 warnings similar to this one:
> >
> >   src/core/unit.c: In function 'unit_get_exec_runtime':
> >   src/core/unit.c:2851:17: warning: cast increases required alignment of target type [-Wcast-align]
> >     return *(ExecRuntime**) ((uint8_t*) u + offset);
> >     ^
> >
> > The full build log is here: http://paste.fedoraproject.org/78944/92912005
> >
> > Unaligned memory access is indeed unsupported by some older instruction cores. The kernel can fix up in situations where such unaligned access occurs, but that's of course expensive and slow.
> >
> > However, systemd does not actually do unaligned memory access at runtime (at least I haven't seen any when booting up PXA3xx hardware). The warning is simply about the type of pointer arithmetic that casts to and from uint8_t*.
> >
> > And because it's practically impossible to fix the things the compiler complains about here anyway, I propose removing -Wcast-align from the CFLAGS in configure.ac.
> >
> > Any opinions?
>
> I am fine with that. I am personally only running things on x86, so it never showed up for me. The usual solution for cast issues is to use some union-based type conversion, but in the case above this is not really nicely possible. Hence, let's drop it, unless somebody has a better solution...

I think cast align warnings are fairly useful, since many of the things they show turn out to be genuine bugs, so it is not entirely desirable to disable them altogether. In libvirt we just mark the few cases which are false positives with a pragma

  #define VIR_WARNINGS_NO_CAST_ALIGN \
      _Pragma ("GCC diagnostic push") \
      _Pragma ("GCC diagnostic ignored \"-Wcast-align\"")

  #define VIR_WARNINGS_RESET \
      _Pragma ("GCC diagnostic pop")

And then just mark it thus

  VIR_WARNINGS_NO_CAST_ALIGN
  ...code with false positive
  VIR_WARNINGS_RESET

Regards,
Daniel
Re: [systemd-devel] Howto run systemd within a linux container
On Wed, Feb 05, 2014 at 11:44:33PM +0100, Richard Weinberger wrote:
> Hi!
>
> We're heavily using Linux containers in our production environment.
> As modern Linux distributions move forward to systemd, we have to make
> sure that systemd works within our containers. Sadly we're facing
> issues with cgroups.
>
> Our testbed consists of openSUSE 13.1 with Linux 3.13.1 and libvirt
> 1.2.1. In a plain setup systemd stops immediately because it is unable
> to create the cgroup hierarchy, mostly because the container uid 0 is
> in a user namespace and has no rights to do that.

FYI I have successfully run Fedora 19 with systemd inside a container
with libvirt LXC; however, I did *not* enable user namespaces. Every time
I try user namespaces I find some other bug in either the kernel or
libvirt, so I wouldn't be surprised if yet more breakage has occurred in
user namespaces :-(

> Next try, fool systemd by mounting a tmpfs to /sys/fs/cgroup/systemd/.
> This seems to work. openSUSE boots, I can start/stop services...
> Shutdown hangs forever, had no time to investigate so far.
>
> But is this tmpfs hack the correct way to run systemd in a container?
> I really don't think so.

Yeah, that really shouldn't be needed. When libvirt runs a container it
creates a cgroup just for that container to run in, and systemd should be
able to create its hierarchy under that location.

That said, I wonder if libvirt is perhaps forgetting to chown() the
cgroup to the UID/GID you've mapped for the root user. That would
certainly prevent systemd from using it and could cause the sort of pain
you see.

Daniel
Re: [systemd-devel] Howto run systemd within a linux container
On Thu, Feb 06, 2014 at 04:33:22PM +0100, Greg KH wrote:
> On Thu, Feb 06, 2014 at 10:55:01AM +, Daniel P. Berrange wrote:
> > On Wed, Feb 05, 2014 at 11:44:33PM +0100, Richard Weinberger wrote:
> > > Hi!
> > >
> > > We're heavily using Linux containers in our production environment.
> > > As modern Linux distributions move forward to systemd, we have to
> > > make sure that systemd works within our containers. Sadly we're
> > > facing issues with cgroups.
> > >
> > > Our testbed consists of openSUSE 13.1 with Linux 3.13.1 and libvirt
> > > 1.2.1. In a plain setup systemd stops immediately because it is
> > > unable to create the cgroup hierarchy, mostly because the container
> > > uid 0 is in a user namespace and has no rights to do that.
> >
> > FYI I have successfully run Fedora 19 with systemd inside a container
> > with libvirt LXC; however, I did *not* enable user namespaces. Every
> > time I try user namespaces I find some other bug in either the kernel
> > or libvirt, so I wouldn't be surprised if yet more breakage has
> > occurred in user namespaces :-(
>
> Those bugs should now be fixed. If you don't enable the option, how are
> we supposed to know what is left to be done? :)

I have in fact been building my own kernels for Fedora with user
namespaces enabled to debug / test this and have reported all the bugs I
have found so far. I'm just saying that, given the track record of bugs
since the userns code was first merged, I wouldn't be surprised if there
were still more things to iron out as we try more real-world apps like
systemd.

Regards,
Daniel
Re: [systemd-devel] consider renaming -.slice
On Thu, Jan 16, 2014 at 11:27:42AM +0100, Holger Schurig wrote:
> Oh, I confused that with the old /etc/systemd/systemd-journald.conf
> file, which was renamed.
>
> Kay, I only meant to special-case the "/", e.g. leave
> home-kay-data.mount as it is, but rename "-" to "root", so that it is
> root.mount and root.slice.

This would still break applications like libvirt which expect the current
naming conventions. These naming conventions must be considered part of
the stable API for apps, and so cannot be changed once included in an
official release.

Daniel
Re: [systemd-devel] getty : how to run getty on every ttyX
On Fri, Dec 13, 2013 at 05:20:19PM +0100, Lennart Poettering wrote:
> On Fri, 13.12.13 16:15, Lennart Poettering (lenn...@poettering.net) wrote:
> > > We had discussed this back at Linux Plumbers last year, and at the
> > > time you had suggested that rather than create /dev/ttyN symlinks
> > > we should instead do something like /dev/containerttyN, and set a
> > > 'container_tty' variable containing a list of all those device
> > > names so that systemd can discover them sensibly. We never got
> > > around to doing this from the libvirt side, and AFAIK systemd
> > > hasn't done anything on its side either. So is this still a
> > > suitable way forward?
> >
> > Yeah, I am pretty sure that's what we should do. I figure I should
> > hack that up. I'll work on it now.
>
> Committed.
>
> systemd-getty-generator will now look for $container_ttys set as an
> environment variable for PID 1. If that is set it will split the string
> up on whitespace and start a getty on all ptys referenced. Note that
> this only supports ptys, not any other ttys.
>
> Example:
>
>   container_ttys="pts/5 pts/8 pts/15"
>
> when passed to PID 1 will spawn three additional gettys on ptys 5, 8
> and 15.
>
> Note that this *really* only supports ptys, not any other kinds of
> ttys, since for those we require proper device enumeration and
> notification and we don't have those in containers... I still chose to
> name this $container_ttys rather than $container_ptys, so that maybe
> one day we can extend it should devices like this ever get virtualized.
>
> This will be in systemd 209.

I've tested this with libvirt and it worked except for one small edge
case.

Say libvirt creates 3 consoles: /dev/pts/0, /dev/pts/1 and /dev/pts/2.
Now we set

  container_ttys="pts/0 pts/1 pts/2"

Systemd starts up 3 agetty processes, one for each of these. The
/dev/console device, however, is also a link to /dev/pts/0, and so
systemd starts up an agetty process for that too. Now we have 2 agetty
processes fighting over /dev/pts/0, which ends in tears.

Is this something that systemd should detect and cope with, or should we
document that the 'container_ttys' env *must exclude* any tty associated
with the /dev/console device?

Regards,
Daniel
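The splitting behaviour described in the quoted commit message can be sketched as follows. This is only an illustration in Python of "split on whitespace, accept ptys only"; the function name is hypothetical, and silently dropping non-pty entries is an assumption rather than a description of what systemd-getty-generator actually does with them.

```python
import re

# Only names of the form pts/N are treated as usable (assumption: anything
# else is dropped, since the thread says only ptys are supported).
_PTY_RE = re.compile(r"^pts/\d+$")


def parse_container_ttys(value):
    """Split a $container_ttys value on whitespace and keep only pts/N entries."""
    return [name for name in value.split() if _PTY_RE.match(name)]


if __name__ == "__main__":
    # The example value from the thread: three ptys, hence three gettys.
    print(parse_container_ttys("pts/5 pts/8 pts/15"))
```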
Re: [systemd-devel] getty : how to run getty on every ttyX
On Mon, Dec 16, 2013 at 05:33:12PM +0100, Lennart Poettering wrote:
> On Mon, 16.12.13 12:03, Daniel P. Berrange (berra...@redhat.com) wrote:
> > > Note that this *really* only supports ptys, not any other kinds of
> > > ttys, since for those we require proper device enumeration and
> > > notification and we don't have those in containers... I still chose
> > > to name this $container_ttys rather than $container_ptys, so that
> > > maybe one day we can extend it should devices like this ever get
> > > virtualized.
> > >
> > > This will be in systemd 209.
> >
> > I've tested this with libvirt and it worked except for one small edge
> > case.
> >
> > Say libvirt creates 3 consoles: /dev/pts/0, /dev/pts/1 and
> > /dev/pts/2. Now we set
> >
> >   container_ttys="pts/0 pts/1 pts/2"
> >
> > Systemd starts up 3 agetty processes, one for each of these. The
> > /dev/console device, however, is also a link to /dev/pts/0, and so
> > systemd starts up an agetty process for that too. Now we have 2
> > agetty processes fighting over /dev/pts/0, which ends in tears.
> >
> > Is this something that systemd should detect and cope with, or should
> > we document that the 'container_ttys' env *must exclude* any tty
> > associated with the /dev/console device?
>
> I am tempted to say that we should do the latter. It's quite difficult
> to figure out when they point to the same device (for example, because
> people use a bind mount rather than a symlink), and the roles of the
> console and the other $container_ttys are quite different during boot
> if we want to avoid printing logs over the getty and so on...
>
> I added this to the wiki text now.

OK, sounds good. I'll update libvirt to take account of this.

Daniel
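Since the container manager knows which pty it allocated for /dev/console, the exclusion agreed above can be done when the variable is built, with no need to detect symlinks or bind mounts inside the container. A minimal sketch in Python, with a hypothetical function name and no claim to match libvirt's actual implementation:

```python
def build_container_ttys(console_pty, allocated_ptys):
    """Return a $container_ttys value that omits the pty backing /dev/console.

    console_pty    -- name of the pty the manager linked to /dev/console,
                      e.g. "pts/0"
    allocated_ptys -- every pty the manager allocated for the container
    """
    extra = [pty for pty in allocated_ptys if pty != console_pty]
    return " ".join(extra)


if __name__ == "__main__":
    # With the three consoles from the thread, only pts/1 and pts/2 are
    # advertised, so nothing competes with the getty on /dev/console.
    print(build_container_ttys("pts/0", ["pts/0", "pts/1", "pts/2"]))
```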
Re: [systemd-devel] getty : how to run getty on every ttyX
On Fri, Dec 13, 2013 at 04:06:56PM +0100, Lennart Poettering wrote:
> On Fri, 13.12.13 16:34, Gao feng (gaof...@cn.fujitsu.com) wrote:
> > > > As we know, systemd only forks getty on ttyX when we press
> > > > ctrl + alt + FX.
> > > >
> > > > I would like to let systemd fork several gettys on all of the tty
> > > > devices by default. This is very useful in a container
> > > > environment, since we can't use ctrl+alt+FX to trigger getty in a
> > > > container.
> > >
> > > ...do containers even have such devices?
> >
> > pts device ;)
>
> This will not work. Unlike VT ttys, which exist continuously and
> perpetually on a system, ptys only exist when an application allocates
> them, like for example xterm or ssh. However, for them it's xterm's or
> ssh's job to spawn a shell as backend.
>
> > > Anyway, just enable more instances of getty@.service for all
> > > devices you need, just like getty@tty1.service is started by
> > > default. The autostart that you mention is part of logind and all
> > > it does is just start the same services via systemd, no magic.
> >
> > getty@tty1.service under /etc/systemd/system/getty.target.wants/ is
> > linked to /usr/lib/systemd/system/getty@.service, so I create
> > getty@tty2.service which links to
> > /usr/lib/systemd/system/getty@.service too. Is this right? In libvirt
> > lxc, the ttyX actually is a pts device.
> >
> >   [root@localhost getty.target.wants]# ll
> >   total 0
> >   lrwxrwxrwx 1 root root 38 Dec 13 02:49 getty@tty1.service -> /usr/lib/systemd/system/getty@.service
> >   lrwxrwxrwx 1 root root 38 Dec 13 03:22 getty@tty2.service -> /usr/lib/systemd/system/getty@.service
> >
> > It seems like in my container, agetty listens on /dev/console, not
> > tty1 or tty2:
> >
> >   /sbin/agetty --noclear --keep-baud console 115200 38400 9600
> >
> > It seems getty-generator does the extra job.
>
> /dev/tty1, /dev/tty2, ... make no sense in containers as there is no
> virtual console.

For each console device that you list in the container configuration with
libvirt, it will allocate a /dev/pts/NNN device, and add a symlink from
/dev/ttyN to the /dev/pts/NNN slave. What we'd like is for the guest OS
to be able to set up agetty processes automatically on any console device
libvirt has configured for the container.

We had discussed this back at Linux Plumbers last year, and at the time
you had suggested that rather than create /dev/ttyN symlinks we should
instead do something like /dev/containerttyN, and set a 'container_tty'
variable containing a list of all those device names so that systemd can
discover them sensibly. We never got around to doing this from the
libvirt side, and AFAIK systemd hasn't done anything on its side either.
So is this still a suitable way forward?

Regards,
Daniel
Re: [systemd-devel] getty : how to run getty on every ttyX
On Fri, Dec 13, 2013 at 05:20:19PM +0100, Lennart Poettering wrote:
> On Fri, 13.12.13 16:15, Lennart Poettering (lenn...@poettering.net) wrote:
> > > We had discussed this back at Linux Plumbers last year, and at the
> > > time you had suggested that rather than create /dev/ttyN symlinks
> > > we should instead do something like /dev/containerttyN, and set a
> > > 'container_tty' variable containing a list of all those device
> > > names so that systemd can discover them sensibly. We never got
> > > around to doing this from the libvirt side, and AFAIK systemd
> > > hasn't done anything on its side either. So is this still a
> > > suitable way forward?
> >
> > Yeah, I am pretty sure that's what we should do. I figure I should
> > hack that up. I'll work on it now.
>
> Committed.
>
> systemd-getty-generator will now look for $container_ttys set as an
> environment variable for PID 1. If that is set it will split the string
> up on whitespace and start a getty on all ptys referenced. Note that
> this only supports ptys, not any other ttys.
>
> Example:
>
>   container_ttys="pts/5 pts/8 pts/15"
>
> when passed to PID 1 will spawn three additional gettys on ptys 5, 8
> and 15.
>
> Note that this *really* only supports ptys, not any other kinds of
> ttys, since for those we require proper device enumeration and
> notification and we don't have those in containers... I still chose to
> name this $container_ttys rather than $container_ptys, so that maybe
> one day we can extend it should devices like this ever get virtualized.
>
> This will be in systemd 209.

Great, that all sounds good to me.

Daniel
Re: [systemd-devel] machines get killed when scopes are destroyed
On Mon, Nov 18, 2013 at 03:03:18AM +0100, Zbigniew Jędrzejewski-Szmek wrote:
> v0lZy reported on IRC that his qemu machines get killed when shutting
> down the host. libvirt-guests.service is designed to suspend them
> during shutdown, but when it was run, the guests were all already dead.
>
> And indeed, each qemu is running inside a scope, which is not connected
> by any dependencies to either systemd-machine.service or
> libvirt-guests.service. libvirt-guests.service does not depend on
> systemd-machine.service either. This means that when shutdown is
> ordered, the scopes will be stopped in parallel with
> libvirt-guests.service and, depending on timing, the qemus will just be
> killed with SIGTERM.
>
> For this whole thing to work correctly, we need to ensure that scopes
> are not terminated prematurely. If we introduced a target like
> libvirt-ready.target, and made libvirt-guests.service be
> After=libvirt-ready.target, and made all the scopes be
> Before=libvirt-ready.target, I think the VMs would have a chance to
> shut down properly. But that's pretty complicated. And I'm not even
> sure how to do that properly.
>
> Any better ideas?

I don't have an answer for you, but just want to ask that you file a bug
against libvirt for this problem. This is an unintended regression in
libvirt functionality caused by the switch to systemd scopes.

  http://libvirt.org/bugs.html

Regards,
Daniel
Re: [systemd-devel] [PATCH] netns: unix: only allow to find out unix socket in same net namespace
On Wed, Aug 21, 2013 at 11:51:53AM +0200, Kay Sievers wrote:
> On Wed, Aug 21, 2013 at 9:22 AM, Gao feng gaof...@cn.fujitsu.com wrote:
> > On 08/21/2013 03:06 PM, Eric W. Biederman wrote:
> > > I suspect libvirt should simply not share /run or any other
> > > normally writable directory with the host. Sharing /run, /var/run
> > > or even /tmp seems extremely dubious if you want some kind of
> > > containment without strange things spilling through.
>
> Right, /run or /var cannot be shared. It's not only about sockets, many
> other things will also go really wrong that way.

Libvirt already allows the app defining the container config to set
private mounts for any directory, including /run and /var.

If an admin or app wants to run systemd inside a container, it is their
responsibility to ensure they set up the filesystem in a suitable manner.
Libvirt is not going to enforce use of a private /run or /var, since
that's a policy decision for a specific use case.

Daniel
Re: [systemd-devel] udev within a container
On Fri, Jul 26, 2013 at 05:16:16PM +0200, Kay Sievers wrote:
> On Fri, Jul 26, 2013 at 5:09 PM, Rob Spanton rspan...@zepler.net wrote:
> > I would like to run some processes inside a container that interact
> > with udev. Ideally udev would be within the same container as those
> > processes, as then I could also have udev rules that started other
> > things within that container too...
> >
> > However, as far as I can tell, it's not possible to run udev within a
> > container -- is this correct? Is there a magical workaround that I
> > haven't found!?
>
> There is no real support to run udev inside a container. It might be
> possible to hack around that, but it's nothing that seems convincing so
> far. We just run containers without any udev setup, but with a minimal
> pre-setup /dev.

Furthermore, in the absence of any devices namespace in the kernel, it
would be a security flaw to allow a process inside the container (whether
udevd or something else) the permission to mknod. So if you want any
security, you must always pre-populate /dev and remove CAP_MKNOD instead
of running udev.

Daniel
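As an illustration of "pre-populate /dev", the sketch below creates a handful of character devices in a container's /dev from the host side, before the container starts and before CAP_MKNOD is dropped. The major/minor numbers are the standard Linux ones, but the choice of which nodes count as "minimal" is an assumption made here for the example, not a list taken from libvirt or systemd; the function name is hypothetical too.

```python
import os
import stat

# (name, major, minor, mode) for a few standard Linux character devices.
# Which nodes a given container really needs is an assumption of this sketch.
MINIMAL_DEV = [
    ("null", 1, 3, 0o666),
    ("zero", 1, 5, 0o666),
    ("full", 1, 7, 0o666),
    ("random", 1, 8, 0o666),
    ("urandom", 1, 9, 0o666),
    ("tty", 5, 0, 0o666),
]


def populate_dev(dev_dir, dry_run=True):
    """Create the nodes listed in MINIMAL_DEV under dev_dir.

    With dry_run=True nothing is created and the planned
    (path, major, minor) tuples are returned for inspection.
    Creating the nodes for real requires root (CAP_MKNOD) on the host.
    """
    planned = []
    for name, major, minor, mode in MINIMAL_DEV:
        path = os.path.join(dev_dir, name)
        planned.append((path, major, minor))
        if not dry_run:
            os.mknod(path, mode | stat.S_IFCHR, os.makedev(major, minor))
            os.chmod(path, mode)  # mknod is subject to the umask, so fix up
    return planned


if __name__ == "__main__":
    for entry in populate_dev("/var/lib/example-container/dev"):
        print(entry)
```

The container itself then never needs mknod permission, which is the point made in the reply above.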
[systemd-devel] Error handling problems with systemd-machined
I'm working on integrating libvirt with systemd-machined for cgroups setup and hitting a number of problems The first was that v205 ignores all parameters passed though as scope properties in the DBus CreateMachine call. So I upgraded to v206 which seems to have fixed that. When something goes wrong with the CreateMachine DBus call though all I ever seem to get back is Input/output error. After strace'ing systemd-machined I find the real error recvmsg(5, {msg_name(0)=NULL, msg_iov(1)=[{l\1\0\1\334\0\0\0\2\0\0\0\277\0\0\0\1\1o\0\31\0\0\0/org/freedesktop/machine1\0\0\0\0\0\0\0\6\1s\0\30\0\0\0org.freedesktop.machine1\0\0\0\0\0\0\0\0\2\1s\0 \0\0\0org.freedesktop.machine1.Manager\0\0\0\0\0\0\0\0\3\1s\0\r\0\0\0CreateMachine\0\0\0\10\1g\0\fsayssusa(sv)\0\0\0\0\0\0\0\7\1s\0\6\0\0\0:1.130\0\0\t\0\0\0lxc-busy2\0\0\0\20\0\0\0\335\247\271G\10F\27Y(s\0177]\367\327\353\v\0\0\0libvirt-lxc\0\t\0\0\0container\0\0\0\210:\0\0\0\0\0\0\0\0\0\0\204\0\0\0\0\0\0\0\5\0\0\0Slice\0\1s\0\0\0\0\16\0\0\0/machine.slice\0\0\0\0\0\0\r\0\0\0CPUAccounting\0\1b\0\0\0\0\1\0\0\0\0\0\0\0\21\0\0\0BlockIOAccounting\0\1b\0\0\0\0\1\0\0\0\20\0\0\0MemoryAccounting\0\1b\0\1\0\0\0, 2048}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 428 sendmsg(5, {msg_name(0)=NULL, msg_iov(2)=[{l\1\0\1D\1\0\0\n\0\0\0\255\0\0\0\1\1o\0\31\0\0\0/org/freedesktop/systemd1\0\0\0\0\0\0\0\6\1s\0\30\0\0\0org.freedesktop.systemd1\0\0\0\0\0\0\0\0\2\1s\0 \0\0\0org.freedesktop.systemd1.Manager\0\0\0\0\0\0\0\0\3\1s\0\22\0\0\0StartTransientUnit\0\0\0\0\0\0\10\1g\0\7ssa(sv)\0\0\0\0, 192}, {\32\0\0\0machine-lxc\\x2dbusy2.scope\0\0\4\0\0\0fail\0\0\0\0\24\1\0\0\5\0\0\0Slice\0\1s\0\0\0\0\r\0\0\0machine.slice\0\0\0\0\0\0\0\v\0\0\0Description\0\1s\0\0\23\0\0\0Container lxc-busy2\0\0\0\0\0\17\0\0\0TimeoutStopUSec\0\1t\0\0 
\241\7\0\0\0\0\0\4\0\0\0PIDs\0\2au\0\0\0\0\4\0\0\0\210:\0\0\5\0\0\0Slice\0\1s\0\0\0\0\16\0\0\0/machine.slice\0\0\0\0\0\0\r\0\0\0CPUAccounting\0\1b\0\0\0\0\1\0\0\0\0\0\0\0\21\0\0\0BlockIOAccounting\0\1b\0\0\0\0\1\0\0\0\20\0\0\0MemoryAccounting\0\1b\0\1\0\0\0, 324}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 516 recvmsg(5, {msg_name(0)=NULL, msg_iov(1)=[{l\3\1\1+\0\0\0\265\1\0\0]\0\0\0\6\1s\0\6\0\0\0:1.126\0\0\4\1s\0$\0\0\0org.freedesktop.systemd1.InvalidName\0\0\0\0\5\1u\0\n\0\0\0\10\1g\0\1s\0\0\7\1s\0\4\0\0\0:1.1\0\0\0\0\0\0\0Unit name /machine.slice is not valid.\0, 2048}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 155 sendmsg(3, {msg_name(0)=NULL, msg_iov(4)=[{PRIORITY=3\nSYSLOG_FACILITY=4\nCODE_FILE=src/machine/machine.c\nCODE_LINE=246\nCODE_FUNCTION=machine_start_scope\nSYSLOG_IDENTIFIER=systemd-machined\n, 144}, {MESSAGE=, 8}, {Failed to start machine scope: Unit name /machine.slice is not valid., 69}, {\n, 1}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 222 sendmsg(5, {msg_name(0)=NULL, msg_iov(2)=[{l\3\1\1\27\0\0\0\v\0\0\0O\0\0\0\6\1s\0\6\0\0\0:1.130\0\0\4\1s\0\\0\0\0org.freedesktop.DBus.Error.IOError\0\0\0\0\0\0\5\1u\0\2\0\0\0\10\1g\0\1s\0\0, 96}, {\22\0\0\0Input/output error\0, 23}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 119 So machined is getting a useful error back from systemd Unit name /machine.slice is not valid. 
and syslog'ing that error, and then sending back to the dbus client a useless "Input/output error" message :-(

Once I fixed the unit name by removing the leading '/', I hit a second error

  recvmsg(5, {msg_name(0)=NULL, msg_iov(1)=[{l\3\1\0014\0\0\0\301\1\0\0]\0\0\0\6\1s\0\6\0\0\0:1.126\0\0\4\1s\0#\0\0\0org.freedesktop.systemd1.UnitExists\0\0\0\0\0\5\1u\0\f\0\0\0\10\1g\0\1s\0\0\7\1s\0\4\0\0\0:1.1\0\0\0\0/\0\0\0Unit machine-lxc\\x2dbusy2.scope already exists.\0, 2048}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 164

  Unit machine-lxc\\x2dbusy2.scope already exists

But neither 'machinectl list' nor 'systemctl --full' shows any such
machine or unit existing. It seems that when systemd rejected the bogus
slice name, it did not fully clean up the transient scope unit it had
created. This is then blocking further attempts to create the same
transient scope.

Regards,
Daniel
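Two details of the trace above are easier to see in code: the machine name "lxc-busy2" appears in the scope unit name as "lxc\x2dbusy2", and a slice name with a leading '/' is rejected. The sketch below follows the escaping rules as documented in systemd.unit(5) ("/" becomes "-", anything outside ASCII alphanumerics, ":", "_" and "." becomes a \xXX escape, as does a leading "."). The function names are hypothetical, the exact rules in v206 may differ, and the validity check is deliberately rough rather than systemd's real validation.

```python
import string

_SAFE = set(string.ascii_letters + string.digits + ":_.")


def escape_unit_name_part(text):
    """Escape a string for use inside a unit name, per systemd.unit(5)."""
    out = []
    for index, char in enumerate(text):
        if char == "/":
            out.append("-")
        elif char in _SAFE and not (index == 0 and char == "."):
            out.append(char)
        else:
            # Everything else, including '-' itself, becomes \xXX per byte.
            out.extend("\\x%02x" % byte for byte in char.encode("utf-8"))
    return "".join(out)


def machine_scope_name(machine):
    """Scope unit name of the shape seen in the trace (machine-<name>.scope)."""
    return "machine-" + escape_unit_name_part(machine) + ".scope"


def looks_like_unit_name(name):
    """Rough sanity check only: non-empty and free of '/'."""
    return bool(name) and "/" not in name


if __name__ == "__main__":
    print(machine_scope_name("lxc-busy2"))
    print(looks_like_unit_name("/machine.slice"))
    print(looks_like_unit_name("machine.slice"))
```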
Re: [systemd-devel] systemd shutdown vs ostree
On Sat, Jul 20, 2013 at 06:50:13PM -0400, Colin Walters wrote:
> So OSTree sets up systemd inside a chroot - /usr is a read-only bind
> mount, and /var is a bind mount outside the root to a shared location.
> Furthermore, /sysroot points to the real root.
>
> Since last time we discussed this:
> http://lists.freedesktop.org/archives/systemd-devel/2012-September/006668.html
>
> I now use this service inside dracut:
> https://git.gnome.org/browse/ostree/tree/src/dracut/ostree-prepare-root.service
> Which executes:
> https://git.gnome.org/browse/ostree/tree/src/switchroot/ostree-prepare-root.c
>
> Then finally we do dracut's normal systemctl switch-root, and
> everything continues as normal. I haven't had to patch the systemd
> codebase at all for this.
>
> The problem is that on shutdown, systemd will synthesize usr.mount and
> var.mount from /proc/self/mountinfo, but it can't really unmount them
> until the same point as the rootfs. Because these units fail to
> unmount, the normal shutdown process wedges. I can shut down fine with
> systemctl --force poweroff, but then I don't get plymouth integration
> etc.
>
> One way to fix this might be to somehow tell systemd to just ignore
> these mount points during shutdown. Or possibly, switch back to the
> initramfs and unmount them from there.
>
> The ugly thing about switching back to the initramfs is that it
> requires unpacking it from the cpio blob again, which requires /boot to
> be mounted, only to run a few unmount syscalls, and then finally power
> off.
>
> But if there was a way to tell systemd to just ignore the mounts, then
> we'd drop into the final poweroff SIGTERM/SIGKILL/umount spree like
> sysvinit did, and things would work.
>
> Anyone else doing bind mount tricks like this?

A while back I had a similar-ish kind of problem with LXC, when the
original FS had something mounted at, say, /foo/bar/wizz, and then
libvirt bind mounted something at /foo, making /foo/bar/wizz
inaccessible. systemd would still see these over-mounted mounts and fail
to unmount them at shutdown. I fixed libvirt LXC to remove all sub-mounts
before bind mounting the new thing at /foo, so I'm not sure if the
problems I saw with systemd would still exist or not.

There was also a change proposed for the kernel namespaces yesterday to
make it possible to stop a process inside a container from unmounting
things that weren't originally mounted inside the namespace. So if that
is merged, systemd inside a container wouldn't be able to assume it has
the privileges to unmount all filesystems it can see.

Daniel
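The over-mounting situation described above (/foo/bar/wizz hidden by a later mount at /foo) can be spotted from /proc/self/mountinfo. The following Python sketch is a simplified heuristic written for this illustration, not what systemd or libvirt do: it assumes entries appear in mount order, and it treats a mount as hidden when a later mount that is not one of its ancestors sits at the same path or at a parent path.

```python
def parse_mountinfo(text):
    """Return (mount_id, parent_id, mount_point) tuples in file order.

    In /proc/<pid>/mountinfo the first field is the mount ID, the second
    the parent mount ID and the fifth the mount point.
    """
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        mounts.append((int(fields[0]), int(fields[1]), fields[4]))
    return mounts


def is_under(path, prefix):
    """True if path equals prefix or lies beneath it."""
    if prefix == "/":
        return True
    return path == prefix or path.startswith(prefix + "/")


def hidden_mounts(mounts):
    """List mount points shadowed by a later, unrelated mount (heuristic)."""
    parent_of = {mount_id: parent_id for mount_id, parent_id, _ in mounts}

    def chain(mount_id):
        # The mount itself plus all of its ancestors known in this table.
        seen = set()
        while mount_id in parent_of and mount_id not in seen:
            seen.add(mount_id)
            mount_id = parent_of[mount_id]
        return seen

    hidden = []
    for index, (mount_id, _parent, point) in enumerate(mounts):
        ancestors = chain(mount_id)
        for later_id, _later_parent, later_point in mounts[index + 1:]:
            if later_id not in ancestors and is_under(point, later_point):
                hidden.append(point)
                break
    return hidden


SAMPLE = """\
20 1 8:1 / / rw - ext4 /dev/sda1 rw
21 20 8:2 / /foo/bar/wizz rw - ext4 /dev/sda2 rw
22 20 8:3 / /foo rw - ext4 /dev/sda3 rw
23 22 8:4 / /foo/data rw - ext4 /dev/sda4 rw
"""

if __name__ == "__main__":
    print(hidden_mounts(parse_mountinfo(SAMPLE)))
```

The sample table is made up for the example; on a real system the input would be the contents of /proc/self/mountinfo.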
Re: [systemd-devel] Error handling problems with systemd-machined
On Wed, Jul 24, 2013 at 02:13:30PM +0100, Daniel P. Berrange wrote: I'm working on integrating libvirt with systemd-machined for cgroups setup and hitting a number of problems A further discovery - if I pass MemoryAccounting=yes as a scope property, then the process gets immediately killed by the OOM killer Jul 24 14:30:26 localhost systemd[1]: Starting Container lxc-busy3. Jul 24 14:30:26 localhost systemd[1]: Started Container lxc-busy3. Jul 24 14:30:26 localhost systemd-machined[14756]: New machine lxc-busy3. Jul 24 14:30:26 localhost kernel: [ 4326.760834] libvirt_lxc invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0 Jul 24 14:30:26 localhost kernel: [ 4326.760839] libvirt_lxc cpuset=/ mems_allowed=0 Jul 24 14:30:26 localhost kernel: [ 4326.760841] Pid: 26196, comm: libvirt_lxc Not tainted 3.9.0-0.rc1.git0.1.fc19.x86_64 #1 Jul 24 14:30:26 localhost kernel: [ 4326.760843] Call Trace: Jul 24 14:30:26 localhost kernel: [ 4326.760852] [810d2da6] ? cpuset_print_task_mems_allowed+0x96/0xc0 Jul 24 14:30:26 localhost kernel: [ 4326.760857] [8163cc20] dump_header+0x7a/0x1b3 Jul 24 14:30:26 localhost kernel: [ 4326.760860] [8113242e] oom_kill_process+0x1be/0x310 Jul 24 14:30:26 localhost kernel: [ 4326.760864] [811913d5] __mem_cgroup_try_charge+0xad5/0xb20 Jul 24 14:30:26 localhost kernel: [ 4326.760866] [81191c80] ? mem_cgroup_charge_common+0x120/0x120 Jul 24 14:30:26 localhost kernel: [ 4326.760869] [81191be6] mem_cgroup_charge_common+0x86/0x120 Jul 24 14:30:26 localhost kernel: [ 4326.760871] [8119349b] mem_cgroup_newpage_charge+0x4b/0xb0 Jul 24 14:30:26 localhost kernel: [ 4326.760874] [8115954c] handle_pte_fault+0x71c/0xa30 Jul 24 14:30:26 localhost kernel: [ 4326.760877] [81217039] ? ext4_file_write+0x99/0x3f0 Jul 24 14:30:26 localhost kernel: [ 4326.760880] [815223e2] ? 
__sys_recvmsg+0x112/0x290 Jul 24 14:30:26 localhost kernel: [ 4326.760882] [8115a671] handle_mm_fault+0x291/0x660 Jul 24 14:30:26 localhost kernel: [ 4326.760887] [816498e1] __do_page_fault+0x171/0x4f0 Jul 24 14:30:26 localhost kernel: [ 4326.760890] [811d7bd1] ? fsnotify+0x241/0x320 Jul 24 14:30:26 localhost kernel: [ 4326.760892] [81649c6e] do_page_fault+0xe/0x10 Jul 24 14:30:26 localhost kernel: [ 4326.760894] [816493aa] do_async_page_fault+0x2a/0xa0 Jul 24 14:30:26 localhost kernel: [ 4326.760896] [81646388] async_page_fault+0x28/0x30 Jul 24 14:30:26 localhost kernel: [ 4326.760899] Task in /machine.slice/machine-lxc\x2dbusy3.scope killed as a result of limit of /machine.slice/machine-lxc\x2dbusy3.scope Jul 24 14:30:26 localhost kernel: [ 4326.760901] memory: usage 0kB, limit 0kB, failcnt 7 Jul 24 14:30:26 localhost kernel: [ 4326.760920] memory+swap: usage 0kB, limit 9007199254740991kB, failcnt 0 Jul 24 14:30:26 localhost kernel: [ 4326.760921] kmem: usage 0kB, limit 9007199254740991kB, failcnt 0 Jul 24 14:30:26 localhost kernel: [ 4326.760923] Memory cgroup stats for /machine.slice/machine-lxc\x2dbusy3.scope: cache:0KB rss:0KB mapped_file:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB Jul 24 14:30:26 localhost kernel: [ 4326.760931] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name Jul 24 14:30:26 localhost kernel: [ 4326.760956] [26196] 0 2619629225 1732 560 0 libvirt_lxc Jul 24 14:30:26 localhost kernel: [ 4326.760958] Memory cgroup out of memory: Kill process 26196 (libvirt_lxc) score 0 or sacrifice child Jul 24 14:30:26 localhost kernel: [ 4326.761064] Killed process 26196 (libvirt_lxc) total-vm:116900kB, anon-rss:3256kB, file-rss:3672kB Jul 24 14:30:26 localhost kernel: [ 4326.776462] virbr0: port 2(veth0) entered disabled state Jul 24 14:30:26 localhost kernel: [ 4326.777526] device veth0 left promiscuous mode Jul 24 14:30:26 localhost kernel: [ 4326.777548] virbr0: port 2(veth0) entered disabled 
state
Jul 24 14:30:26 localhost avahi-daemon[431]: Withdrawing workstation service for veth1.
Jul 24 14:30:26 localhost avahi-daemon[431]: Withdrawing workstation service for veth0.
Jul 24 14:30:26 localhost systemd-machined[14756]: Machine lxc-busy3 terminated.

It looks like when passing MemoryAccounting=yes, systemd accidentally
initializes the cgroup memory limit to 0 kB, with obvious results.

Regards,
Daniel
Re: [systemd-devel] Error handling problems with systemd-machined
On Wed, Jul 24, 2013 at 02:36:44PM +0100, Daniel P. Berrange wrote:
> On Wed, Jul 24, 2013 at 02:13:30PM +0100, Daniel P. Berrange wrote:
> > I'm working on integrating libvirt with systemd-machined for cgroups
> > setup and hitting a number of problems
>
> A further discovery - if I pass MemoryAccounting=yes as a scope property,
> then the process gets immediately killed by the OOM killer
>
> [snip]
>
> It looks like when passing MemoryAccounting=yes, systemd is
> accidentally initializing the cgroup memory limit to 0 kB, with obvious
> results.

It can be reproduced with simple '.slice' units defined outside of
systemd-machined too

  # cat /etc/systemd/system/machine-demo.slice
  [Unit]
  Description=Demo slice

  [Slice]
  CPUAccounting=yes
  MemoryAccounting=yes

  # systemctl start machine-demo.slice
  # cat /sys/fs/cgroup/memory/machine.slice/machine-demo.slice/memory.limit_in_bytes
  0

Regards,
Daniel
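The check done by hand above can be sketched in a few lines of Python. This is only an illustration, not part of libvirt or systemd: the helper names are made up, and it assumes the cgroup v1 memory controller is mounted at the path used in the reproducer.

```python
from pathlib import Path

# Path taken from the reproducer above; the helper names are hypothetical.
DEMO_SLICE = "/sys/fs/cgroup/memory/machine.slice/machine-demo.slice"


def parse_limit(text: str) -> int:
    """Parse the contents of memory.limit_in_bytes into an integer."""
    return int(text.strip())


def limit_is_zero(text: str) -> bool:
    """True if the limit is 0 bytes, the situation reported in this thread."""
    return parse_limit(text) == 0


def read_limit(cgroup_dir: str = DEMO_SLICE) -> int:
    """Read the memory limit of a cgroup v1 directory (needs a live system)."""
    return parse_limit(Path(cgroup_dir, "memory.limit_in_bytes").read_text())
```

Calling `limit_is_zero()` on the file contents right after `systemctl start machine-demo.slice` would flag the bug without waiting for the OOM killer to fire.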
Re: [systemd-devel] Error handling problems with systemd-machined
On Wed, Jul 24, 2013 at 05:59:48PM +0200, Lennart Poettering wrote:
> > When something goes wrong with the CreateMachine DBus call though all
> > I ever seem to get back is Input/output error. After strace'ing
> > systemd-machined I find the real error
> >
> > recvmsg(5, {msg_name(0)=NULL, msg_iov(1)=[{l\1\0\1\334\0\0\0\2\0\0\0\277\0\0\0\1\1o\0\31\0\0\0/org/freedesktop/machine1\0\0\0\0\0\0\0\6\1s\0\30\0\0\0org.freedesktop.machine1\0\0\0\0\0\0\0\0\2\1s\0 \0\0\0org.freedesktop.machine1.Manager\0\0\0\0\0\0\0\0\3\1s\0\r\0\0\0CreateMachine\0\0\0\10\1g\0\fsayssusa(sv)\0\0\0\0\0\0\0\7\1s\0\6\0\0\0:1.130\0\0\t\0\0\0lxc-busy2\0\0\0\20\0\0\0\335\247\271G\10F\27Y(s\0177]\367\327\353\v\0\0\0libvirt-lxc\0\t\0\0\0container\0\0\0\210:\0\0\0\0\0\0\0\0\0\0\204\0\0\0\0\0\0\0\5\0\0\0Slice\0\1s\0\0\0\0\16\0\0\0/machine.slice\0\0\0\0\0\0\r\0\0\0CPUAccounting\0\1b\0\0\0\0\1\0\0\0\0\0\0\0\21\0\0\0BlockIOAccounting\0\1b\0\0\0\0\1\0\0\0\20\0\0\0MemoryAccounting\0\1b\0\1\0\0\0, 2048}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 428
> > sendmsg(5, {msg_name(0)=NULL, msg_iov(2)=[{l\1\0\1D\1\0\0\n\0\0\0\255\0\0\0\1\1o\0\31\0\0\0/org/freedesktop/systemd1\0\0\0\0\0\0\0\6\1s\0\30\0\0\0org.freedesktop.systemd1\0\0\0\0\0\0\0\0\2\1s\0 \0\0\0org.freedesktop.systemd1.Manager\0\0\0\0\0\0\0\0\3\1s\0\22\0\0\0StartTransientUnit\0\0\0\0\0\0\10\1g\0\7ssa(sv)\0\0\0\0, 192}, {\32\0\0\0machine-lxc\\x2dbusy2.scope\0\0\4\0\0\0fail\0\0\0\0\24\1\0\0\5\0\0\0Slice\0\1s\0\0\0\0\r\0\0\0machine.slice\0\0\0\0\0\0\0\v\0\0\0Description\0\1s\0\0\23\0\0\0Container lxc-busy2\0\0\0\0\0\17\0\0\0TimeoutStopUSec\0\1t\0\0 \241\7\0\0\0\0\0\4\0\0\0PIDs\0\2au\0\0\0\0\4\0\0\0\210:\0\0\5\0\0\0Slice\0\1s\0\0\0\0\16\0\0\0/machine.slice\0\0\0\0\0\0\r\0\0\0CPUAccounting\0\1b\0\0\0\0\1\0\0\0\0\0\0\0\21\0\0\0BlockIOAccounting\0\1b\0\0\0\0\1\0\0\0\20\0\0\0MemoryAccounting\0\1b\0\1\0\0\0, 324}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 516
> > recvmsg(5, {msg_name(0)=NULL, msg_iov(1)=[{l\3\1\1+\0\0\0\265\1\0\0]\0\0\0\6\1s\0\6\0\0\0:1.126\0\0\4\1s\0$\0\0\0org.freedesktop.systemd1.InvalidName\0\0\0\0\5\1u\0\n\0\0\0\10\1g\0\1s\0\0\7\1s\0\4\0\0\0:1.1\0\0\0\0\0\0\0Unit name /machine.slice is not valid.\0, 2048}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 155
> > sendmsg(3, {msg_name(0)=NULL, msg_iov(4)=[{PRIORITY=3\nSYSLOG_FACILITY=4\nCODE_FILE=src/machine/machine.c\nCODE_LINE=246\nCODE_FUNCTION=machine_start_scope\nSYSLOG_IDENTIFIER=systemd-machined\n, 144}, {MESSAGE=, 8}, {Failed to start machine scope: Unit name /machine.slice is not valid., 69}, {\n, 1}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 222
> > sendmsg(5, {msg_name(0)=NULL, msg_iov(2)=[{l\3\1\1\27\0\0\0\v\0\0\0O\0\0\0\6\1s\0\6\0\0\0:1.130\0\0\4\1s\0\\0\0\0org.freedesktop.DBus.Error.IOError\0\0\0\0\0\0\5\1u\0\2\0\0\0\10\1g\0\1s\0\0, 96}, {\22\0\0\0Input/output error\0, 23}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 119
> >
> > So machined is getting a useful error back from systemd
> >
> >   Unit name /machine.slice is not valid.
> >
> > and syslog'ing that error, and then sending back the dbus client a
> > useless Input/output error message :-(
>
> Yeah, we really suck at handing out good errors. But usually you should
> have gotten an (equally useless) EINVAL in most cases.
>
> > Once I fixed the unit name by removing the leading '/', I hit a second
> > error
> >
> > recvmsg(5, {msg_name(0)=NULL, msg_iov(1)=[{l\3\1\0014\0\0\0\301\1\0\0]\0\0\0\6\1s\0\6\0\0\0:1.126\0\0\4\1s\0#\0\0\0org.freedesktop.systemd1.UnitExists\0\0\0\0\0\5\1u\0\f\0\0\0\10\1g\0\1s\0\0\7\1s\0\4\0\0\0:1.1\0\0\0\0/\0\0\0Unit machine-lxc\\x2dbusy2.scope already exists.\0, 2048}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 164
> >
> >   Unit machine-lxc\\x2dbusy2.scope already exists
> >
> > But neither machinectl list nor systemctl --full shows any such machine
> > or unit existing. It seems like when it reported the bogus slice name,
> > it did not fully clean up the transient scope unit it created. This is
> > then blocking further attempts to create the same transient scope.
>
> Hmm, that's interesting. What does systemctl status say for the unit
> in question when this happens? Could you paste?

  # systemctl status 'machine-lxc\x2dbusy4.scope'
  machine-lxc\x2dbusy4.scope
     Loaded: stub (/run/systemd/system/machine-lxc\x2dbusy4.scope; static)
     Active: inactive (dead)

> Kay had some issues where the kernel's release_agent wouldn't be called
> on recent kernels, but I never had issues with that...

If systemd is complaining about the bogus slice name /machine.slice, is it
possible that it has returned this error before it ever placed the init
PID into the cgroup? The kernel release_agent would never trigger if there
was no process in any cgroup to exit, and thus the slice may not get
cleaned up.

Daniel
Re: [systemd-devel] Error handling problems with systemd-machined
On Wed, Jul 24, 2013 at 05:10:59PM +0100, Daniel P. Berrange wrote:
> On Wed, Jul 24, 2013 at 05:59:48PM +0200, Lennart Poettering wrote:
> > > Once I fixed the unit name by removing the leading '/', I hit a
> > > second error
> > >
> > > recvmsg(5, {msg_name(0)=NULL, msg_iov(1)=[{l\3\1\0014\0\0\0\301\1\0\0]\0\0\0\6\1s\0\6\0\0\0:1.126\0\0\4\1s\0#\0\0\0org.freedesktop.systemd1.UnitExists\0\0\0\0\0\5\1u\0\f\0\0\0\10\1g\0\1s\0\0\7\1s\0\4\0\0\0:1.1\0\0\0\0/\0\0\0Unit machine-lxc\\x2dbusy2.scope already exists.\0, 2048}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 164
> > >
> > >   Unit machine-lxc\\x2dbusy2.scope already exists
> > >
> > > But neither machinectl list nor systemctl --full shows any such
> > > machine or unit existing. It seems like when it reported the bogus
> > > slice name, it did not fully clean up the transient scope unit it
> > > created. This is then blocking further attempts to create the same
> > > transient scope.
> >
> > Hmm, that's interesting. What does systemctl status say for the unit
> > in question when this happens? Could you paste?
>
>   # systemctl status 'machine-lxc\x2dbusy4.scope'
>   machine-lxc\x2dbusy4.scope
>      Loaded: stub (/run/systemd/system/machine-lxc\x2dbusy4.scope; static)
>      Active: inactive (dead)
>
> > Kay had some issues where the kernel's release_agent wouldn't be called
> > on recent kernels, but I never had issues with that...
>
> If systemd is complaining about the bogus slice name /machine.slice, is
> it possible that it has returned this error before it ever placed the
> init PID into the cgroup? The kernel release_agent would never trigger
> if there was no process in any cgroup to exit, and thus the slice may
> not get cleaned up.

FYI, I can reproduce this with systemd-nspawn too

  # systemd-nspawn --slice /machine.slice -D /mnt/demo/ -M foo /bin/sh
  Spawning namespace container on /mnt/demo (console is /dev/pts/5).
  Init process in the container running as PID 32057.
  Failed to register machine: Input/output error
  Container failed with error code 251.

Run that multiple times and you'll see that the 2nd time machined gets the
error about a pre-existing unit.

Daniel
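The two names that trip things up in this thread can be checked on the client side before calling CreateMachine. The Python below is a rough sketch, not machined's code: the function names are hypothetical, and the escaping is a simplified rendering of systemd's unit name escaping ('/' becomes '-', anything outside [A-Za-z0-9:_.] becomes \xNN), which is what turns lxc-busy2 into the machine-lxc\x2dbusy2.scope seen in the strace output.

```python
# Characters that systemd leaves untouched in an escaped unit name.
_SAFE = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789:_."


def escape_name(name: str) -> str:
    """Simplified unit name escaping; hypothetical helper for illustration."""
    out = []
    for index, byte in enumerate(name.encode("utf-8")):
        char = chr(byte)
        if char == "/":
            out.append("-")
        elif char in _SAFE and not (index == 0 and char == "."):
            out.append(char)
        else:
            # '-' and '\' land here too, which is where \x2d comes from.
            out.append("\\x%02x" % byte)
    return "".join(out)


def machine_scope_name(machine: str) -> str:
    """Scope unit name as it appears in the StartTransientUnit call above."""
    return "machine-" + escape_name(machine) + ".scope"


def plausible_slice_name(name: str) -> bool:
    """Reject names such as '/machine.slice' that contain a path separator."""
    return "/" not in name and name.endswith(".slice") and name != ".slice"
```

With such a check, `plausible_slice_name("/machine.slice")` is false, so the caller can fail early with a meaningful message instead of relying on the Input/output error that comes back over D-Bus.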
Re: [systemd-devel] udevadm settle hangs due to veths in seperate network namespaces
On Fri, Jul 12, 2013 at 06:00:42PM +0200, Kay Sievers wrote:
> On Fri, Jul 12, 2013 at 5:00 PM, Daniel P. Berrange berra...@redhat.com wrote:
> > On Fri, Jul 12, 2013 at 02:51:10PM +0100, Daniel P. Berrange wrote:
> > > We're hitting a problem in libvirt where 'udevadm settle' will get
> > > stuck in a loop until it eventually times out. Eventually we realized
> > > this happens when we have any LXC containers active with veth devices
> > > in a separate network namespace.
> >
> > Incidentally, I recall reading something by (iirc) Lennart saying that
> > apps really should use 'udevadm settle' at all.
>
> You mean *not*, I guess.

Oops, yes.

> There are still valid uses of settle for command line tools, and that
> will likely be valid in the future too. There is no simple replacement
> for this barrier to be implemented by simple command line tools. Letting
> them subscribe to hotplug would ask for too much in quite a few cases.
>
> No advanced subsystem or service though should rely on or model around
> settle and make assumptions that everything is there now; tools should
> subscribe to udev events and after that enumerate the current devices.
>
> Things that pull in settle at bootup are kind of broken; that is the
> aspect of settle you heard Lennart rightfully complaining about, I guess.
>
> > Libvirt uses it in a couple of places, all related to code which
> > obtains lists of storage devices
>
> Which makes sense according to the current state of affairs. Storage
> tools are only slowly catching up with the reality of devices coming and
> going all the time on today's systems. They get fixed, and things look at
> least better today than they have been, but settle is still needed for
> some operations.
>
> > - After adding a disk partition in parted, we use it to wait for
> >   the /dev/sdXXNNN device nodes to all show up
>
> Primary device node creation (not symlinks) has been synchronous for a
> couple of years. Devtmpfs does that for us. The ioctl to add a part table
> entry or re-read the part table will not return until devtmpfs has
> created the device nodes. The udev symlinks though might only be
> available after a settle call.
>
> > - After logging into an iscsi target with iscsiadm, we use it to
> >   wait for all the /dev/sdXXX device nodes associated with the
> >   iSCSI target to appear.
> >
> > - After triggering a SCSI HBA rescan via sysfs, we use it to wait
> >   for all the /dev/sdXXX device nodes associated with the SCSI HBA
> >   to appear
> >
> > - After creating an NPIV virtual HBA via sysfs, we use it to wait
> >   for all the /dev/sdXXX device nodes associated with the vHBA to
> >   appear
>
> As said, this should all be covered on more recent systems.
>
> > - After activating an LVM volume group, we use it to wait for all
> >   the /dev/VGNAME/ device nodes to appear
> >
> > - After deleting an LVM volume we use it to wait for the device
> >   node to be removed
> >
> > - After adding an LVM volume we use it to wait for the device node
> >   to be added
>
> LVM is a story on its own, it's pretty complex, and it slowly gets fixed
> over time. With the very recent changes it might integrate nicer now. I
> guess there are still situations though where settle is needed and the
> simplest solution.
>
> All of that applies only to the command line tools again, not to bootup
> related services, or full-blown storage management services. It is not ok
> for them to rely on settle.
>
> > You can see a pattern there - after doing some action related to
> > storage, we need to synchronize wrt the creation/deletion of device
> > nodes in /dev, otherwise we miss out LUNs when we scan for the list
> > of device nodes associated with a HBA/VolGroup/etc.
> >
> > Any suggestions for alternative techniques / approaches here ?
>
> I think it's fine and is needed for libvirt to use settle. At least as
> long as it calls the command line tools. There is no generally available
> storage interface on Linux which would solve all these problems for
> libvirt, and I don't think you should declare these problems as libvirt
> problems. Using settle to get a barrier for the tools you need to use
> which themselves cannot handle async setup and hotplug sounds fine to me.
>
> Many of the issues though might already be history with devtmpfs, at
> least when the primary nodes (and not the symlinks) are used.

Unfortunately we do make use of the /dev/disk/by- paths in order to get
paths which are stable across hosts and/or reboots, but not always. So
perhaps I'll look at avoiding use of 'settle' in cases where we don't
need the symlinks and the commands are synchronous.

Daniel
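The conclusion above (only settle when udev-managed symlinks are involved) could look roughly like the Python below. It is a sketch under assumptions, not libvirt's implementation: the helper names are hypothetical, the set of symlink locations is limited to /dev/disk/by-*, and it relies on the `--timeout` and `--exit-if-exists` options of `udevadm settle`.

```python
import subprocess


def needs_settle(path: str) -> bool:
    """True for udev-created symlinks, false for primary devtmpfs nodes."""
    return path.startswith("/dev/disk/by-")


def settle_command(path: str, timeout_s: int = 10) -> list:
    """Build a settle invocation that returns as soon as the path exists."""
    return [
        "udevadm", "settle",
        "--timeout=%d" % timeout_s,
        "--exit-if-exists=%s" % path,
    ]


def wait_for_device(path: str, timeout_s: int = 10) -> None:
    """Skip the barrier entirely when only a primary node is wanted."""
    if needs_settle(path):
        subprocess.run(settle_command(path, timeout_s), check=False)
```

A caller that only needs /dev/sdb1 after a partition change would skip the barrier, while one that needs a /dev/disk/by-path/ link would still wait, but only until that one link shows up.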
Re: [systemd-devel] [HEADSUP] cgroup changes
On Mon, Jun 24, 2013 at 03:27:15PM +0200, Lennart Poettering wrote:
> On Sat, 22.06.13 15:19, Andy Lutomirski (l...@amacapital.net) wrote:
> > 1. I put the entire world into a separate, highly constrained
> > cgroup. My real-time code runs outside that cgroup. This seems to
> > be exactly what slices are for, but I need kernel threads to go in to
> > the constrained cgroup. Will systemd support this?
>
> I am not sure whether the ability to move kernel threads into cgroups
> will stay around at all, from the kernel side. Tejun, can you comment on
> this?

KVM uses the vhost_net device for accelerating guest network I/O
paths. This device creates a new kernel thread on each open(),
and that kernel thread is attached to the cgroup associated
with the process that open()d the device.

If systemd allows for a process to be moved between cgroups, then
it must also be capable of moving any associated kernel threads to
the new cgroup at the same time. This co-placement of vhost-net
threads with the KVM process is critical for the I/O performance
of KVM networking.

Regards,
Daniel
[systemd-devel] systemd-nspawn/LXC containers pam login failure
Following the suggestion in the systemd-nspawn manpage I populated a mini
Fedora 19 chroot, on a Fedora 19 host

  # yum -y --releasever=19 --nogpg --installroot=/srv/mycontainer \
        --disablerepo='*' --enablerepo=fedora \
        install systemd passwd yum fedora-release vim-minimal
  # chroot /srv/mycontainer passwd
  # systemd-nspawn -bD /srv/mycontainer

Systemd boots up nicely and presents a login prompt, but it is impossible
to actually login, PAM always denying the attempts.

Debugging this, there seem to be two issues

1. pam_loginuid.so tries to write to /proc/self/loginuid but is denied
   by the kernel.

   My kernel has CONFIG_AUDIT_LOGINUID_IMMUTABLE=y which means once a
   loginuid is set (in this case from my ssh session into the host), it
   can't be changed (eg by the 'login' process inside the container).
   From the Kconfig comment, this appears to have been a new feature
   built explicitly for systemd based hosts.

   The loginuid appears to be inherited across fork/exec so, AFAICT, the
   only way to avoid this is to spawn the container from something which
   does not already have a loginuid set, eg systemd itself or some other
   process not associated with a login session.

   Not being able to spawn containers from a login session on the host
   is kind of a PITA for development / debugging :-(

   Seems we need to find a way to have systemd-nspawn ensure that the
   'init' process inside the container does not have a 'loginuid' set,
   even if the thing starting the container does. On the flipside, it
   seems this would violate the kernel security design for this feature ?
   If that were the case, then the pam_loginuid module might need to be
   made a no-op inside containers.

2. The audit_log_acct_message() method which is called by pretty much
   any PAM module returns EPERM

   There is no actual syscall returning EPERM here. The EPERM appears to
   be coming back inside the netlink reply message from the kernel audit
   subsystem. Since pretty much every PAM module sends audit messages,
   this causes them all to return fatal errors, failing the login attempt

   The _pam_audit_writelog() method does have code to ignore EPERM, but
   it only does so if 'getuid() != 0'. The container login process has
   uid == 0, so EPERM is treated as fatal.

   The easy (but not necessarily correct) fix is to change

diff -rup Linux-PAM-1.1.6.orig/libpam/pam_audit.c Linux-PAM-1.1.6.new/libpam/pam_audit.c
--- Linux-PAM-1.1.6.orig/libpam/pam_audit.c 2012-08-15 12:08:43.0 +0100
+++ Linux-PAM-1.1.6.new/libpam/pam_audit.c 2013-05-09 10:17:48.679403471 +0100
@@ -46,7 +46,7 @@ _pam_audit_writelog(pam_handle_t *pamh,
   pamh->audit_state |= PAMAUDIT_LOGGED;
 
   if (rc < 0) {
-      if (rc == -EPERM && getuid() != 0)
+      if (rc == -EPERM)
           return 0;
       if (errno != old_errno) {
           old_errno = errno;

   but I'd rather like to understand why the kernel audit netlink layer
   is replying with EPERM in the first place. The container has the
   CAP_AUDIT_WRITE capability.

   Instead of removing the 'getuid() != 0' check, another option would
   be to augment it to also check /proc/1/environ for any 'container'
   env variable.

If I remove the pam_loginuid module and also apply the above audit patch
to PAM, then I can successfully login to a container launched by
systemd-nspawn. It would obviously be preferable to figure out what needs
to be done to make this work out of the box though.

Regards,
Daniel
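The /proc/1/environ idea mentioned above amounts to scanning a NUL-separated environment block for a `container=` entry. A minimal Python sketch of that lookup follows; it is not PAM code, the function names are hypothetical, and reading PID 1's environment normally requires root.

```python
from typing import Optional


def container_type(environ: bytes) -> Optional[str]:
    """Return the value of 'container' from a NUL-separated environ block."""
    for entry in environ.split(b"\0"):
        key, sep, value = entry.partition(b"=")
        if sep and key == b"container":
            return value.decode("utf-8", "replace")
    return None


def running_in_container(environ_path: str = "/proc/1/environ") -> bool:
    """Check PID 1's environment; fails with PermissionError if not root."""
    with open(environ_path, "rb") as handle:
        return container_type(handle.read()) is not None
```

A container manager has to set the variable for this to work; the check only reports what PID 1 was started with.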
Re: [systemd-devel] systemd-nspawn/LXC containers pam login failure
On Thu, May 09, 2013 at 03:32:09PM +0200, Lennart Poettering wrote:
> On Thu, 09.05.13 11:38, Daniel P. Berrange (berra...@redhat.com) wrote:
> > Following the suggestion in the systemd-nspawn manpage I populated a
> > mini Fedora 19 chroot, on a Fedora 19 host
> >
> >   # yum -y --releasever=19 --nogpg --installroot=/srv/mycontainer \
> >         --disablerepo='*' --enablerepo=fedora \
> >         install systemd passwd yum fedora-release vim-minimal
> >   # chroot /srv/mycontainer passwd
> >   # systemd-nspawn -bD /srv/mycontainer
> >
> > Systemd boots up nicely and presents a login prompt, but it is
> > impossible to actually login, PAM always denying the attempts.
>
> Yeah, this is a known problem. We generally suggest to turn off audit
> by booting with audit=0 on the kernel cmdline for now:
>
> https://fedoraproject.org/wiki/Features/SystemdLightweightContainers
>
> I guess I should add a comment about this to nspawn's man page too.
>
> The audit folks are working on adding container awareness to the audit
> subsystem in the kernel (which basically means that audit messages
> carry the outside PID of PID1 of the container, so that auditd can
> track this properly). Currently audit is completely confused by PID
> namespacing. Also, we want them to fix for us that opening a PID
> namespace resets loginuid in the container to -1. We have discussed
> this several times with them, and they wanted to do something about it,
> but so far nothing happened. But we'll have another meeting about this
> next week, so I can put some pressure on this.

Did you file any BZs against the kernel for this ? If not I'll sort
out some BZs to track these problems.

> > 2. The audit_log_acct_message() method which is called by pretty
> >    much any PAM module returns EPERM
> >
> >    There is no actual syscall returning EPERM here. The EPERM appears
> >    to be coming back inside the netlink reply message from the kernel
> >    audit subsystem. Since pretty much every PAM module sends audit
> >    messages, this causes them all to return fatal errors, failing the
> >    login attempt
> >
> >    The _pam_audit_writelog() method does have code to ignore EPERM,
> >    but it only does so if 'getuid() != 0'. The container login
> >    process has uid == 0, so EPERM is treated as fatal.
> >
> >    The easy (but not necessarily correct) fix is to change
> >
> > diff -rup Linux-PAM-1.1.6.orig/libpam/pam_audit.c Linux-PAM-1.1.6.new/libpam/pam_audit.c
> > --- Linux-PAM-1.1.6.orig/libpam/pam_audit.c 2012-08-15 12:08:43.0 +0100
> > +++ Linux-PAM-1.1.6.new/libpam/pam_audit.c 2013-05-09 10:17:48.679403471 +0100
> > @@ -46,7 +46,7 @@ _pam_audit_writelog(pam_handle_t *pamh,
> >    pamh->audit_state |= PAMAUDIT_LOGGED;
> >
> >    if (rc < 0) {
> > -      if (rc == -EPERM && getuid() != 0)
> > +      if (rc == -EPERM)
> >            return 0;
> >        if (errno != old_errno) {
> >            old_errno = errno;
>
> I tried to get a patch like this into PAM actually, but Steve (of
> course) said nononono! He's really married to the idea that audit
> breaks everything on any kind of error... This is kinda sad though, as
> otherwise this would have allowed us to turn off auditing in the
> container completely by removing CAP_AUDIT_CONTROL and CAP_AUDIT_WRITE
> of the container...

I feared that might be the response from PAM maintainers :-(

> I guess libvirt-lxc is in a slightly better situation here regarding
> audit, since it never tries to spawn a container as child of a login
> session, hence loginuid will not be sealed off yet...

If libvirtd has been started from systemd then yes. Of course during
development I just run libvirtd from my source tree directly, so I still
hit the problem :-)

Daniel
Re: [systemd-devel] systemd-nspawn/LXC containers pam login failure
On Thu, May 09, 2013 at 03:32:09PM +0200, Lennart Poettering wrote:
> On Thu, 09.05.13 11:38, Daniel P. Berrange (berra...@redhat.com) wrote:
> > Following the suggestion in the systemd-nspawn manpage I populated a
> > mini Fedora 19 chroot, on a Fedora 19 host
> >
> >   # yum -y --releasever=19 --nogpg --installroot=/srv/mycontainer \
> >         --disablerepo='*' --enablerepo=fedora \
> >         install systemd passwd yum fedora-release vim-minimal
> >   # chroot /srv/mycontainer passwd
> >   # systemd-nspawn -bD /srv/mycontainer
> >
> > Systemd boots up nicely and presents a login prompt, but it is
> > impossible to actually login, PAM always denying the attempts.
>
> Yeah, this is a known problem. We generally suggest to turn off audit
> by booting with audit=0 on the kernel cmdline for now:
>
> https://fedoraproject.org/wiki/Features/SystemdLightweightContainers
>
> I guess I should add a comment about this to nspawn's man page too.
>
> The audit folks are working on adding container awareness to the audit
> subsystem in the kernel (which basically means that audit messages
> carry the outside PID of PID1 of the container, so that auditd can
> track this properly). Currently audit is completely confused by PID
> namespacing. Also, we want them to fix for us that opening a PID
> namespace resets loginuid in the container to -1. We have discussed
> this several times with them, and they wanted to do something about it,
> but so far nothing happened. But we'll have another meeting about this
> next week, so I can put some pressure on this.

Quite by accident I discovered that if you tell systemd-nspawn to create
a new network namespace, you no longer hit the EPERM issues with sending
audit messages. This is because the kernel only listens for audit
messages in the initial network namespace. libaudit catches ECONNREFUSED
and turns it into a no-op returning success, meaning that PAM now works.

So if you use systemd-nspawn --private-network, and make sure it is
launched by systemd itself and not from your shell, then the standard PAM
config will 'just work'

Daniel
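Whether a given shell is a bad place to launch the container from can be checked up front by looking at its own loginuid. The Python below is a small sketch with hypothetical helper names; it assumes the kernel exposes /proc/self/loginuid and reports an unset loginuid as 4294967295, i.e. (uid_t)-1, the "-1" referred to above.

```python
# An unset loginuid is reported as (uid_t)-1.
UNSET_LOGINUID = 4294967295


def loginuid_is_set(text: str) -> bool:
    """Interpret the contents of /proc/<pid>/loginuid."""
    return int(text.strip()) != UNSET_LOGINUID


def current_process_has_loginuid(path: str = "/proc/self/loginuid") -> bool:
    """True if children started from here inherit an already-set loginuid."""
    with open(path) as handle:
        return loginuid_is_set(handle.read())
```

If this returns true on a kernel built with CONFIG_AUDIT_LOGINUID_IMMUTABLE=y, pam_loginuid inside the container can be expected to fail as described earlier in the thread.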
Re: [systemd-devel] Systemd and cgroups
On Wed, Apr 10, 2013 at 12:43:56PM +0300, Kevin Wilson wrote:
> Hello,
> I have a question about systemd and cgroups:
> mount | grep cgroups
> shows that only one entry has name=systemd and is mounted on
> /sys/fs/cgroup/systemd
> (see below the full output of mount | grep cgroups)
>
> Is it true that all other cgroup entries shown by mount | grep cgroups
> were not mounted by systemd (and may be unmounted without directly
> causing problems in systemd)?

If some 3rd party application has mounted cgroups controllers before
systemd starts, it will honour that setup. If they were not already
mounted, then systemd itself will mount all the resource controllers
that are compiled into the kernel. Systemd will only actually create
sub-dirs in those controllers that are listed in the 'DefaultControllers'
setting of systemd.conf, which defaults to 'cpu'.

Daniel
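The distinction asked about above (the name=systemd hierarchy versus the resource controller mounts) can be read out of the mount table programmatically. This Python sketch uses hypothetical function names and assumes the usual /proc/self/mounts layout of device, mount point, filesystem type and options; it cannot tell who performed a mount, only which hierarchy is which.

```python
def cgroup_mounts(mounts_text: str) -> dict:
    """Map each cgroup v1 mount point to its list of mount options."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[2] == "cgroup":
            result[fields[1]] = fields[3].split(",")
    return result


def is_systemd_hierarchy(options: list) -> bool:
    """The named hierarchy systemd uses for its own process tracking."""
    return "name=systemd" in options


def read_cgroup_mounts(path: str = "/proc/self/mounts") -> dict:
    """Parse the live mount table of the calling process."""
    with open(path) as handle:
        return cgroup_mounts(handle.read())
```

Every entry for which `is_systemd_hierarchy()` is false is a resource controller mount of the kind the answer above describes.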
Re: [systemd-devel] ExecRestart
On Wed, Dec 19, 2012 at 11:46:13PM +0100, Lennart Poettering wrote:
> On Wed, 28.11.12 22:41, Brandon Black (blbl...@gmail.com) wrote:
> > The daemon's fast restart code does all of the expensive startup
> > operations in the new daemon first (e.g. parsing large data input),
> > then signals the existing daemon to shut itself down, waits for it to
> > release its critical resources (e.g. sockets, pidfile), and finally
> > takes over those resources and finishes starting itself. Basically
> > it's using the overlap to avoid long service downtimes during that
> > initial parsing phase (and if that parsing fails, it leaves the old
> > daemon running to boot).

[snip]

> Or in other words: I am pretty sure that we should not alter the
> current restart logic, and should not introduce ExecRestart=. However,
> we really should think about either introducing ExecReexec= or somehow
> making ExecReload= useful for reexec-style reloading, too. But I
> haven't made my mind up on this, how this could look like.

FWIW, as previously mentioned, I'd love to see an explicitly supported
way to trigger a re-exec of a daemon. Currently I'm just relying on the
ability to send a custom signal to libvirt's virtlockd daemon. The
problem is that sysadmins would need to learn a different signal number
for each project's daemon. So I think there's value to admins in having
a standard way to trigger this via sysadmin. Personally I think this
should also be separate from ExecReload which is merely used to refresh
configuration files.

Daniel
Re: [systemd-devel] [PATCH] licence: remove references to old FSF address
On Sun, Dec 16, 2012 at 10:23:23PM +, Sami Kerola wrote:
> Bug: https://bugs.freedesktop.org/show_bug.cgi?id=57206
> diff --git a/src/gudev/gudevclient.h b/src/gudev/gudevclient.h
> index b425d03..23bfce6 100644
> --- a/src/gudev/gudevclient.h
> +++ b/src/gudev/gudevclient.h
> @@ -3,19 +3,18 @@
> - * You should have received a copy of the GNU Lesser General Public
> - * License along with this library; if not, write to the
> - * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
> - * Boston, MA 02111-1307, USA.
> + * You should have received a copy of the GNU Library General Public
> + * License along with this library; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA

FWIW, in libvirt we decided that chasing the FSF's office relocations
was wasting everyone's time, so switched the last paragraph to link to
a web URL instead which will hopefully be more permanent...

 * You should have received a copy of the GNU Lesser General Public
 * License along with this library. If not, see
 * http://www.gnu.org/licenses/.

Regards,
Daniel
Re: [systemd-devel] I have switched libvirt-sandbox containers to use multi-user.target
On Tue, Nov 20, 2012 at 09:50:39AM -0500, Daniel J Walsh wrote:
> On 11/20/2012 09:36 AM, Daniel P. Berrange wrote:
> > On Tue, Nov 20, 2012 at 08:52:51AM -0500, Daniel J Walsh wrote:
> > > On 11/19/2012 07:41 PM, Lennart Poettering wrote:
> > > > On Fri, 16.11.12 15:06, Daniel J Walsh (dwa...@redhat.com) wrote:
> > > > > Isn't there a way to shut off systemV init scripts altogether,
> > > > > it just so happens that we hit one on my machine. But in the
> > > > > field a customer could have an init script and then setup
> > > > > containers and systemd will attempt to start it. I want a way
> > > > > to say don't run SysV Init scripts altogether.
> > > >
> > > > Hmm, there is currently no option for that. A semi-dirty trick
> > > > might be to over-bind-mount /etc/rc.d with something empty?
> > > >
> > > > Lennart
> > >
> > > What run levels would get executed? I would prefer to mount over
> > > the empty run levels and allow an admin to be able to turn on a
> > > SysV init script.
> >
> > I'm not convinced we need to support that explicitly. If an admin
> > wants to support execution of some ad-hoc script they can easily make
> > a system unit that uses the various ExecXXX directives to invoke their
> > arbitrary shell scripts.
> >
> > Daniel
>
> I was thinking more that if they wanted to execute chkconfig within the
> container, the right thing would happen, which I get by mounting empty
> dirs over /etc/rc.d/rc.[0-6]d
>
> Similar to us allowing the admin to execute systemctl enable
> foobar.service within the container.

IMHO supporting legacy commands like chkconfig is a non-goal for
libvirt-sandbox. It is brand new functionality designed around closely
integrating with systemd, and I don't think we should pollute it with
code for legacy / dying init systems.

Daniel
[systemd-devel] Client logging to journald without libsystemd-journal.so
I recently introduced support for libvirt logging to journald. Initially I had intended to use libsystemd-journal.so for the logging, however, in the end I made libvirt directly communicate with sendmsg(). First, I wanted to confirm two interface stability issues. - Is the client app - journald logging protocol considered to be ABI stable ? - Is the /run/systemd/journal/socket path considered to be stable ? Second, I wanted to mention why we couldn't use libsystemd-journal.so ourselves. The first problem is that there is no sd_journal_open/close API call to setup the file descriptor. The library uses a one time atomic global initialize to open its file descriptor which is then cached until exit() or execve() (it has SOCK_CLOEXEC set). The problem is that when libvirt does fork() to create client processes, one of the things it does is to iterate from 0 - sysconf(_SC_OPEN_MAX), closing every file descriptor, except those in its whitelist. Now I know there is the school of thought that says this is a bad idea, and that all code should correctly set O_CLOEXEC for all file descriptors. While a nice idea in theory, unfortunately this is not really practical for us in reality because there are too many 3rd party libraries we use which don't do this correctly. Not least because traditional UNIX APIs don't allow for atomically creating an FD with O_CLOEXEC set. So we're stuck with closing all FDs after fork() for a good long time yet. There are two things libsystemd-journal could do to help apps in this scenario. Either provide a way for apps to query the cached journal logging file descriptor, allowing them to explicitly leave it open. Alternatively provide explicit API to call to re-open the FD, which they could call after fork(). Possibly other solutions too, like requiring an explicit close/open like syslog though that has its own set of problems. The second blocker problem was figuring out a way to send log messages using only APIs declared async-signal safe. 
Again this is so that we can safely send log messages in between fork() and execve(), which only permits async-signal-safe APIs. The sd_journal_send() API can't be used since it relies on vasprintf(), which can allocate using malloc(). The sd_journal_sendv() API is pretty close to what we'd want, but the way you have to format the iovec doesn't quite work. IIUC, it requires that each iovec contains a single formatted log item string KEY=VALUE. Populating data in such a way is inconvenient for libvirt. It was easier for us to use two iovec elements for each log item, KEY= and VALUE, so that we can avoid doing the data copy implied by filling a single string with KEY=VALUE.

The upshot is that we ended up filling an iovec[] ourselves, taking care of escaping '\n', and then directly sending it to journald.

As long as the wire format and UNIX socket path are considered ABI stable by systemd devs, I'm fairly happy with the libvirt code as it is. I just mention these issues in case you think it is desirable to add further libsystemd-journal.so APIs to make life easier for other applications doing logging in the future.

Regards,
Daniel
-- 
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel
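The iovec approach described in this message can be sketched roughly as follows. This is a minimal illustration, not libvirt's actual code: the struct log_field type and the fill_iov() and send_fields() helpers are hypothetical names, a third iovec element is assumed here to carry the '\n' terminator of each field, and values that themselves contain '\n' are simply rejected rather than escaped. It allocates nothing on the heap and hands everything to a single sendmsg() call.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <sys/un.h>

/* Socket path named in the message above. */
#define JOURNAL_SOCKET "/run/systemd/journal/socket"
#define MAX_FIELDS 16

/* Hypothetical type: one log item kept as two separate strings. */
struct log_field {
    const char *key_eq;  /* key including the '=', e.g. "MESSAGE=" */
    const char *value;   /* must not contain '\n' in this sketch */
};

/* Point iov[] at the caller's strings without copying them.
 * Returns the number of iovec entries used, or 0 on invalid input. */
static size_t fill_iov(struct iovec *iov, size_t iov_max,
                       const struct log_field *fields, size_t nfields)
{
    static const char newline[] = "\n";
    size_t n = 0;

    for (size_t i = 0; i < nfields; i++) {
        if (n + 3 > iov_max || strchr(fields[i].value, '\n') != NULL)
            return 0;
        iov[n].iov_base = (void *)fields[i].key_eq;
        iov[n++].iov_len = strlen(fields[i].key_eq);
        iov[n].iov_base = (void *)fields[i].value;
        iov[n++].iov_len = strlen(fields[i].value);
        iov[n].iov_base = (void *)newline;
        iov[n++].iov_len = 1;
    }
    return n;
}

/* Send all fields as one datagram. dest may be NULL if fd is connected.
 * Returns 0 on success, -1 on failure. */
static int send_fields(int fd, const struct sockaddr_un *dest,
                       const struct log_field *fields, size_t nfields)
{
    struct iovec iov[3 * MAX_FIELDS];
    struct msghdr mh = {0};
    size_t n = fill_iov(iov, 3 * MAX_FIELDS, fields, nfields);

    if (n == 0)
        return -1;
    if (dest != NULL) {
        mh.msg_name = (void *)dest;
        mh.msg_namelen = sizeof(*dest);
    }
    mh.msg_iov = iov;
    mh.msg_iovlen = n;
    return sendmsg(fd, &mh, MSG_NOSIGNAL) < 0 ? -1 : 0;
}
```

A caller would open an AF_UNIX SOCK_DGRAM socket before fork(), keep it in the FD whitelist, fill a struct sockaddr_un with JOURNAL_SOCKET, and call send_fields() from the child.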
Re: [systemd-devel] Client logging to journald without libsystemd-journal.so
On Thu, Nov 08, 2012 at 04:56:03PM -0500, Colin Walters wrote:
> Sorry about the tone in the last message, it was unnecessary. There's just some history here dating from the libxml2 days...

No worries, no offence taken :-)

> On Thu, 2012-11-08 at 17:38 +0100, Daniel P. Berrange wrote:
> > Yeah, we've looked at borrowed code from GLib in a few cases now, notably threads and atomic ops. I've previously looked at GLib's process spawning code, but didn't notice this particular item. Originally we did have an API fairly similar to the g_spawn_async_with_pipes API, but it is proved fairly cumbersome to use, so we've put together a much more flexible API now [1].
>
> Yeah, I've been working on a new one: https://bugzilla.gnome.org/show_bug.cgi?id=672102
>
> > Possible, though I feel it is a little nasty, not least because when journald then uses SCM_CREDS to find out the sender identity it will be getting the wrong pid and potentially wrong uid/gid too.
>
> This is an interesting case...conceptually it's true that it's a new pid, but I think it's a lot more useful usually what *code* it's running; ordinarily, that'd be an executable. But here we're running code from the parent just before executing a new child. Pretty much any error in fork-before-exec should be fatal, right? So in the case where you're logging an error (e.g. setuid() failed, prctl() failed), the pid is going to be irrelevant anyways since the process will soon exit.

Hmm, good point.

> The uid/gid - yes, but on the other hand, the uid associated with the message will be the one that's conceptually in control at the moment.

It is a tricky question really. If the code failed because it did not have permission to open the file, and the log contains the uid of the parent process, this could mislead the person analysing. At the same time I see your point that the uid/gid/pid should refer to the process in control, which is the parent.
> Regardless though of the approach taken (log from parent, log from forked-before-exec'd child), it'd probably be good to include some standard structured field saying that the code is being run in a child setup. PREEXEC=1?

If the log is sent from the child, you'd really want to also include the PID of the parent process, to allow the log messages to be directly correlated. A shame SCM_CREDS doesn't directly provide the parent-PID too.

Daniel
Re: [systemd-devel] [PATCH] shutdown: do reboot() for openvz container
On Thu, Sep 13, 2012 at 12:30:00AM +0200, Lennart Poettering wrote:
> On Thu, 13.09.12 00:25, Kay Sievers (k...@vrfy.org) wrote:
> > On Wed, Sep 12, 2012 at 11:54 PM, Lennart Poettering <lenn...@poettering.net> wrote:
> > > On Wed, 12.09.12 11:51, Daniel P. Berrange (berra...@redhat.com) wrote:
> > > > NB when libvirt starts an LXC container, it first checks to see whether the kernel has the container aware reboot() support. If it does not, then it removes CAP_SYS_REBOOT from the container, to prevent any accidental whole system reboot. The sf.net LXC tools do the same thing.
> > >
> > > How do you check that? A version check or can you actually detect this feature explicitly?
> >
> > Returning EINVAL is also an easy way to check if this feature is supported by the kernel when invoking another 'reboot' option like CAD.
> > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=cf3f89214ef6a33fad60856bc5ffd7bb2fc4709b
>
> But that's from inside the container. But LXC would need that from outside the container?

Oh, you just need a quick clone() + reboot() pair to figure that out. See the lxcContainerHasReboot() and lxcContainerRebootChild() methods in the libvirt lxc_container.c file:

http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/lxc/lxc_container.c;hb=HEAD#l107

Regards,
Daniel
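The clone() + reboot() probe mentioned above could look roughly like the sketch below. It is loosely modelled on the approach described in this thread and is not a copy of libvirt's lxc_container.c; the function names probe_child() and kernel_has_container_reboot() are hypothetical. It assumes Linux, and that the caller is privileged enough to create a PID namespace (otherwise the probe reports -1). To stay harmless on a kernel without the feature, it re-applies whatever Ctrl-Alt-Del setting is already in effect.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/reboot.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <linux/reboot.h>

/* Runs as PID 1 of a throwaway PID namespace. */
static int probe_child(void *arg)
{
    int cmd = *(int *)arg;

    /* Per the EINVAL check quoted above: a container-aware kernel rejects
     * the CAD toggle from inside a non-initial PID namespace. */
    if (reboot(cmd) == -1 && errno == EINVAL)
        return 1;
    return 0; /* applied host-wide, or refused for another reason */
}

/* Returns 1 if the EINVAL behaviour was seen, 0 if not,
 * -1 if the probe itself could not be run. */
static int kernel_has_container_reboot(void)
{
    const size_t stack_size = 64 * 1024;
    int cad = 0, cmd, status;
    char *stack;
    pid_t pid, w;
    FILE *fp = fopen("/proc/sys/kernel/ctrl-alt-del", "r");

    if (fp == NULL)
        return -1;
    if (fscanf(fp, "%d", &cad) != 1) {
        fclose(fp);
        return -1;
    }
    fclose(fp);

    /* Re-apply the current setting so an old kernel sees a no-op. */
    cmd = cad ? LINUX_REBOOT_CMD_CAD_ON : LINUX_REBOOT_CMD_CAD_OFF;

    stack = malloc(stack_size);
    if (stack == NULL)
        return -1;

    /* The stack grows down, so pass the top of the buffer. */
    pid = clone(probe_child, stack + stack_size, CLONE_NEWPID | SIGCHLD, &cmd);
    if (pid < 0) {
        free(stack);
        return -1; /* typically EPERM when unprivileged */
    }

    do {
        w = waitpid(pid, &status, 0);
    } while (w < 0 && errno == EINTR);
    free(stack);

    if (w != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 1 ? 1 : 0;
}
```

A container manager following the policy quoted in this thread would drop the reboot capability from the container whenever this probe does not return 1.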
Re: [systemd-devel] [PATCH] shutdown: do reboot() for openvz container
On Wed, Sep 12, 2012 at 02:47:48PM +0400, Kir Kolyshkin wrote:
> On 09/11/2012 05:24 AM, Lennart Poettering wrote:
> > On Fri, 24.08.12 16:22, Kir Kolyshkin (k...@openvz.org) wrote:
> > > Proper handling of reboot() syscall issued from the inside of a container was always supported by OpenVZ kernels. More to say, OpenVZ relies on the fact that container calls reboot in order to distinguish between shutdown and reboot -- in the latter case container is being restarted. This patch brings the reboot() back for OpenVZ container.
> >
> > Turns out the normal Linux containers understand reboot() just fine too.
>
> Please note though that the problem with reboot() wrt upstream containers was really nasty -- calling reboot inside container resulted in rebooting the whole system, not just the container.

NB when libvirt starts an LXC container, it first checks to see whether the kernel has the container aware reboot() support. If it does not, then it removes CAP_SYS_REBOOT from the container, to prevent any accidental whole system reboot. The sf.net LXC tools do the same thing.

Daniel
Re: [systemd-devel] systemd coverity
On Thu, Aug 23, 2012 at 03:04:23PM +0200, Zbigniew Jędrzejewski-Szmek wrote:
> On 08/23/2012 02:36 PM, Lennart Poettering wrote:
> > maybe we should add macros like:
> >
> > #define _cleanup_free_ __attribute__((cleanup(freep)))
> > #define _cleanup_fclose_ __attribute__((cleanup(fclosep)))
> >
> > What do you think?
>
> I personally think that this would be a welcome change like the #pragma once cleanup. __attribute__(cleanup) goes all the way back to gcc 3.3, so there's little reason not to use it.
>
> On a related topic: maybe gotos should be used more often for structured cleanup. This might make the code slightly shorter, and also help avoid mistakes.
>
> diff --git src/journal/sd-journal.c src/journal/sd-journal.c
> index 0f7c02c..5e73a94 100644
> --- src/journal/sd-journal.c
> +++ src/journal/sd-journal.c
> @@ -1418,37 +1418,33 @@ static sd_journal *journal_new(int flags, const char *path) {
>          if (path) {
>                  j->path = strdup(path);
> -                if (!j->path) {
> -                        free(j);
> -                        return NULL;
> -                }
> +                if (!j->path)
> +                        goto free_1;
>          }
>
>          j->files = hashmap_new(string_hash_func, string_compare_func);
> -        if (!j->files) {
> -                free(j->path);
> -                free(j);
> -                return NULL;
> -        }
> +        if (!j->files)
> +                goto free_2;
>
>          j->directories_by_path = hashmap_new(string_hash_func, string_compare_func);
> -        if (!j->directories_by_path) {
> -                hashmap_free(j->files);
> -                free(j->path);
> -                free(j);
> -                return NULL;
> -        }
> +        if (!j->directories_by_path)
> +                goto free_3;
>
>          j->mmap = mmap_cache_new();
> -        if (!j->mmap) {
> -                hashmap_free(j->files);
> -                hashmap_free(j->directories_by_path);
> -                free(j->path);
> -                free(j);
> -                return NULL;
> -        }
> +        if (!j->mmap)
> +                goto free_4;
>
>          return j;
> +
> +free_4:
> +        hashmap_free(j->files);
> +free_3:
> +        hashmap_free(j->directories_by_path);
> +free_2:
> +        free(j->path);
> +free_1:
> +        free(j);

If you make sure that your pointer vars are all initialized to NULL, and that all free functions accept NULL, then you can collapse all those separate labels into one. This is much nicer, because then you don't need to go about re-numbering if you need to insert another goto in the middle of the function.
Daniel
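Both ideas from this exchange can be shown in a few lines. The sketch below is only an illustration: the macro names come from the quoted proposal, while the bodies of freep() and fclosep(), the first_line_length() function and the struct thing example are assumptions made up for the demonstration. It relies on the GCC/Clang cleanup attribute, and on zeroed memory reading back as NULL pointers, which holds on the usual platforms.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Cleanup handlers receive a pointer to the annotated variable. */
static void freep(void *p) { free(*(void **)p); }
static void fclosep(FILE **f) { if (*f) fclose(*f); }

#define _cleanup_free_ __attribute__((cleanup(freep)))
#define _cleanup_fclose_ __attribute__((cleanup(fclosep)))

/* Hypothetical example: length of the first line of a file, or -1.
 * Every return path closes f and frees buf automatically. */
static long first_line_length(const char *path)
{
    _cleanup_fclose_ FILE *f = fopen(path, "r");
    _cleanup_free_ char *buf = malloc(256);

    if (f == NULL || buf == NULL)
        return -1;
    if (fgets(buf, 256, f) == NULL)
        return -1;
    return (long)strcspn(buf, "\n");
}

/* Single-label variant of goto cleanup: all members start out NULL and
 * free(NULL) is a no-op, so one label serves every failure point. */
struct thing {
    char *path;
    char *name;
};

static struct thing *thing_new(const char *path, const char *name)
{
    struct thing *t = calloc(1, sizeof(*t));

    if (t == NULL)
        return NULL;
    t->path = strdup(path);
    if (t->path == NULL)
        goto fail;
    t->name = strdup(name);
    if (t->name == NULL)
        goto fail;
    return t;

fail:
    free(t->name);
    free(t->path);
    free(t);
    return NULL;
}
```

Adding a third member to struct thing only needs one more free() under the single label, with no labels to renumber.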
Re: [systemd-devel] Re-exec()ing services for 'systemctl restart' ?
On Wed, Aug 08, 2012 at 07:07:38PM +0200, Lennart Poettering wrote:
> On Mon, 06.08.12 16:52, Daniel P. Berrange (berra...@redhat.com) wrote:
> > For libvirt, we (will soon) have a daemon (virtlockd) which maintains exclusive fcntl() based locks on disk images/devices, on behalf of both libvirtd and any running QEMU or LXC instances. This is a safety critical daemon (hence separate from libvirtd), to the extent that if the daemon stops / crashes, the entire host should be immediately fenced using a kernel watchdog and/or hardware power control device.
> >
> > We still want to be able to restart this daemon during RPM upgrades to newer versions, but we can't use a normal stop+start sequence, because that will lose locks for any active VMs. Thus the daemon has the ability to re-exec() itself triggered by SIGUSR1, preserving its critical state.
> >
> > I've read the manpages for .service, .exec, etc but I've not seen any reference to changing config such that
> >
> > # systemctl restart virtlockd.service
> >
> > will simply send SIGUSR1 to the process, instead of stopping it and then starting it again. Obviously I could make the RPM %post send SIGUSR1 directly and ignore systemctl, but that doesn't help admins who just expect to use systemctl.
> >
> > So I want to know if there is a recommended way to handle this kind of use case?
>
> This is fundamentally difficult to implement, simply because restarting a service also means that the services binding to it need restarting too. And the ordering of that gets impossible if the stop/start sequence is atomic because it is done internally in the service, and cannot be split into two steps that we can order freely against each other.
>
> So, I fear we cannot really add this for you. As Kay suggested systemctl kill is probably your best choice here, or maybe systemctl reload.

Ok, thanks for explaining the issues - I think I'll just use systemctl kill for now.
Regards, Daniel
[systemd-devel] Re-exec()ing services for 'systemctl restart' ?
For libvirt, we (will soon) have a daemon (virtlockd) which maintains exclusive fcntl() based locks on disk images/devices, on behalf of both libvirtd and any running QEMU or LXC instances. This is a safety critical daemon (hence separate from libvirtd), to the extent that if the daemon stops / crashes, the entire host should be immediately fenced using a kernel watchdog and/or hardware power control device.

We still want to be able to restart this daemon during RPM upgrades to newer versions, but we can't use a normal stop+start sequence, because that will lose locks for any active VMs. Thus the daemon has the ability to re-exec() itself triggered by SIGUSR1, preserving its critical state.

I've read the manpages for .service, .exec, etc but I've not seen any reference to changing config such that

# systemctl restart virtlockd.service

will simply send SIGUSR1 to the process, instead of stopping it and then starting it again. Obviously I could make the RPM %post send SIGUSR1 directly and ignore systemctl, but that doesn't help admins who just expect to use systemctl.

So I want to know if there is a recommended way to handle this kind of use case?

Regards,
Daniel
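For readers unfamiliar with the re-exec technique described in this message, here is a minimal sketch of its general shape. It is not virtlockd's code: all function names are hypothetical, and how the new process image learns which descriptors it inherited (for example via argv or an environment variable) is left out. It relies on two documented behaviours: descriptors without FD_CLOEXEC stay open across execve(), and fcntl() record locks are preserved across execve() as long as the descriptor stays open.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Set from the signal handler, polled by the daemon's main loop. */
static volatile sig_atomic_t reexec_requested;

static void on_sigusr1(int signo)
{
    (void)signo;
    reexec_requested = 1;
}

/* Install the SIGUSR1 handler. Returns 0 on success, -1 on failure. */
static int install_reexec_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigusr1;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Clear FD_CLOEXEC so the descriptor, and any fcntl() locks held
 * through it, survive the upcoming execve(). */
static int keep_fd_across_exec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);
}

/* Replace the running image with the binary at binary_path, which after
 * a package upgrade is the new version. Only returns on failure. */
static int reexec_self(const char *binary_path, char *const argv[],
                       const int *lock_fds, size_t nfds)
{
    for (size_t i = 0; i < nfds; i++)
        if (keep_fd_across_exec(lock_fds[i]) < 0)
            return -1;
    execv(binary_path, argv);
    return -1;
}
```

The main loop would check reexec_requested after each wakeup and call reexec_self() with the installed binary's path. The sketch passes an explicit path because /proc/self/exe would still refer to the old, replaced binary.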
Re: [systemd-devel] Re-exec()ing services for 'systemctl restart' ?
On Mon, Aug 06, 2012 at 06:04:11PM +0200, Kay Sievers wrote:
> On Mon, Aug 6, 2012 at 5:52 PM, Daniel P. Berrange <berra...@redhat.com> wrote:
> > For libvirt, we (will soon) have a daemon (virtlockd) which maintains exclusive fcntl() based locks on disk images/devices, on behalf of both libvirtd and any running QEMU or LXC instances. This is a safety critical daemon (hence separate from libvirtd), to the extent that if the daemon stops / crashes, the entire host should be immediately fenced using a kernel watchdog and/or hardware power control device.
> >
> > We still want to be able to restart this daemon during RPM upgrades to newer versions, but we can't use a normal stop+start sequence, because that will lose locks for any active VMs. Thus the daemon has the ability to re-exec() itself triggered by SIGUSR1, preserving its critical state.
> >
> > I've read the manpages for .service, .exec, etc but I've not seen any reference to changing config such that
> >
> > # systemctl restart virtlockd.service
> >
> > will simply send SIGUSR1 to the process, instead of stopping it and then starting it again. Obviously I could make the RPM %post send SIGUSR1 directly and ignore systemctl, but that doesn't help admins who just expect to use systemctl.
> >
> > So I want to know if there is a recommended way to handle this kind of use case?
>
> $ systemctl reload ... ?

I thought about reload, but using that to re-exec the daemon seemed a little evil to me, since it was really for just reloading config files.

> or with the signal specified:
> $ systemctl kill ...

True, I guess I could use that. I'm inferring from your response that there's no way to customize what 'restart' does, as you can do with stop/start/reload/etc.
Daniel
[systemd-devel] Systemd usage wrt libvirt-sandbox
The libvirt-sandbox project[1] is providing an API and command line tools for constructing application sandboxes. It uses either LXC or KVM virtualization via libvirt to confine execution of an application binary, giving it a read-only view of the host root filesystem, with custom writable areas grafted onto selected paths. E.g. if running httpd inside a sandbox, we give it a private /etc/httpd and /var/www, etc.

The idea is to get the security isolation benefits of virtualization technology, without the administrative burden of extra OS installs that it normally entails. As such the only processes running inside each sandbox are the application being confined, and a minimal custom init binary provided by libvirt-sandbox itself.

As we expand our use cases though, particularly to cover the secure containers feature[2] in Fedora 17, it is clear that if we're not careful, our minimal libvirt-sandbox-init-common binary is going to turn into a poor man's copy of systemd. We want to avoid that, and instead actually make use of systemd directly.

Since the sandbox shares the same root filesystem as the host, we can't simply exec 'systemd' as is. We'll need to set up a few custom writable mounts, where we write out custom units / targets, and let systemd keep any state. So I'm trying to figure out just what is the absolute minimal setup we can configure for systemd.

Our primary target for development is to sandbox apache. So I'd like to figure out what minimal config / directory structure I need to create to run systemd and have it only run apache, and a login shell (for debug inside the sandbox). I'm guessing that I can perhaps get away with setting up an override of the host's /etc/systemd, and writing out custom basic.target and default.target unit files, which merely run httpd.unit and a shell?
Regards, Daniel [1] http://berrange.com/tags/libvirt-sandbox/ http://libvirt.org/git/?p=libvirt-sandbox.git;a=summary https://fedoraproject.org/wiki/Features/VirtSandbox [2] https://fedoraproject.org/wiki/Features/SecureContainers
Re: [systemd-devel] RFC: Cooperating in the cgroup tree
On Fri, Aug 19, 2011 at 01:25:16AM +0200, Lennart Poettering wrote:
> Heya,
>
> I put together a short recommendations document explaining how applications making use of cgroups should try to behave in the cgroupfs trees. Since the trees are shared resources everybody should behave nicely and not muck with everybody else's cgroups.
>
> http://www.freedesktop.org/wiki/Software/systemd/PaxControlGroups
>
> I hope these rules are something everybody involved with cgroup client side management can agree to. For now, I'd just like to ask for comments on this on this ML. I'll publish this on a wider scale later on.
>
> So, is there anything I forgot? Anything you are missing on this list? Happy to hear your thoughts and ideas!

I think this looks like a good set of rules for app developers to follow.

Regards,
Daniel