Re: [systemd-devel] default journal retention policy
On Thu, Dec 22, 2022, at 11:00 AM, Lennart Poettering wrote:
> On Do, 22.12.22 10:56, Chris Murphy (li...@colorremedies.com) wrote:
>> Still another idea, we could add a new setting MinRetentionSec=90day
>> which would translate into "not less than 90 days" and would only
>> delete journal files once all the entries in a journal file are at
>> least 90 days old.
>
> Well, that naming would suggest it would override the size
> constraints, and it shouldn't. But yeah, ignoring the choice of name I
> think it would make sense to add that. Add an RFE.

Yeah, I'm not sure what to call it. LessRetentionSec or PrefRetentionSec add jargon, but maybe that's adequately dealt with by updating the journald.conf man page?

-- Chris Murphy
[systemd-devel] default journal retention policy
Hi,

The Fedora Workstation working group is considering reducing the journal retention policy from the upstream default.

This is the tracking issue:
https://pagure.io/fedora-workstation/issue/213

This is the Fedora development list discussion thread:
https://lists.fedoraproject.org/archives/list/de...@lists.fedoraproject.org/thread/NDO5S2KUUDO5G6JLKZGQNFBXOW5KHPR5/#XATT3XYFV2UALPTJTL5RSQD3D4IVNSVO

As Lennart mentions in that devel thread, it's preferred that the change be upstreamable, and the Fedora Workstation working group agrees.
https://lists.fedoraproject.org/archives/list/de...@lists.fedoraproject.org/message/XATT3XYFV2UALPTJTL5RSQD3D4IVNSVO/

The consensus of the discussion is that there should be less retention. The suggested retention period varies quite a bit, but I think 3-6 months is OK.

In practice, most configurations eventually end up with 4G of journals, since that's the cap. That is typically over a year of journals, but of course it really depends on additional configuration. E.g. in my case I do a lot of debugging, so I'm often enabling debug logging; therefore 4G worth of journal files accumulates pretty quickly, maybe in 3 months. As I understand it, rsyslog has a two-week retention policy by default.

journald supports quite a lot of knobs related to journal total size, free space, file sizes, and retention time. My favorite simple idea of the moment is to set a default MaxRetentionSec=100day, which translates to "probably not less than 90 days, but not more than 100 days" of retention. The policy looks at entry age to determine whether the retention threshold is met, but the garbage collection operates on whole journal files. So if a single entry in a file reaches 100 days, the whole file is deleted, which could plausibly remove a week or two of entries.

Still another idea: we could add a new setting MinRetentionSec=90day, which would translate into "not less than 90 days" and would only delete journal files once all the entries in a journal file are at least 90 days old.
By leaving all the other settings alone, the 4G cap (or, if less, the 10% of file system size rule) still applies. So in no case would any configuration end up using more space for logs.

Any thoughts?

-- Chris Murphy
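For reference, the combined policy under discussion might look like this in journald.conf. Note this is a sketch: SystemMaxUse and MaxRetentionSec are real settings, but MinRetentionSec is only the proposal from this thread and does not exist upstream, and the comments paraphrase the semantics as described above:

```ini
[Journal]
# Existing settings:
SystemMaxUse=4G          # the size cap; always wins over time-based policy
MaxRetentionSec=100day   # once an entry in a file reaches 100 days, the
                         # whole file is garbage collected

# Proposed in this thread (hypothetical, not a real option today):
#MinRetentionSec=90day   # never delete a file until all of its entries
                         # are at least 90 days old
```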
Re: [systemd-devel] How can we debug systemd-gpt-auto-generator failures?
On Thu, Jul 28, 2022, at 6:50 AM, Kevin P. Fleming wrote:
> I've got two systems that report a failure (exit code 1) every time
> systemd-gpt-auto-generator is run. There are a small number of reports
> of this affecting other users too:
>
> https://bugs.archlinux.org/task/73168
>
> This *may* be related to the use of ZFS, although I've got a
> half-dozen systems using ZFS and only two of them have this issue.

Are the two with the problem multiple-device ZFS? And the rest are single-device ZFS?

-- Chris Murphy
Re: [systemd-devel] No space left errors on shutdown with systemd-homed /home dir
On Mon, Jan 31, 2022, at 11:26 PM, Zygo Blaxell wrote:
> On Sat, Jan 29, 2022 at 10:53:00AM +0100, Goffredo Baroncelli wrote:
> It does suck that the kernel handles resizing below the minimum size of
> the filesystem so badly; however, even if it rejected the resize request
> cleanly with an error, it's not necessarily a good idea to attempt it.
> Pushing the lower limits of what is possible in resize to save a handful
> of GB is asking for trouble. It's far better to overestimate generously
> than to underestimate the minimum size.

Yeah, there's an inherent conflict with online shrink: the longer the time needed to relocate block groups, the more unpredictable operations can occur during that time to thwart any original estimates made about the shrink operation.

I wondered a while ago about a shrink API that takes the shrink size as a suggestion rather than as a definite target, and then the file system does the best job it can. Either this API reports the actual shrink size once it completes, or the requesting program needs to know to call BTRFS_IOC_FS_INFO and BTRFS_IOC_DEV_INFO to learn the actual size. This hypothetical API could have boundaries outside of which, if the kernel code estimates it's going to fall short, it could trigger a cancel of the shrink. This could be size or time based. e.g. BTRFS_IOC_RESIZE_BEST (effort).

-- Chris Murphy
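A best-effort shrink API of the kind described might behave like the following sketch. This is pure illustration: the function name and semantics are hypothetical, and the "filesystem" is just a simulated minimum achievable size standing in for what block-group relocation can actually reach:

```python
def best_effort_shrink(current_size, requested_size, achievable_min,
                       lower_bound=None):
    """Shrink toward requested_size, reporting what was actually achieved.

    achievable_min models the smallest size relocation can reach (not
    knowable in advance on a real filesystem).  lower_bound is the
    caller's "cancel if we can't even get this small" threshold,
    mirroring the cancel boundary described in the thread.
    Returns (resulting_size, completed).
    """
    achieved = max(requested_size, achievable_min)
    if lower_bound is not None and achieved > lower_bound:
        # The kernel estimates it will fall short of the caller's bound:
        # cancel, leaving the filesystem at its original size.
        return current_size, False
    return achieved, True

# The caller learns the actual size from the return value rather than
# assuming the request was honored exactly.
size, ok = best_effort_shrink(100, 40, achievable_min=55, lower_bound=60)
```

The key design point is that the requester treats the size as advisory and reads back the real outcome, instead of the current all-or-nothing behavior.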
Re: [systemd-devel] No space left errors on shutdown with systemd-homed /home dir
On Wed, Jun 1, 2022, at 5:36 AM, Colin Guthrie wrote:
> Goffredo Baroncelli wrote on 31/05/2022 19:12:
>
>> I suppose that colin.home is a sparse file, so even it has a length of
>> 394GB, it consumes only 184GB. So to me these are valid values. It
>> doesn't matter the length of the files. What does matter is the value
>> returned by "du -sh".
>>
>> Below I create a file with a length of 1000GB. However being a sparse
>> file, it doesn't consume any space and "du -sh" returns 0
>>
>> $ truncate -s 1000GB foo
>> $ du -sh foo
>> 0       foo
>> $ ls -l foo
>> -rw-r--r-- 1 ghigo ghigo 1 May 31 19:29 foo
>
> Yeah the file will be sparse.
>
> That's not really an issue, I'm not worried about the fact it's not
> consuming as much as it reports as that's all expected.
>
> The issue is that systemd-homed (or btrfs's fallocate) can't handle this
> situation and that user is effectively bricked unless migrated to a host
> with more storage space!

Hopefully there's time for systemd-252 for a change still? That version is what I expect to ship in Fedora 37 [1]. There's merit to sd-homed and I want it to be safe and reliable for users to keep using, in order to build momentum.

I really think sd-homed must move the shrink from logout to login. When the user logs out, they are decently likely to immediately close the laptop lid (thus suspend-to-RAM) or shut down. I don't know if the shrink can be cancelled. But regardless, there's going to be a period of time where the file system and storage stacks are busy, right at the time the user is expecting *imminent* suspend or shutdown, which no matter what has to be inhibited until the shrink is cancelled or completed, and all pending writes are flushed to stable media.

Next, consider the low battery situation. Upon notification, anyone with an 18+ month old battery knows there may be no additional warnings, and you could in fact get a power failure next.
In this scenario we have to depend on all storage stack layers, and the drive firmware, doing exactly the correct thing in order for the file system to be in a consistent state, mountable at next boot. I just think this is too much risk, and since sd-homed is targeted primarily at laptop users, all the more reason the fs resize operation should happen at login time, not logout.

In fact, sd-homed might want to inhibit a resize shrink operation if (a) AC power is not plugged in and (b) remaining battery is less than 30%, or some other reasonable value. The resize grow operation is sufficiently cheap and fast that I don't think it needs inhibiting.

Thoughts?

I also just found a few bug reports with a non-exhaustive search that also make me nervous about fs shrink at logout (also implying restart and shutdown) time:

On shutdown, homed resizes until it gets killed
https://github.com/systemd/systemd/issues/22901

Getting "New partition doesn't fit into backing storage, refusing"
https://github.com/systemd/systemd/issues/22255

fails to resize
https://github.com/systemd/systemd/issues/22124

[1] Branching from Rawhide August 9; the earliest release date would be October 18.

-- Chris Murphy
Re: [systemd-devel] No space left errors on shutdown with systemd-homed /home dir
On Sat, Jan 29, 2022 at 2:53 AM Goffredo Baroncelli wrote:
>
> I think that for the systemd use cases (single device FS), a simpler
> approach would be:
>
> fstatfs(fd, )
> needed = sfs.f_blocks - sfs.f_bavail;
> needed *= sfs.f_bsize
>
> needed = roundup_64(needed, 3*(1024*1024*1024))
>
> Compared with the original systemd-homed code, I made the following changes:
> 1) f_bfree is replaced by f_bavail (which seems to be more consistent with
>    the disk usage; to me it seems to also consider the metadata chunk
>    allocation)
> 2) the needed value is rounded up by 3GB in order to account for a further
>    1 data chunk and 2 metadata chunks (DUP)
>
> Comments ?

I'm still wondering if such a significant shrink is even indicated, in lieu of trim. Isn't it sufficient to just trim on logout, thus returning unused blocks to the underlying file system? And then do an fs resize (shrink or grow) as needed on login, so that the user home shows ~80% of the free space of the underlying file system?

homework-luks.c:3407:
/* Before we shrink, let's trim the file system, so that we need less space on disk during the shrinking */

-- Chris Murphy
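Goffredo's estimate above, expressed as a runnable sketch (using Python's statvfs here purely for illustration; the f_bavail choice and the 3 GiB round-up for one data chunk plus two DUP metadata chunks are taken straight from his proposal):

```python
import os

GIB = 1024 * 1024 * 1024

def roundup(value, granularity):
    """Round value up to the next multiple of granularity."""
    return (value + granularity - 1) // granularity * granularity

def estimate_needed_bytes(path):
    """Estimate the space a shrunk filesystem would need, per the proposal."""
    sfs = os.statvfs(path)
    # f_bavail rather than f_bfree, per the proposal: it appears to be more
    # consistent with disk usage, accounting for metadata chunk allocation.
    needed = (sfs.f_blocks - sfs.f_bavail) * sfs.f_frsize
    # Round up by 3 GiB: headroom for 1 data chunk + 2 metadata chunks (DUP).
    return roundup(needed, 3 * GIB)
```

Whether this headroom is generous enough in all btrfs profiles is exactly the open question in this subthread.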
Re: [systemd-devel] No space left errors on shutdown with systemd-homed /home dir
On Wed, Jan 26, 2022 at 4:19 PM Boris Burkov wrote:
>
> On Thu, Jan 27, 2022 at 12:07:53AM +0200, Apostolos B. wrote:
>> This is what homectl inspect user reports:
>>
>>   Disk Size: 128.0G
>>   Disk Usage: 3.8G (= 3.1%)
>>   Disk Free: 124.0G (= 96.9%)
>>
>> and this is what btrfs usage reports:
>>
>> sudo btrfs filesystem usage /home/toliz
>>
>> Overall:
>>     Device size:          127.98GiB
>>     Device allocated:       4.02GiB
>>     Device unallocated:   123.96GiB
>>     Device missing:           0.00B
>>     Used:                   1.89GiB
>>     Free (estimated):     124.10GiB  (min: 62.12GiB)
>>     Free (statfs, df):    124.10GiB
>>     Data ratio:                1.00
>>     Metadata ratio:            2.00
>>     Global reserve:         5.14MiB  (used: 0.00B)
>>     Multiple profiles:           no
>>
>> Data,single: Size:2.01GiB, Used:1.86GiB (92.73%)
>>    /dev/mapper/home-toliz    2.01GiB
>>
>> Metadata,DUP: Size:1.00GiB, Used:12.47MiB (1.22%)
>>    /dev/mapper/home-toliz    2.00GiB
>>
>> System,DUP: Size:8.00MiB, Used:16.00KiB (0.20%)
>>    /dev/mapper/home-toliz   16.00MiB
>>
>> Unallocated:
>>    /dev/mapper/home-toliz  123.96GiB
>
> OK, there is plenty of unallocated space, thanks for confirming.
>
> Looking at the stack trace a bit more, the only thing that really sticks
> out as suspicious to me is btrfs_shrink_device, I'm not sure who would
> want to do that or why.

systemd-homed by default uses btrfs on LUKS on a loop mount, with a backing file. On login, it grows the user home file system by some percentage (I think 80%) of the free space of the underlying file system. And on logout, it does both fstrim and shrinks the fs. I don't know why it does both; it seems adequate to do only fstrim on logout to return unused blocks to the underlying file system, and to do an fs resize on login to either grow or shrink the user home file system.

But also, we don't really have a great estimator of the minimum size a file system can be. `btrfs inspect-internal min-dev-size` is pretty broken right now:
https://github.com/kdave/btrfs-progs/issues/271

I'm not sure if systemd folks would use the libbtrfsutil facility to determine the minimum device shrink size? But also, even the kernel doesn't have a very good idea of how small a file system can be shrunk. Right now it basically has to just start trying, and does it one block group at a time.

Adding systemd-devel@

-- Chris Murphy
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On Thu, Dec 30, 2021 at 3:59 PM Chris Murphy wrote:
>
> ZFS uses volume and user properties which we could probably mimic with
> xattr. I thought I asked about xattr instead of subvolume names at one
> point in the thread but I don't see it. So instead of using subvolume
> names, what about stuffing this information in xattr? My gut instinct
> is this is less transparent and user friendly, it requires more tools
> to know how to use to troubleshoot and fix, etc.

Separate from whether the obscurity of an xattr is a good idea or not, read-only snapshots can't have xattrs added, removed, or modified. We can rename read-only snapshots, however.

While we could unset the ro property, that also wipes the received UUID used by send/receive. And while we could make an rw snapshot of the ro snapshot, modify the xattr, then make an ro snapshot of the rw snapshot, this alters the parent UUID also used by send/receive. So it complicates send/receive workflows as a potential update mechanism, or for backup/restore, for anything that tracks these UUIDs, e.g. btrbk.

-- Chris Murphy
Re: [systemd-devel] the need for a discoverable sub-volumes specification
(I'm sort of not doing a great job of using "sub-volume" to mean generically any of a Btrfs subvolume, a directory, or a logical volume, so hopefully anyone still following can make the leap that I don't intend this spec to be Btrfs specific. I like it being general purpose.)

On Tue, Dec 21, 2021 at 6:57 AM Ludwig Nussel wrote:
>
> The way btrfs is used in openSUSE is based on systems from ten years
> ago. A lot has changed since then. Now with the idea to have /usr on a
> separate read-only subvolume the current model doesn't really work very
> well anymore IMO. So I think there's a window of opportunity to change
> the way openSUSE does things :-)

ZFS uses volume and user properties which we could probably mimic with xattr. I thought I asked about xattr instead of subvolume names at one point in the thread but I don't see it. So instead of using subvolume names, what about stuffing this information in xattr? My gut instinct is this is less transparent and user friendly; it requires more tools to know how to use to troubleshoot and fix, etc.

-- Chris Murphy
Re: [systemd-devel] Antw: [EXT] Re: the need for a discoverable sub-volumes specification
On Mon, Dec 27, 2021 at 3:40 AM Ulrich Windl wrote:
>
>>>> Ludwig Nussel schrieb am 21.12.2021 um 14:57 in
> Nachricht <662e1a92-beb4-e1f1-05c9-e0b38e40e...@suse.de>:
> ...
>> The way btrfs is used in openSUSE is based on systems from ten years
>> ago. A lot has changed since then. Now with the idea to have /usr on a
>> separate read-only subvolume the current model doesn't really work very
>> well anymore IMO. So I think there's a window of opportunity to change
>> the way openSUSE does things :-)
>
> Oh well, while you are doing so: Also improve support for a separate /boot
> volume when snapshotting.

Yeah, how to handle /boot gives me headaches. We have a kind of rollback in the possibility of choosing among kernels. But which kernels are bootable depends on the /usr each is paired with. We need a mechanism to match /boot and /usr together, so that the user doesn't get stuck choosing a kernel version for which the modules don't exist in an older-generation /usr. And then, does this imply some additional functionality in the bootloader to achieve it, or should this information be fully encapsulated in Boot Loader Spec compliant snippets?

-- Chris Murphy
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On Tue, Dec 21, 2021 at 6:57 AM Ludwig Nussel wrote:
>
> Chris Murphy wrote:
>> The part I'm having a hard time separating is the implicit case (use
>> some logic to assemble the correct objects), versus explicit (the
>> bootloader snippet points to a root and the root contains an fstab -
>> nothing about assembly is assumed). And should both paradigms exist
>> concurrently in an installed system, and how to deconflict?
>
> Not sure there is a conflict. The discovery logic is well defined after
> all. Also I assume normal operation wouldn't mix the two. Package
> management or whatever installs updates would automatically do the right
> thing suitable for the system at hand.

rootflags=subvol=/subvolid= should override the discoverable sub-volumes generator. I don't expect rootflags is normally used in a discoverable sub-volumes workflow, but if the user were to add it for some reason, we'd want it to be favored.

>> Further, (open)SUSE tends to define the root to boot via `btrfs
>> subvolume set-default` which is information in the file system itself,
>> neither in the bootloader snippet nor in the naming convention. It's
>> neat, but also not discoverable. If users are trying to
>
> The way btrfs is used in openSUSE is based on systems from ten years
> ago. A lot has changed since then. Now with the idea to have /usr on a
> separate read-only subvolume the current model doesn't really work very
> well anymore IMO. So I think there's a window of opportunity to change
> the way openSUSE does things :-)

I think the transactional model can accommodate this better anyway, and it's the direction I'd like to go in with Fedora: make updates/upgrades happen out of band (in a container, on a snapshot). We can apply resource control limits so that the upgrade process doesn't negatively impact the user's higher-priority workload. If the update fails to complete, or fails a set of simple tests, the snapshot is simply discarded. No harm done to the running system.
If it passes checks, then its name is changed to indicate it's the favored "next root" following reboot. And we don't have to keep a database to snapshot, assemble, and discard things; it can all be done by naming scheme.

I think the naming scheme should include some sort of "in-progress" tag, so it's discoverable that such a sub-volume is (a) not active, (b) in some state of flux that potentially was interrupted, and (c) not critical to the system. Such a sub-volume should either be destroyed (failed update) or renamed (update succeeds). If the owning process were to fail (crash, power failure), the next time it runs to check for updates it would discover this "in-progress" sub-volume and remove it (assuming it's in a failed state).

-- Chris Murphy
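The name-driven lifecycle could be sketched like this. To be clear, the '+inprogress' and '+next' suffixes here are invented for illustration only; the thread hasn't settled on an actual naming convention:

```python
def next_actions(subvols, current_root):
    """Decide what to do with each sub-volume based on its name alone.

    Purely illustrative naming convention:
      '+inprogress' - an update snapshot still (or formerly) being modified
      '+next'       - passed checks, favored root after the next reboot
    """
    actions = {}
    for name in subvols:
        if name == current_root:
            actions[name] = "keep"       # the active root
        elif name.endswith("+inprogress"):
            # Found on a later run: assume the update was interrupted
            # and is in a failed state; remove it.
            actions[name] = "discard"
        elif name.endswith("+next"):
            actions[name] = "boot-next"  # rename/flagged for next boot
        else:
            actions[name] = "keep"
    return actions
```

The point is that no side database is consulted: state lives entirely in the names, so a crashed updater can recover just by listing sub-volumes.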
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On Thu, Nov 4, 2021 at 9:39 AM Lennart Poettering wrote:
> 3. Inside the "@auto" dir of the "super-root" fs, have dirs named
>    <type>[:<instance>]. The type should have a similar vocabulary
>    as the GPT spec type UUIDs, but probably use textual identifiers
>    rather than UUIDs, simply because naming dirs by uuids is
>    weird. Examples:
>
>    /@auto/root-x86-64:fedora_36.0/
>    /@auto/root-x86-64:fedora_36.1/
>    /@auto/root-x86-64:fedora_37.1/
>    /@auto/home/
>    /@auto/srv/
>    /@auto/tmp/
>
>    Which would be assembled by the initrd into the following via bind
>    mounts:
>
>    /         → /@auto/root-x86-64:fedora_37.1/
>    /home/    → /@auto/home/
>    /srv/     → /@auto/srv/
>    /var/tmp/ → /@auto/tmp/

What about arbitrary mount points and their subvolumes? Things we can't predict in advance for all use cases? For example, on my non-ephemeral systems:

* /var/log is a directory contained in subvolume "varlog-x86-64:fedora.35"
* /var/lib/libvirt/images is a directory contained in subvolume "varlibvirtimages-x86-64:fedora.35"
* /var/lib/flatpak is a directory contained in subvolume "varlibflatpak-x86-64:any" - as it isn't Fedora specific and uses its own versioning, in this case I'd expect it gets mounted with any distribution.

These exist so they are excluded from a snapshot and rollback regime that applies to "root-x86-64:fedora.35", which contains usr/ var/ etc/. A rollback of root does not roll back the systemd journal, VM images, or flatpaks.

Is space a valid separator in the name of the subvolume? Or underscore? This would become / to define the path to the mount point.

Additionally, I'm noticing that none of 'journalctl -o verbose', json, or export shows what subvolume was mounted at each mount point. I need to use systemd debug for this information to be included in the journal. Assembly of versioned roots is probably useful logging information by default. e.g.

Dec 10 10:45:00 fovo.local systemd[1]: Mounting '@auto/root-x86-64:fedora.35' at /sysroot...
Dec 10 10:45:11 fovo.local systemd[1]: Mounting '@auto/home' at /home...
Dec 10 10:45:11 fovo.local systemd[1]: Mounting '@auto/varlibflatpak' at /var/lib/flatpak...
Dec 10 10:45:11 fovo.local systemd[1]: Mounting '@auto/varlibvirtimages-x86-64:fedora.35' at /var/lib/libvirt/images...
Dec 10 10:45:11 fovo.local systemd[1]: Mounting '@auto/varlog-x86-64:fedora.35' at /var/log...
Dec 10 10:45:11 fovo.local systemd[1]: Mounting '@auto/swap' at /var/swap...

-- Chris Murphy
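The separator question above could be answered mechanically. As a sketch (and only a sketch: the '<path-with-separator>-<arch>:<os>' layout and the choice of underscore as the path separator are hypothetical, not anything agreed in this thread), an underscore standing in for '/' would make the mount point derivable from the subvolume name alone:

```python
def mount_point_from_name(name, sep="_"):
    """Derive a mount point from a sub-volume name.

    Hypothetical convention: '<path-with-sep>-<arch>:<os>', where sep
    (underscore here, one of the separators asked about above) stands
    in for '/' in the mount path.  So 'var_log-x86-64:fedora.35'
    would mount at /var/log.
    """
    # Strip the '-<arch>...' qualifier, if present.
    path_part = name.split("-", 1)[0] if "-" in name else name
    # Strip a bare ':<os>' qualifier, if no arch part preceded it.
    path_part = path_part.split(":", 1)[0]
    return "/" + path_part.replace(sep, "/")
```

The trade-off is that the path is fully discoverable from the name, at the cost of banning the separator character from path components themselves.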
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On Thu, Nov 11, 2021 at 12:28 PM Lennart Poettering wrote:
> That said: naked squashfs sucks. Always wrap your squashfs in a GPT
> wrapper to make things self-descriptive.

Do you mean the image file contains a GPT, and the squashfs is a partition within the image? Does this recommendation apply to any image? Let's say it's a Btrfs image. And in the context of this thread, the GPT partition type GUID would be the "super-root" GUID?

-- Chris Murphy
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On Tue, Nov 9, 2021 at 8:48 AM Ludwig Nussel wrote:
>
> Lennart Poettering wrote:
>> Or to say this explicitly: we could define the spec to say that if
>> we encounter:
>>
>>    /@auto/root-x86-64:fedora_36.0+3-0
>>
>> on first boot attempt we'd rename it:
>>
>>    /@auto/root-x86-64:fedora_36.0+2-1
>>
>> and so on. Until boot succeeds in which case we'd rename it:
>>
>>    /@auto/root-x86-64:fedora_36.0
>>
>> i.e. we'd drop the counting suffix.
>
> Thanks for the explanation and pointer!
>
> Need to think aloud a bit :-)
>
> That method basically works for systems with read-only root. Ie where
> the next OS to boot is in a separate snapshot, eg MicroOS.
> A traditional system with rw / on btrfs would stay on the same subvolume
> though. Ie the "root-x86-64:fedora_36.0" volume in the example. In
> openSUSE package installation automatically leads to ro snapshot
> creation. In order to fit in I suppose those could then be named eg.
> "root-x86-64:fedora_36.N+0" with increasing N. Due to the +0 the
> subvolume would never be booted.

Yeah, the N+0 subvolumes could be read-only snapshots; their purpose is only to be used as an immutable checkpoint from which to produce derivative, read-write subvolumes.

But what about the case of being in a preboot environment, having no way (yet) to rename or create a new snapshot to boot, and needing to boot one of these read-only snapshots? What if the bootloader were smart enough to add the proper volatile overlay arrangement any time an N+0 subvolume is chosen for boot? Is that plausible and useful?

> Anyway, let's assume the ro case and both efi partition and btrfs volume
> use this scheme. That means each time some packages are updated we get a
> new subvolume. After reboot the initrd in the efi partition would try to
> boot that new subvolume. If it reaches systemd-bless-boot.service the
> new subvolume becomes the default for the future.
>
> So far so good. What if I discover later that something went wrong
> though? Some convenience tooling to mark the current version bad again
> would be needed.
>
> But then having Tumbleweed in mind it needs some capability to boot any
> old snapshot anyway. I guess the solution here would be to just always
> generate a bootloader entry, independent of whether a kernel was
> included in an update. Each entry would then have to specify kernel,
> initrd and the root subvolume to use.

The part I'm having a hard time separating is the implicit case (use some logic to assemble the correct objects), versus explicit (the bootloader snippet points to a root and the root contains an fstab - nothing about assembly is assumed). And should both paradigms exist concurrently in an installed system, and how to deconflict?

Further, (open)SUSE tends to define the root to boot via `btrfs subvolume set-default`, which is information in the file system itself, neither in the bootloader snippet nor in the naming convention. It's neat, but also not discoverable. If users are trying to learn, understand, and troubleshoot how systems boot and assemble themselves, to what degree are they owed transparency without needing extra tools or decoder rings to reveal settings?

The default subvolume is uniquely btrfs, and without an equivalent anywhere else (so far as I'm aware) I'm reluctant to use it for day-to-day boots. I can see the advantage of this for btrfs for some sort of rescue/emergency boot subvolume, however: where the entry doesn't contain the parameter "rootflags=subvol=$root" (which acts as an override for the default subvolume set in the fs itself), the btrfs default subvolume would be used. I'm struggling with its role in all of this though.

-- Chris Murphy
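The rename scheme Lennart quotes above, which mirrors the '+<tries-left>-<tries-done>' counting used by systemd's automatic boot assessment, could be sketched as follows (an illustration of the counting logic only, operating on names as strings rather than on real subvolumes):

```python
import re

_COUNTER = re.compile(r"^(?P<base>.*)\+(?P<left>\d+)(?:-(?P<done>\d+))?$")

def next_name_after_boot_attempt(name):
    """One boot attempt: decrement tries-left, increment tries-done."""
    m = _COUNTER.match(name)
    if not m:
        return name  # no counting suffix: already blessed
    left = int(m.group("left"))
    done = int(m.group("done") or 0)
    if left == 0:
        return name  # out of tries; left for "mark bad" tooling to handle
    return f"{m.group('base')}+{left - 1}-{done + 1}"

def bless(name):
    """Boot succeeded: drop the counting suffix entirely."""
    return re.sub(r"\+\d+(?:-\d+)?$", "", name)
```

A "mark the current version bad again" tool would then just be another rename, e.g. restoring a counting suffix with zero tries left.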
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On Fri, Nov 19, 2021 at 4:17 AM Lennart Poettering wrote: > > On Do, 18.11.21 14:51, Chris Murphy (li...@colorremedies.com) wrote: > > > How to do swapfiles? > > Is this really a concept that deserves too much attention? *shrug* Only insofar as I like order, and like the idea of agreeing on where things belong if there's going to appear somewhere. > I mean, I > have the suspicion that half the benefit of swap space is that it can > act as backing store for hibernation. Yes and that's a terrible conflation. The swapfile/device is for anonymous pages. And hibernation images are not anon pages, and even have special rules like must be contained in contiguous physical device blocks. It may turn out that 'swsusp' (Swap Suspend) in the kernel shouldn't be deprecated, and instead focus future effort on 'uswsusp'. But discussions around signed and authenticated hibernation images for UEFI Secure Boot and kernel lockdown compatibility, have all been around the kernel implementation. https://www.kernel.org/doc/Documentation/power/swsusp.rst https://www.kernel.org/doc/Documentation/power/userland-swsusp.rst > But swap files are icky for that > since that means the resume code has to mount the fs first, but given > the fs is dirty during the hibernation state this is highly problematic. It's sufficiently complicated and non-fail-safe (it's fail danger) that it's broken. On btrfs, it's more tedious but less broken because you must use both resume=UUID=$uuid resume_offset=$physicaloffsethibernationimage In effect the kernel does not need to mount ro the btrfs file system at all, it gets the hint for the physical location of the hibernation image from kernel boot parameter. Other file systems support discovery of the physical offset once the file system is mounted ro. On Btrfs you can see the swapfile as having a punch through mechanism. It's a reservation of blocks, and page outs happen directly to that reservation of blocks, not via the file system itself. 
This is why there are all these limitations: balance doesn't touch block groups containing any swapfile blocks, you can't do any kind of multiple-device stuff, you can't snapshot/reflink the swapfile, etc. Which is why I'm in favor of just ceding this entire territory over to systemd to manage correctly.

But as a prerequisite, the hibernation image should be separate from the swapfile. And it should have a metadata format so we can pair file system state to hibernation image state, so that for sure we aren't running into catastrophic nonsense like this right at the top of https://www.kernel.org/doc/Documentation/power/swsusp.rst

**BIG FAT WARNING** If you touch anything on disk between suspend and resume... ...kiss your data goodbye. If you do resume from initrd after your filesystems are mounted... ...bye bye root partition.

Horrible.

> Hence, I have the suspicion that if you do swap you should probably do swap partitions, not swap files, because it can cover all usecases: paging *and* hibernation.

I agree only insofar as it's the most reliable thing we have right now. Not that it's an efficient or safe design: you can still have problems if you rw mount a file system and then resume from a hibernation image. The kernel has no concept of matching a file system state to that of a hibernation image, so that the hibernation image can be invalidated, thus avoiding subsequent corruption.

> > Currently I'm creating a "swap" subvolume in the top-level of the file system and /etc/fstab looks like this
> >
> > UUID=$FSUUID /var/swap btrfs noatime,subvol=swap 0 0
> > /var/swap/swapfile1 none swap defaults 0 0
> >
> > This seems to work reliably after hundreds of boots.
> >
> > a. Is this naming convention for the subvolume adequate? Seems like it can just be "swap" because the GPT method is just a single partition type GUID that's shared by multiboot Linux setups, i.e. not arch or distro specific
>
> I'd still put it one level down, and mark it with some non-typical character so that it is less likely to clash with anything else.

I'm not sure I understand "one level down". The "swap" subvolume would be in the top level of the Btrfs file system, just like Fedora's existing "root" and "home" subvolumes are in the top level.

> > b. Is the mount point, /var/swap, OK?
>
> I see no reason why not.

OK, super.

> > c. What should the additional naming convention be for the swapfile itself so swapon happens automatically?
>
> To me it appears these things should be distinct: if automatic activation of swap files is desirable, then there should probably be a systemd generator that finds all suitable files in /var/swap/ and generates .swap units for them. This would then w
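A generator along the lines described would scan /var/swap at boot and drop .swap units into the generator directory. A minimal sketch, hypothetical: real systemd generators are C binaries, and the full systemd-escape unit-name escaping is simplified here; the /var/swap location and swapfile* naming come from the thread:

```python
import os

def escape_path(path: str) -> str:
    """Simplified stand-in for `systemd-escape -p`: real escaping also
    encodes unusual characters as \\xNN, which is omitted here."""
    return path.strip("/").replace("/", "-")

def generate_swap_units(swap_dir: str = "/var/swap",
                        unit_dir: str = "/run/systemd/generator") -> list:
    """Emit a .swap unit for every swapfile found in swap_dir, the way
    a boot-time generator would. Returns the unit file names written."""
    os.makedirs(unit_dir, exist_ok=True)
    written = []
    for name in sorted(os.listdir(swap_dir)):
        path = os.path.join(swap_dir, name)
        if not (name.startswith("swapfile") and os.path.isfile(path)):
            continue
        unit = escape_path(path) + ".swap"
        with open(os.path.join(unit_dir, unit), "w") as f:
            f.write(f"[Unit]\nDescription=Swap file {path}\n\n"
                    f"[Swap]\nWhat={path}\n")
        written.append(unit)
    return written
```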
Re: [systemd-devel] Antw: [EXT] Re: [systemd-devel] the need for a discoverable sub-volumes specification
On Mon, Nov 22, 2021 at 3:02 AM Ulrich Windl wrote:

> >>> Lennart Poettering wrote on 19.11.2021 at 10:17:
> > On Do, 18.11.21 14:51, Chris Murphy (li...@colorremedies.com) wrote:
> >
> >> How to do swapfiles?
> >
> > Is this really a concept that deserves too much attention? I mean, I have the suspicion that half the benefit of swap space is that it can act as backing store for hibernation. But swap files are icky for that since that means the resume code has to mount the fs first, but given the fs is dirty during the hibernation state this is highly problematic.
> >
> > Hence, I have the suspicion that if you do swap you should probably do swap partitions, not swap files, because it can cover all usecases: paging *and* hibernation.
>
> Out of curiosity: What about swap LVs, possibly thin-provisioned ones?

I don't think that's supported.
https://listman.redhat.com/archives/linux-lvm/2020-November/msg00039.html

-- Chris Murphy
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On Thu, Nov 18, 2021 at 2:51 PM Chris Murphy wrote:

> How to do swapfiles?
>
> Currently I'm creating a "swap" subvolume in the top-level of the file system and /etc/fstab looks like this
>
> UUID=$FSUUID /var/swap btrfs noatime,subvol=swap 0 0
> /var/swap/swapfile1 none swap defaults 0 0
>
> This seems to work reliably after hundreds of boots.
>
> a. Is this naming convention for the subvolume adequate? Seems like it can just be "swap" because the GPT method is just a single partition type GUID that's shared by multiboot Linux setups, i.e. not arch or distro specific
> b. Is the mount point, /var/swap, OK?
> c. What should the additional naming convention be for the swapfile itself so swapon happens automatically?

Actually I'm thinking of something different suddenly... because without user ownership of swapfiles, and instead systemd having domain over this, it's perhaps more like:

/x-systemd.auto/swap -> /run/systemd/swap

And then systemd just manages the files in that directory per policy, e.g. does on-demand creation of swapfiles with variable size increments, as well as cleanup.

-- Chris Murphy
Re: [systemd-devel] the need for a discoverable sub-volumes specification
How to do swapfiles?

Currently I'm creating a "swap" subvolume in the top-level of the file system and /etc/fstab looks like this

UUID=$FSUUID /var/swap btrfs noatime,subvol=swap 0 0
/var/swap/swapfile1 none swap defaults 0 0

This seems to work reliably after hundreds of boots.

a. Is this naming convention for the subvolume adequate? Seems like it can just be "swap" because the GPT method is just a single partition type GUID that's shared by multiboot Linux setups, i.e. not arch or distro specific
b. Is the mount point, /var/swap, OK?
c. What should the additional naming convention be for the swapfile itself so swapon happens automatically?

Also, instead of /@auto/ I'm wondering if we could have /x-systemd.auto/? This makes it more clearly systemd's namespace, and while I'm a big fan of the @ symbol for typographic history reasons, it's being used in the subvolume/snapshot regimes rather haphazardly for different purposes, which might be confusing. E.g. Timeshift expects subvolumes it manages to be prefixed with @. Meanwhile SUSE uses @ for its (visible) root subvolume in which everything else goes. And ZFS uses @ for their (read-only) snapshots.

-- Chris Murphy
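To make the discovery-by-name idea concrete, the analogue of the GPT type GUID could be a simple table keyed on a reserved name prefix. A hypothetical sketch; the x-systemd. prefix is the proposal above, but the name-to-mountpoint table is invented for illustration and not part of any spec:

```python
# Hypothetical name-to-mountpoint table for sub-volume discovery,
# playing the role the partition-type GUIDs play in the Discoverable
# Partitions Specification. None of these names are standardized.
CONVENTION = {
    "root": "/",
    "home": "/home",
    "srv": "/srv",
    "var": "/var",
}

def mountpoint_for(subvol: str, prefix: str = "x-systemd."):
    """Map a conventionally named sub-volume to its mount point, or
    None when the name falls outside the convention (e.g. Timeshift's
    @-prefixed subvolumes would simply be ignored)."""
    if not subvol.startswith(prefix):
        return None
    return CONVENTION.get(subvol[len(prefix):])
```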
Re: [systemd-devel] the need for a discoverable sub-volumes specification
Lennart most recently (about a year ago) wrote on this in a mostly unrelated Fedora devel@ thread. I've found the following relevant excerpts and provide the source URLs as well.

> BTW, we once upon a time added a TODO list item of adding a btrfs generator to systemd, similar to the existing GPT generator: it would look at the subvolumes of the root btrfs fs, and then try to mount stuff it finds if it follows a certain naming scheme.

https://lists.fedoraproject.org/archives/list/de...@lists.fedoraproject.org/message/M756KVDNY65VONU3GA5CSXB4LBJD3ZIW/

> All I am asking for is to make this simple and robust and forward looking enough so that we can later add something like the generator I proposed without having to rearrange anything. i.e. make the most basic stuff self-describing now, even if the automatic discovering/mounting of other subvols doesn't happen today, or even automatic snapshotting. By doing that correctly now, you can easily extend things later incrementally without breaking stuff, just by *adding* stuff. And you gain immediate compat with "systemd-nspawn --image=" right away as the basic minimum, which already is great.

https://lists.fedoraproject.org/archives/list/de...@lists.fedoraproject.org/message/JB2PMFPPRS4YII3Q4BMHW3V33DM2MT44/

> We manage to name RPMs with versions, epochs, archs and so on, I doubt we need much more for naming subvolumes to auto-assemble.

https://lists.fedoraproject.org/archives/list/de...@lists.fedoraproject.org/message/VBVFQOG5EYI73CGFVCLMGX72IZUCQEYG/

-- Chris Murphy
[systemd-devel] the need for a discoverable sub-volumes specification
There is a Discoverable Partitions Specification
http://systemd.io/DISCOVERABLE_PARTITIONS/

The problem with this for Btrfs, ZFS, and LVM is that a single volume can represent multiple use cases via multiple volumes: subvolumes (Btrfs), datasets (ZFS), and logical volumes (LVM). I'll just use the term sub-volume for all of these, but I'm open to some other generic term.

None of the above volume managers expose the equivalent of GPT's partition type GUID per sub-volume. One possibility that's available right now is the sub-volume's name. All we need is a spec for that naming convention.

An early prototype of this idea was posted by Lennart:
https://0pointer.net/blog/revisiting-how-we-put-together-linux-systems.html

Lennart previously mentioned elsewhere that this is probably outdated. So let's update it and bring it more in line with the purpose and goal set out in the discoverable partitions spec, which is to obviate the need for /etc/fstab.

-- Chris Murphy
Re: [systemd-devel] [EXT] Re: consider dropping defrag of journals on btrfs
On Tue, Feb 9, 2021 at 8:02 AM Phillip Susi wrote:

> Chris Murphy writes:
>
> > Basically correct. It will merge random writes such that they become sequential writes. But it means inserts/appends/overwrites for a file won't be located with the original extents.
>
> Wait, I thought that was only true for metadata, not normal file data blocks? Well, maybe it becomes true for normal data if you enable compression. Or small files that get leaf packed into the metadata chunk.

Both data and metadata.

> If it's really combining streaming writes from two different files into a single interleaved write to the disk, that would be really silly.

It's not interleaving. It uses delayed allocation to make random writes into sequential writes. It tries harder to keep file blocks together in the nossd case, and is a bit more opportunistic with the ssd mount option. And it also depends on the pattern of the writer. There's btrfs-heatmap to get a visual idea of these behaviors.
https://github.com/knorrie/btrfs-heatmap

-- Chris Murphy
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Mon, Feb 8, 2021 at 8:20 AM Phillip Susi wrote:

> Chris Murphy writes:
>
> > I showed that the archived journals have way more fragmentation than active journals. And the fragments in active journals are insignificant, and can even be reduced by fully allocating the journal
>
> Then clearly this is a problem with btrfs: it absolutely should not be making the files more fragmented when asked to defrag them.

I've asked. We'll see.

> > file to final size rather than appending - which has a good chance of fragmenting the file on any file system, not just Btrfs.
>
> And yet, you just said the active journal had minimal fragmentation.

Yes, the extents are consistently 8MB in the nodatacow case, old and new file system alike. Same as ext4 and XFS.

> That seems to mean that the 8mb fallocates that journald does is working well. Sure, you could probably get fewer fragments by fallocating the whole 128 mb at once, but there are tradeoffs to that that are not worth it. One fragment per 8 mb isn't a big deal. Ideally a filesystem will manage to do better than that ( didn't btrfs have a persistent reservation system for this purpose? ), but it certainly should not commonly do worse.

I don't think any of the file systems guarantee a contiguous block range upon fallocate; they only guarantee that writes to fallocated space will succeed, i.e. it's a space reservation. But yeah, in practice 8MB is small enough that chances are you'll see one 8MB extent. And I agree 8MB isn't a big deal.

Does anyone complain about journal fragmentation on ext4 or XFS? If not, then we come full circle to my second email in the thread, which is: don't defragment when nodatacow, only defragment when datacow. Or use BTRFS_IOC_DEFRAG_RANGE and specify an 8MB length. That does seem to consistently no-op on nodatacow journals, which have 8MB extents.
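Unlike the argument-less BTRFS_IOC_DEFRAG, BTRFS_IOC_DEFRAG_RANGE takes a 48-byte argument struct, which is what allows passing a length and an extent-size threshold like the 8MB suggested here. A sketch of the call from Python, assuming the struct btrfs_ioctl_defrag_range_args layout from linux/btrfs.h:

```python
import fcntl
import struct

# _IOW(0x94, 16, struct btrfs_ioctl_defrag_range_args); the struct is
# u64 start, u64 len, u64 flags, u32 extent_thresh, u32 compress_type,
# u32 unused[4] -- 48 bytes total.
BTRFS_IOC_DEFRAG_RANGE = 0x40309410

def defrag_range_args(start: int = 0, length: int = 8 * 1024 * 1024,
                      extent_thresh: int = 32 * 1024 * 1024) -> bytes:
    """Pack the argument struct. Extents already at least
    extent_thresh bytes long are skipped, which matches the no-op
    behaviour observed on nodatacow journals with 8MB extents."""
    return struct.pack("=QQQII4I", start, length, 0,
                       extent_thresh, 0, 0, 0, 0, 0)

def defrag_range(fd: int, **kwargs) -> None:
    """Issue the ioctl; only valid on a file descriptor on Btrfs."""
    fcntl.ioctl(fd, BTRFS_IOC_DEFRAG_RANGE, defrag_range_args(**kwargs))
```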
> > Further, even *despite* this worse fragmentation of the archived journals, bcc-tools fileslower shows no meaningful latency as a result. I wrote this in the previous email. I don't understand what you want me to show you.
>
> *Of course* it showed no meaningful latency because you did the test on an SSD, which has no meaningful latency penalty due to fragmentation. The question is how bad is it on HDD.

The reason I'm dismissive is because the nodatacow fragment case is the same as ext4 and XFS; the datacow fragment case is both spectacular and non-deterministic. The workload will matter for where these random 4KiB journal writes end up on an HDD. I've seen journals with hundreds to thousands of extents. I'm not sure what we learn from me doing a single isolated test on an HDD.

And also, only defragmenting on rotation strikes me as leaving performance on the table, right? If there is concern about fragmented archived journals, then isn't there concern about fragmented active journals?

But it sounds to me like you want to learn what the performance is of journals defragmented with BTRFS_IOC_DEFRAG specifically? I don't think it's interesting, because you're still better off leaving nodatacow journals alone, and something still has to be done in the datacow case. It's two extremes. What the performance is doesn't matter; it's not going to tell you anything you can't already infer from the two layouts.

> > And since journald offers no ability to disable the defragment on Btrfs, I can't really do a longer term A/B comparison can I?
>
> You proposed a patch to disable it. Test before and after the patch.

Is there a test mode for journald to just dump a bunch of random stuff into the journal to age it? I don't want to wait weeks to get a dozen journal files.

> > I did provide data. That you don't like what the data shows: archived journals have more fragments than active journals, is not my fault.
> > The existing "optimization" is making things worse, in addition to adding a pile of unnecessary writes upon journal rotation.
>
> If it is making things worse, that is definitely a bug in btrfs. It might be nice to avoid the writes on SSD though since there is no benefit there.

Agreed.

-- Chris Murphy
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Mon, Feb 8, 2021 at 7:56 AM Phillip Susi wrote:

> Chris Murphy writes:
>
> >> It sounds like you are arguing that it is better to do the wrong thing on all SSDs rather than do the right thing on ones that aren't broken.
> >
> > No I'm suggesting there isn't currently a way to isolate defragmentation to just HDDs.
>
> Yes, but it sounded like you were suggesting that we shouldn't even try, not just that it isn't 100% accurate. Sure, some SSDs will be stupid and report that they are rotational, but most aren't stupid, so it's a good idea to disable the defragmentation on drives that report that they are non rotational.

So far as I've seen, all USB devices report rotational: all USB flash drives, and any SSD in an enclosure. Maybe there's some way of estimating rotational based on latency standard deviation, and sticking that in sysfs, instead of trusting device reporting. But in the meantime, the imperfect rule could be: do not defragment unless it's SCSI/SATA/SAS and it reports that it's rotational.

-- Chris Murphy
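The imperfect sysfs check under discussion reads the kernel's per-device rotational attribute; a sketch, with the caveat from the thread that USB bridges routinely report 1 regardless of the actual medium:

```python
def is_rotational(device: str, sys_block: str = "/sys/block"):
    """Return the kernel's rotational flag for a block device, e.g.
    is_rotational("sda"); None when the attribute can't be read.
    USB enclosures often report 1 for SSDs, so a 1 is only a hint."""
    try:
        with open(f"{sys_block}/{device}/queue/rotational") as f:
            return f.read().strip() == "1"
    except OSError:
        return None
```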
Re: [systemd-devel] [EXT] Re: consider dropping defrag of journals on btrfs
On Mon, Feb 8, 2021 at 1:24 AM Ulrich Windl wrote:

> I didn't follow the thread tightly, but there was a happy mix of IOps, fragments (and no bandwidth), but I wonder here: Isn't it a concept of BtrFS that writes are fragmented if there is no contiguous free space? The idea was *not* to spend time trying to find a good space to write to, but use the next available one.

Basically correct. It will merge random writes such that they become sequential writes. But it means inserts/appends/overwrites for a file won't be located with the original extents.

> >> If you want an optimization that's actually useful on Btrfs, /var/log/journal/ could be a nested subvolume. That would prevent any
>
> Actually I still didn't get the benefit of a BtrFS subvolume, but that's a different topic: Don't all writes end up in a single storage pool?

Subvolumes/snapshots are file b-trees. It's the granularity of snapshots, send/receive, and the fsync log tree. And at least the user space tools don't do recursive snapshotting, so they stop at subvolume boundaries, which can be important in some cases if the intent is to use nodatacow: snapshots result in nodatacow files being subject to cow.

-- Chris Murphy
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Fri, Feb 5, 2021 at 8:23 AM Phillip Susi wrote:

> Chris Murphy writes:
>
> > But it gets worse. The way systemd-journald is submitting the journals for defragmentation is making them more fragmented than just leaving them alone.
>
> Wait, doesn't it just create a new file, fallocate the whole thing, copy the contents, and delete the original?

Same inode, so no. As to the logic, I don't know. I'll ask upstream to document it.

> How can that possibly make fragmentation *worse*?

I'm only seeing this pattern with journald journals, and BTRFS_IOC_DEFRAG. But I'm also seeing it with all archived journals. Meanwhile, active journals exhibit no different pattern from ext4 and XFS, no worse fragmentation.

Consider other storage technologies where COW and snapshots come into play. For example, anything based on device-mapper thin provisioning is going to run into these issues. How it allocates physical extents isn't up to the file system. Duplicate a file and delete the original, and you might get a more fragmented file as well. The physical layout is entirely decoupled from the file system: the file system could tell you "no fragmentation" and yet it is highly fragmented, or vice versa.

These problems are not unique to Btrfs. Is there a VFS API for handling these issues? Should there be? I really don't think any application, including journald, should have to micromanage these kinds of things on a case-by-case basis. General problems like this need general solutions.

> > All of those archived files have more fragments (post defrag) than they had when they were active. And here is the FIEMAP for the 96MB file which has 92 fragments.
>
> How the heck did you end up with nearly 1 frag per mb?

I didn't do anything special, it's a default configuration. I'll ask Btrfs developers about it. Maybe it's one of those artifacts of FIEMAP I mentioned previously.
Maybe it's not that badly fragmented for a drive that's going to reorder reads anyway, to be more efficient about it.

> > If you want an optimization that's actually useful on Btrfs, /var/log/journal/ could be a nested subvolume. That would prevent any snapshots above from turning the nodatacow journals into datacow journals, which does significantly increase fragmentation (it would in the exact same case if it were a reflink copy on XFS for that matter).
>
> Wouldn't that mean that when you take snapshots, they don't include the logs?

That's a snapshot/rollback regime design and policy question. If you snapshot the subvolume that contains the journals, the journals will be in the snapshot. The user space tools do not have an option for recursive snapshots, so snapshotting does end at subvolume boundaries. If you want the journals snapshot, then their enclosing subvolume would need to be snapshot.

> That seems like an anti feature that violates the principle of least surprise. If I make a snapshot of my root, I *expect* it to contain my logs.

You can only roll back that which you snapshot. If you snapshot a root without excluding journals, then if you roll back, you roll back the journals. That's data loss.

(open)SUSE has a snapshot/rollback regime configured and enabled by default out of the box. Logs are excluded from it, same as the bootloader. (Although I'll also note they default to volatile systemd journals, and use rsyslogd for persistent logs.) Fedora meanwhile does have persistent journald journals in the root subvolume, but there's no snapshot/rollback regime enabled out of the box. I'm inclined to have them excluded, not so much to avoid cow of the nodatacow journals, but to avoid discontinuity in the journals upon rollback.

> > I don't get the iops thing at all. What we care about in this case is latency. A least-noticeable latency of around 150ms seems reasonable as a starting point; that's where users notice a delay between a key press and a character appearing. However, if I check for 10ms latency (using bcc-tools fileslower) when reading all of the above journals at once:
> >
> > $ sudo journalctl -D /mnt/varlog33/journal/b51b4a725db84fd286dcf4a790a50a1d/ --no-pager
> >
> > Not a single report. None. Nothing took even 10ms. And those journals are more fragmented than your 20 in a 100MB file.
> >
> > I don't have any hard drives to test this on. This is what, 10% of the market at this point? The best you can do there is the same as on SSD.
>
> The above sounded like great data, but not if it was done on SSD.

Right. But also I can't disable the defragmentation in order to do a proper test on HDD.

> > You can't depend on sysfs to conditionally do defragmentation on only rotational media, too many fragile media claim to be rotating.
Re: [systemd-devel] consider dropping defrag of journals on btrfs
More data points.

1. An ext4 file system with a 112M system.journal; it has 15 extents. From FIEMAP we can pretty much see it's really made from 14 8MB extents, consistent with multiple appends. And it's the exact same behavior seen on Btrfs with nodatacow journals.
https://pastebin.com/6vuufwXt

2. A Btrfs file system with a 24MB system.journal, nodatacow, 4 extents. The fragments are consistent with #1 as a result of nodatacow journals.
https://pastebin.com/Y18B2m4h

3. Continuing from #2, 'journalctl --rotate' strace shows this results in:

ioctl(31, BTRFS_IOC_DEFRAG) = 0

filefrag shows the result, 17 extents. But this is misleading because 9 of them are in the same position as before, so it seems to be a minimalist defragment. Btrfs did what was requested but with both limited impact and efficacy, at least on nodatacow files having minimal fragmentation to begin with.
https://pastebin.com/1ufErVMs

4. Continuing from #3, 'btrfs fi defrag -l 32M' pointed at this same file results in a single-extent file. strace shows this uses

ioctl(3, BTRFS_IOC_DEFRAG_RANGE, {start=0, len=33554432, flags=0, extent_thresh=33554432, compress_type=BTRFS_COMPRESS_NONE}) = 0

and filefrag shows the single extent mapping:
https://pastebin.com/429fZmNB

While this is a numeric improvement (no fragmentation), again there's no proven advantage to defragmenting nodatacow journals on Btrfs. It's just needlessly contributing to write amplification.

--

The original commit description only mentions COW; it doesn't mention being predicated on nodatacow. In effect commit f27a386430cc7a27ebd06899d93310fb3bd4cee7 is obviated by commit 3a92e4ba470611ceec6693640b05eb248d62e32d four months later. I don't think they were ever intended to be used together, and combining them seems accidental. Defragmenting datacow files makes some sense on rotating media. But that's the exception, not the rule.
-- Chris Murphy
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Fri, Feb 5, 2021 at 3:55 PM Lennart Poettering wrote:

> On Fr, 05.02.21 20:58, Maksim Fomin (ma...@fomin.one) wrote:
>
> > > You know, we issue the btrfs ioctl, under the assumption that if the file is already perfectly defragmented it's a NOP. Are you suggesting it isn't a NOP in that case?
> >
> > So, what is the reason for defragmenting the journal if BTRFS is detected? This does not happen at other filesystems. I have read this thread but have not found a clear answer to this question.
>
> btrfs like any file system fragments files with nocow a bit. Without nocow (i.e. with cow) it fragments files horribly, given our write pattern (which is: append something to the end, and update a few pointers in the beginning). By upstream default we set nocow, some downstreams/users undo that however. (this is done via tmpfiles, i.e. journald doesn't actually set nocow ever).

I don't see why it's upstream's problem to solve downstream decisions. If they want to (re)enable datacow, then they can also set up some kind of service to defragment /var/log/journal/ on a schedule, or they can use autodefrag.

> When we archive a journal file (i.e. stop writing to it) we know it will never receive any further writes. It's a good time to undo the fragmentation (we make no distinction whether heavily fragmented, little fragmented or not at all fragmented on this) and thus for the future make access behaviour better, given that we'll still access the file regularly (because archiving in journald doesn't mean we stop reading it, it just means we stop writing it — journalctl always operates on the full data set). defragmentation happens in the bg once triggered, it's a simple ioctl you can invoke on a file. if the file is not fragmented it shouldn't do anything.

ioctl(3, BTRFS_IOC_DEFRAG_RANGE, {start=0, len=16777216, flags=0, extent_thresh=33554432, compress_type=BTRFS_COMPRESS_NONE}) = 0

What 'len' value does journald use?
> other file systems simply have no such ioctl, and they never fragment as terribly as btrfs can fragment. hence we don't call that ioctl.

I did explain how to avoid the fragmentation in the first place, to obviate the need to defragment:

1. nodatacow. journald does this already.
2. fallocate the intended final journal file size from the start, instead of growing the files in 8MB increments.
3. Don't reflink copy (including snapshot) the journals. This arguably is not journald's responsibility, but as it creates both the journal/ directory and the $MACHINEID directory, it could make one or both of them subvolumes instead, to ensure they're not subject to snapshotting from above.

> I'd even be fine dropping it entirely, if someone actually can show the benefits of having the files unfragmented when archived don't outweigh the downside of generating some iops when executing the defragmentation.

I showed that the archived journals have way more fragmentation than active journals. And the fragments in active journals are insignificant, and can even be reduced by fully allocating the journal file to its final size rather than appending - which has a good chance of fragmenting the file on any file system, not just Btrfs.

Further, even *despite* this worse fragmentation of the archived journals, bcc-tools fileslower shows no meaningful latency as a result. I wrote this in the previous email. I don't understand what you want me to show you. And since journald offers no ability to disable the defragment on Btrfs, I can't really do a longer-term A/B comparison, can I?

> i.e. someone does some profiling, on both ssd and rotating media. Apparently noone who cares about this apparently wants to do such research though, and hence I remain deeply unimpressed. Let's not try to do such optimizations without any data that actually shows it betters things.

I did provide data. That you don't like what the data shows (archived journals have more fragments than active journals) is not my fault. The existing "optimization" is making things worse, in addition to adding a pile of unnecessary writes upon journal rotation.

Conversely, you have not provided data proving that nodatacow fallocated files on Btrfs are any more fragmented than fallocated files on ext4 or XFS. 2-17 fragments on ext4:
https://pastebin.com/jiPhrDzG
https://pastebin.com/UggEiH2J

That behavior is no different for nodatacow fallocated journals on Btrfs. There's no point in defragmenting these no matter the file system. I don't have to profile this on HDD; I know that even in the best case you're not likely (certainly not guaranteed) to get fewer fragments than this. Defrag on Btrfs is for the thousands-of-fragments case, which is what you get with datacow journals.

-- Chris Murphy
Re: [systemd-devel] udev and btrfs multiple devices
On Thu, Feb 4, 2021 at 6:28 AM Lennart Poettering wrote:

> On Mi, 03.02.21 22:32, Chris Murphy (li...@colorremedies.com) wrote:
>
> > It doesn't. It waits indefinitely.
> >
> > [* ] A start job is running for /dev/disk/by-uuid/cf9c9518-45d4-43d6-8a0a-294994c383fa (12min 36s / no limit)
>
> Is this on encrypted media?

No. Plain partitions.

-- Chris Murphy
Re: [systemd-devel] consider dropping defrag of journals on btrfs
> ...bad access patterns because the archived files are fragmented.

Right. So pick a size for the journal file, I don't really care what it is, but they seem to get upwards of 128MB in size so just use that. Make a 128MB file from the very start, fallocate it, and then when full, rotate and create a new one. Stop the anti-pattern of tacking on in 8MB increments. And stop defragmenting them. That is the best scenario for HDD, USB sticks, and NVMe.

Looking at the two original commits, I think they were always in conflict with each other, happening within months of each other. They are independent ways of dealing with the same problem, where only one of them is needed. And the best of the two is fallocate+nodatacow, which makes the journals behave the same as on ext4, where you also don't do defragmentation.

-- Chris Murphy
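The difference between the two allocation strategies is just where the fallocate calls land; a sketch of both, using the 8MB-increment and 128MB-up-front figures from the thread:

```python
import os

MB = 1024 * 1024

def grow_incrementally(fd: int, final: int = 128 * MB, step: int = 8 * MB) -> None:
    """Append-style growth, as journald does today: each extension is
    a separate space reservation and may land in a separate extent."""
    size = 0
    while size < final:
        os.posix_fallocate(fd, size, step)
        size += step

def grow_upfront(fd: int, final: int = 128 * MB) -> None:
    """Reserve the final size in one call; the filesystem gets one
    chance to find contiguous space. Note fallocate only reserves
    space; no filesystem guarantees the range is contiguous."""
    os.posix_fallocate(fd, 0, final)
```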
Re: [systemd-devel] udev and btrfs multiple devices
On Wed, Feb 3, 2021 at 10:32 PM Chris Murphy wrote:

> On Thu, Jan 28, 2021 at 7:18 AM Lennart Poettering wrote:
> > On Mi, 27.01.21 17:19, Chris Murphy (li...@colorremedies.com) wrote:
> >
> > > Is it possible for a udev rule to have a timeout? For example: /usr/lib/udev/rules.d/64-btrfs.rules
> > >
> > > This udev rule will wait indefinitely for a missing device to appear.
> >
> > Hmm, no, that's a misunderstanding. "rules" can't "wait". The activation of the btrfs file system won't happen, but that should then be caught by systemd mount timeouts and put you into recovery mode.
>
> It doesn't. It waits indefinitely.
>
> [* ] A start job is running for /dev/disk/by-uuid/cf9c9518-45d4-43d6-8a0a-294994c383fa (12min 36s / no limit)

https://github.com/systemd/systemd/issues/18466

-- Chris Murphy
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Wed, Feb 3, 2021 at 9:46 AM Lennart Poettering wrote:

> Performance is terrible if cow is used on journal files while we write them.

I've done it for a year on NVMe. The latency is so low, it doesn't matter.

> It would be great if we could turn datacow back on once the files are archived, and then take benefit of compression/checksumming and stuff. not sure if there's any sane API for that in btrfs besides rewriting the whole file, though. Anyone knows?

A compressed file results in a completely different encoding and extent size, so it's a complete rewrite of the whole file, regardless of the cow/nocow status. Without compression it'd still be a rewrite, because in effect it's a different extent type that comes with checksums. i.e. a reflink copy of a nodatacow file can only be a nodatacow file; a reflink copy of a datacow file can only be a datacow file. The conversion between them is basically 'cp --reflink=never' and you get a complete rewrite. But you get a complete rewrite of extents by submitting for defragmentation too, depending on the target extent size.

It is possible to do what you want by no longer setting nodatacow on the enclosing dir. Create a 0-length journal file, set nodatacow on that file, then fallocate it. That gets you a nodatacow active journal. And then you can just duplicate it in place with a new name, and the result will be datacow and automatically compressed if compression is enabled. But the write hit has already happened by writing journal data into this journal file during its lifetime. Just rename it on rotate. That's the least IO impact possible at this point. Defragmenting it means even more writes, and not much of a gain if any, unless it's datacow, which isn't the journald default.

-- Chris Murphy
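The create-empty, mark-nodatacow, then-fallocate sequence described here maps to an FS_IOC_SETFLAGS ioctl between open and fallocate; the +C flag only sticks while the file is still zero-length. A sketch, assuming the flag constants from linux/fs.h, written best-effort so it degrades gracefully on filesystems without the flag:

```python
import fcntl
import os
import struct

FS_IOC_GETFLAGS = 0x80086601  # _IOR('f', 1, long)
FS_IOC_SETFLAGS = 0x40086602  # _IOW('f', 2, long)
FS_NOCOW_FL = 0x00800000      # the chattr +C bit

def create_nocow_journal(path: str, size: int) -> bool:
    """Create an empty file, try to mark it NOCOW while it is still
    zero-length (the flag has no effect once data blocks exist), then
    fallocate the full size. Returns whether NOCOW was applied; on
    filesystems that don't support the flag it is silently skipped."""
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o640)
    nocow = False
    try:
        try:
            flags = struct.unpack("i", fcntl.ioctl(
                fd, FS_IOC_GETFLAGS, struct.pack("i", 0)))[0]
            fcntl.ioctl(fd, FS_IOC_SETFLAGS,
                        struct.pack("i", flags | FS_NOCOW_FL))
            nocow = True
        except OSError:
            pass  # e.g. not Btrfs
        os.posix_fallocate(fd, 0, size)
    finally:
        os.close(fd)
    return nocow
```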
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Wed, Feb 3, 2021 at 9:41 AM Lennart Poettering wrote: > > On Di, 05.01.21 10:04, Chris Murphy (li...@colorremedies.com) wrote: > > > f27a386430cc7a27ebd06899d93310fb3bd4cee7 > > journald: whenever we rotate a file, btrfs defrag it > > > > Since systemd-journald sets nodatacow on /var/log/journal the journals > > don't really fragment much. I typically see 2-4 extents for the life > > of the journal, depending on how many times it's grown, in what looks > > like 8MiB increments. The defragment isn't really going to make any > > improvement on that, at least not worth submitting it for additional > > writes on SSD. While laptop and desktop SSD/NVMe can handle such a > > small amount of extra writes with no meaningful impact to wear, it > > probably does have an impact on much more low end flash like USB > > sticks, eMMC, and SD Cards. So I figure, let's just drop the > > defragmentation step entirely. > > Quite frankly, given how iops-expensive btrfs is, one probably > shouldn't choose btrfs for such small devices anyway. It's really not > where btrfs shines, last time I looked. Btrfs aggressively delays metadata and data allocation, so I don't agree that it's expensive. There is a wandering trees problem that can result in write amplification, that's a different problem. But via native compression overall writes are proven to significantly reduce overall writes. But in any case, reading a journal file and rewriting it out, which is what defragment does, doesn't really have any benefit given the file doesn't fragment much anyway due to (a) nodatacow and (b) fallocate, which is what systemd-journald does on Btrfs. It'd make more sense to defragment only if the file is datacow. At least then it also gets compressed, which isn't the case when it's nodatacow. > > > Further, since they are nodatacow, they can't be submitted for > > compression. There was a quasi-bug in Btrfs, now fixed, where > > nodatacow files submitted for decompression were compressed. 
So we no > > longer get that unintended benefit. This strengthens the case to just > > drop the defragment step upon rotation, no other changes. > > > > What do you think? > > Did you actually check the iops this generates? I don't understand the relevance. > > Not sure it's worth doing these kinds of optimizations without any hard > data how expensive this really is. It would be premature. Submitting the journal for defragment in effect duplicates the journal: read all extents, and rewrite those blocks to a new location. It doubles the writes for that journal file. It's not like the defragment is free. > That said, if there's actual reason to optimize the iops here then we > could make this smart: there's actually an API for querying > fragmentation: we could defrag only if we notice the fragmentation is > really too high. FIEMAP isn't going to work in the case the files are compressed. The Btrfs extent size becomes 128KiB in that case, and it looks like massive fragmentation. So that needs to be made smarter first. I don't have a problem submitting the journal for a one-time defragment upon rotation if it's datacow, if an empty journal-nocow.conf exists. But by default, the combination of fallocate and nodatacow already avoids all meaningful fragmentation, so long as the journals aren't being snapshot. If they are, well, that too is a different problem. If the user does that and we're still defragmenting the files, it'll explode their space consumption because defragment is not snapshot aware; it results in all shared extents becoming unshared. > But quite frankly, this sounds like polishing things after the horse > already left the stable: if you want to optimize iops, then don't use > btrfs. If you bought into btrfs, then apparently you are OK with the > extra iops it generates, hence also the defrag costs. Somehow I think you're missing what I'm asking for, which is to stop the unnecessary defragment step because it's not an optimization. 
It doesn't meaningfully reduce fragmentation at all, it just adds write amplification. -- Chris Murphy ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/systemd-devel
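The NOCOW claim above can be spot-checked without btrfs-progs. A Python sketch using the FS_IOC_GETFLAGS ioctl; the constants are transcribed from linux/fs.h, the path glob is illustrative, and an empty result just means no persistent journal files are present or readable:

```python
# Check whether journal files carry the NOCOW ('C') attribute that
# lsattr would print. Constants from linux/fs.h (64-bit layout).
import fcntl
import glob
import struct

FS_IOC_GETFLAGS = 0x80086601   # _IOR('f', 1, long) on 64-bit Linux
FS_NOCOW_FL = 0x00800000       # the 'C' attribute

results = []
for path in glob.glob("/var/log/journal/*/*.journal"):
    try:
        with open(path, "rb") as f:
            buf = bytearray(8)
            fcntl.ioctl(f.fileno(), FS_IOC_GETFLAGS, buf)
    except OSError:
        continue               # unreadable file, or fs without attr support
    flags = struct.unpack("l", buf)[0]
    results.append((path, "NOCOW" if flags & FS_NOCOW_FL else "datacow"))

for path, state in results:
    print(path, state)
```

On a stock setup every file should report NOCOW, consistent with the thread's premise; `filefrag` from e2fsprogs can then confirm the low extent counts mentioned above.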
Re: [systemd-devel] udev and btrfs multiple devices
On Thu, Jan 28, 2021 at 7:18 AM Lennart Poettering wrote: > > On Mi, 27.01.21 17:19, Chris Murphy (li...@colorremedies.com) wrote: > > > Is it possible for a udev rule to have a timeout? For example: > > /usr/lib/udev/rules.d/64-btrfs.rules > > > > This udev rule will wait indefinitely for a missing device to > > appear. > > Hmm, no, that's a misunderstanding. "rules" can't "wait". The > activation of the btrfs file system won't happen, but that should then > be caught by systemd mount timeouts and put you into recovery mode. It doesn't. It waits indefinitely. [* ] A start job is running for /dev/disk/by-uuid/cf9c9518-45d4-43d6-8a0a-294994c383fa (12min 36s / no limit) -- Chris Murphy
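For reference, that indefinite wait can be capped per device in /etc/fstab with x-systemd.device-timeout= (a documented systemd.mount option; the UUID and timeout value here are illustrative):

```
UUID=cf9c9518-45d4-43d6-8a0a-294994c383fa  /srv  btrfs  defaults,nofail,x-systemd.device-timeout=90s  0 0
```

With a finite device timeout plus nofail, boot proceeds after the timeout instead of sitting on a "no limit" start job.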
Re: [systemd-devel] udev and btrfs multiple devices
On Thu, Jan 28, 2021 at 1:03 AM Greg KH wrote: > > On Wed, Jan 27, 2021 at 05:19:38PM -0700, Chris Murphy wrote: > > > > Next, is it possible to enhance udev so that it can report the number > > of devices expected for a Btrfs file system? This information is > > currently in the Btrfs superblock found on each device in the > > num_devices field. > > https://github.com/storaged-project/udisks/pull/838#issuecomment-768372627 > > It's not up to udev to report that, but rather have either the kernel > export that, or have the tool that udev calls determine that. I mean expose in udevadm info, e.g. E: ID_BTRFS_NUM_DEVICES=4 -- Chris Murphy
[systemd-devel] udev and btrfs multiple devices
Is it possible for a udev rule to have a timeout? For example: /usr/lib/udev/rules.d/64-btrfs.rules This udev rule will wait indefinitely for a missing device to appear. It'd be better if it gave up at some point and dropped to a dracut shell. Is that possible? The only alternative right now is the user has to force power off, and boot with something like rd.break=pre-mount, although I'm not 100% certain that'll break soon enough to avoid the hang. Next, is it possible to enhance udev so that it can report the number of devices expected for a Btrfs file system? This information is currently in the Btrfs superblock found on each device in the num_devices field. https://github.com/storaged-project/udisks/pull/838#issuecomment-768372627 Thanks, -- Chris Murphy
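For context, the rule being referenced looks approximately like this (paraphrased from systemd's rules.d/64-btrfs.rules; exact contents vary by version). Note that the rule itself never waits; it only marks the device not ready for systemd:

```
SUBSYSTEM!="block", GOTO="btrfs_end"
ACTION=="remove", GOTO="btrfs_end"
ENV{ID_FS_TYPE}!="btrfs", GOTO="btrfs_end"

# register the device with the kernel and ask whether the fs is complete
IMPORT{builtin}="btrfs ready $devnode"

# if not all member devices have been seen yet, hide it from systemd
ENV{ID_BTRFS_READY}=="0", ENV{SYSTEMD_READY}="0"

LABEL="btrfs_end"
```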
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Tue, Jan 5, 2021 at 10:04 AM Chris Murphy wrote: > > f27a386430cc7a27ebd06899d93310fb3bd4cee7 > journald: whenever we rotate a file, btrfs defrag it > > Since systemd-journald sets nodatacow on /var/log/journal the journals > don't really fragment much. I typically see 2-4 extents for the life > of the journal, depending on how many times it's grown, in what looks > like 8MiB increments. The defragment isn't really going to make any > improvement on that, at least not worth submitting it for additional > writes on SSD. While laptop and desktop SSD/NVMe can handle such a > small amount of extra writes with no meaningful impact to wear, it > probably does have an impact on much more low end flash like USB > sticks, eMMC, and SD Cards. So I figure, let's just drop the > defragmentation step entirely. > > Further, since they are nodatacow, they can't be submitted for > compression. There was a quasi-bug in Btrfs, now fixed, where > nodatacow files submitted for defragmentation were compressed. So we no > longer get that unintended benefit. This strengthens the case to just > drop the defragment step upon rotation, no other changes. > > What do you think? A better idea. Default behavior: journals are nodatacow and are not defragmented. If '/etc/tmpfiles.d/journal-nocow.conf' exists, do the reverse. Journals are datacow, and files are defragmented (and compressed, if it's enabled). -- Chris Murphy
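As it happens, later systemd releases ship the NOCOW setup as a tmpfiles.d entry, which makes an override along the lines proposed here straightforward (contents approximate; check the file shipped with your version):

```
# /usr/lib/tmpfiles.d/journal-nocow.conf (as shipped): sets the NOCOW
# attribute on the journal directory so new journal files inherit it.
h /var/log/journal - - - - +C
```

Masking it with an empty file of the same name under /etc/tmpfiles.d/ leaves journals copy-on-write, and therefore compressible.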
Re: [systemd-devel] Antw: [EXT] emergency shutdown, don't wait for timeouts
On Mon, Jan 4, 2021 at 12:43 PM Phillip Susi wrote: > > > Reindl Harald writes: > > > i have seen "user manager" instances hanging for way too long and way > > more than 3 minutes over the last 10 years > > The default timeout is 3 minutes iirc, so at that point it should be > forcibly killed. Hi, This is too long for a desktop or laptop use case. It should be around 15-20 seconds. It's completely reasonable for users to reach for the power button and force it off by 30 seconds. Fedora Workstation Working Group is tracking an issue expressly to get to around 20 seconds (or better). https://pagure.io/fedora-workstation/issue/163 It is a given there will be some kind of state or data loss by just forcing a shutdown. I think what we need is for the console, revealed by ESC, to contain sufficient information on what is holding back the reboot/shutdown and why, so that we can figure out why those processes aren't terminating fast enough and get them fixed. A journaled file system should just do log replay at the next mount and the file system itself will be fine. Fine means consistent. But for overwriting file systems, files could be left in an in-between state. It just depends on what's being written to, and when and how. A COW file system can better survive an abrupt poweroff since nothing is being overwritten. But I'm skeptical that just virtually pulling the power cord is such a great idea to depend on. And for offline updates, we'd want to inhibit the aggressive reboot/shutdown, to ensure updating is complete and all writes are on stable media. But for the aggressive shutdown case, some way of forcing remount ro? Or possibly FIFREEZE/FITHAW? Some boot/bootloader folks have asked fs devs for an atomic freeze+thaw ioctl, i.e. one that is guaranteed to return to thaw. But this has been rebuffed so far. While thaw seems superfluous for the use case under discussion, it's possible the poweroff command will be blocked by the freeze. 
And the thaw itself can be blocked by the freeze, when sysroot is the file system being frozen. -- Chris Murphy
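The freeze/thaw hazard described above, sketched as pseudocode (FIFREEZE and FITHAW are the real ioctls from linux/fs.h; everything else here is illustrative):

```
fd = open("/sysroot", O_RDONLY)
ioctl(fd, FIFREEZE)    # all new writes to the fs now block in the kernel
run_shutdown_steps()   # must not touch the frozen fs...
ioctl(fd, FITHAW)      # ...because even reaching this call can deadlock if
                       # the caller (or poweroff itself) needs a write first
```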
[systemd-devel] consider dropping defrag of journals on btrfs
f27a386430cc7a27ebd06899d93310fb3bd4cee7 journald: whenever we rotate a file, btrfs defrag it Since systemd-journald sets nodatacow on /var/log/journal the journals don't really fragment much. I typically see 2-4 extents for the life of the journal, depending on how many times it's grown, in what looks like 8MiB increments. The defragment isn't really going to make any improvement on that, at least not worth submitting it for additional writes on SSD. While laptop and desktop SSD/NVMe can handle such a small amount of extra writes with no meaningful impact to wear, it probably does have an impact on much more low end flash like USB sticks, eMMC, and SD Cards. So I figure, let's just drop the defragmentation step entirely. Further, since they are nodatacow, they can't be submitted for compression. There was a quasi-bug in Btrfs, now fixed, where nodatacow files submitted for defragmentation were compressed. So we no longer get that unintended benefit. This strengthens the case to just drop the defragment step upon rotation, no other changes. What do you think? -- Chris Murphy
Re: [systemd-devel] btrfs raid not ready but systemd tries to mount it anyway
On Mon, Oct 12, 2020 at 1:33 AM Lennart Poettering wrote: > > On So, 11.10.20 14:57, Chris Murphy (li...@colorremedies.com) wrote: > > > Hi, > > > > A Fedora 32 (systemd-245.8-2.fc32) user has a 10-drive Btrfs raid1 set > > to mount in /etc/fstab: > > > > UUID=f89f0a16- /srv btrfs defaults,nofail,x-systemd.requires=/ > > 0 0 > > > > For some reason, systemd is trying to mount this file system before > > all ten devices are ready. Supposedly this rule applies: > > https://github.com/systemd/systemd/blob/master/rules.d/64-btrfs.rules.in > > udev calls the btrfs ready ioctl whenever a new btrfs fs block device > shows up. The ioctl will fail as long as not all devices that make up > the fs have shown up. It succeeds once all devices for the fs are > there. i.e. for n=10 devices it will return failure 9 times, and > success the 1 final time. > > When precisely it returns success or failure is entirely up to the btrfs > kernel > code. systemd/udev doesn't have any control on that. The udev btrfs > builtin is too trivial for that: it just calls the ioctl and that > pretty much is it. What does this line mean? Does it mean the 'btrfs ready' ioctl has been called at this moment and the device is ready? i.e. this specific device is ready now, but not before now? [ 30.923721] kernel: BTRFS: device label BTRFS_RAID1_srv devid 1 transid 60815 /dev/sdg scanned by systemd-udevd (710) Because I see six such lines for this file system before the mount attempt. And four such lines after the mount attempt. If "all devices ready" is not true until the last such line appears, then the mount is happening too soon for some reason. > For historical reasons udev log level is independent from the rest of > systemd log level. Thus use udev.log_priority=debug to turn on udev > debug logging. I'll have him retry with udev.log_priority=debug and if I get a moment I'll try to reproduce. 
The difficulty is that truly missing devices are easy to reproduce, and that case appears to work; here the devices are merely late to be scanned for whatever reason (maybe they take longer to spin up, maybe the HBA they're connected to is slow or has a later-loading driver, etc.). -- Chris Murphy
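The "btrfs ready" call described above can be sketched in Python for illustration (the real udev builtin is C inside systemd; the ioctl number is computed from its _IOR() components rather than hardcoded, and the fallback paths are assumptions for machines without a /dev/btrfs-control node):

```python
# Sketch of what udev's "btrfs ready" builtin amounts to: ask the kernel,
# via /dev/btrfs-control, whether all member devices of the fs that this
# device belongs to have been seen yet.
import fcntl
import os
import struct
import sys

BTRFS_IOCTL_MAGIC = 0x94
VOL_ARGS_SIZE = 8 + 4088   # struct btrfs_ioctl_vol_args: __s64 fd + name[4088]

def _IOR(magic: int, nr: int, size: int) -> int:
    # Linux _IOR(): direction=read (2) in the top bits, then size, type, number.
    return (2 << 30) | (size << 16) | (magic << 8) | nr

BTRFS_IOC_DEVICES_READY = _IOR(BTRFS_IOCTL_MAGIC, 39, VOL_ARGS_SIZE)

def btrfs_ready(devnode: str) -> str:
    args = bytearray(struct.pack("q4088s", 0, devnode.encode()))
    try:
        fd = os.open("/dev/btrfs-control", os.O_RDWR)
    except OSError:
        return "unavailable"   # no btrfs-control node, or no permission, here
    try:
        fcntl.ioctl(fd, BTRFS_IOC_DEVICES_READY, args)
        return "ready"         # kernel has seen every member device
    except OSError:
        return "not-ready"     # at least one member device still missing
    finally:
        os.close(fd)

result = btrfs_ready(sys.argv[1] if len(sys.argv) > 1 else "/dev/sda")
print(result)
```

Each time udev processes a btrfs device it makes essentially this one call; "not ready" just sets SYSTEMD_READY=0, which is why a rule cannot wait.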
Re: [systemd-devel] btrfs raid not ready but systemd tries to mount it anyway
On Sun, Oct 11, 2020 at 11:56 PM Andrei Borzenkov wrote: > > 11.10.2020 23:57, Chris Murphy wrote: > > Hi, > > > > A Fedora 32 (systemd-245.8-2.fc32) user has a 10-drive Btrfs raid1 set > > to mount in /etc/fstab: > > > > UUID=f89f0a16- /srv btrfs defaults,nofail,x-systemd.requires=/ > > 0 0 > > > > For some reason, systemd is trying to mount this file system before > > all ten devices are ready. Supposedly this rule applies: > > https://github.com/systemd/systemd/blob/master/rules.d/64-btrfs.rules.in > > > > Fedora does have /usr/lib/udev/rules.d/64-btrfs.rules but I find no > > reference at all to this rule when the user boots with 'rd.udev.debug > > systemd.log_level=debug'. The entire journal is here: > > > > https://drive.google.com/file/d/1jVHjAQ8CY9vABtM2giPTB6XeZCclm7R-/view > > > > Educated guess - rule is missing in initrd and you do not run udev > trigger after switch to root. I will ask the user to double check their initrd, but mine definitely has it without any initrd/dracut related customizations. $ sudo lsinitrd initramfs-5.8.8-200.fc32.x86_64.img | grep btrfs btrfs -rw-r--r-- 1 root root 616 May 29 12:35 usr/lib/udev/rules.d/64-btrfs.rules -- Chris Murphy
[systemd-devel] btrfs raid not ready but systemd tries to mount it anyway
Hi, A Fedora 32 (systemd-245.8-2.fc32) user has a 10-drive Btrfs raid1 set to mount in /etc/fstab: UUID=f89f0a16- /srv btrfs defaults,nofail,x-systemd.requires=/ 0 0 For some reason, systemd is trying to mount this file system before all ten devices are ready. Supposedly this rule applies: https://github.com/systemd/systemd/blob/master/rules.d/64-btrfs.rules.in Fedora does have /usr/lib/udev/rules.d/64-btrfs.rules but I find no reference at all to this rule when the user boots with 'rd.udev.debug systemd.log_level=debug'. The entire journal is here: https://drive.google.com/file/d/1jVHjAQ8CY9vABtM2giPTB6XeZCclm7R-/view I expect a workaround would be to use mount option: x-systemd.automount,noauto,nofail,x-systemd.requires=/ In fact, I'm not sure x-systemd.requires is needed, because / must be mounted successfully to read /etc/fstab in the first place, in order to know to mount this file system at /srv. Anyway, I'm mainly confused why the btrfs udev rule is seemingly not applied in this case. -- Chris Murphy
Re: [systemd-devel] [Help] Can't log in to homed user account: "No space left on device"
On Mon, Aug 24, 2020 at 2:44 AM Andrii Zymohliad wrote: > > > I suspect that the workaround until > > this is figured out why the fallocate fails (I suspect shared extents, > > there's evidence this home file has been snapshot and I don't know > > what effect that has on fallocating the file) is to use > > --luks-discard=true ? That should avoid the need to fallocate when > > opening. > > Just to confirm, "homectl update --luks-discard=on azymohliad" fixed the > issue for me. I can log in again. Thanks a lot to Chris and Andrei! > > > > And the user will have to be vigilant about the space usage > > of user home because it's now possible to overcommit. > > I guess it's better to reduce home size (to something like 300-350G in my > case) to decrease the probability of overcommit? > Yes. But mainly you'll just want to keep an eye on the underlying file system free space. If its free space runs out, the home file system won't know this and can try to write anyway, and then we get a pretty icky kind of failure with the home file system, possibly very bad. Of course this is a temporary situation; we need to find a long term solution because it's definitely not intended that the user babysit the two file systems like this. There should be built-in logic to make sure It Just Works. But also in your case you should try to find out why there are shared extents. That's more of a btrfs question so I'll resume that conversation on the btrfs list. -- Chris Murphy
Re: [systemd-devel] [Help] Can't log in to homed user account: "No space left on device"
On Sun, Aug 23, 2020 at 6:47 AM Andrei Borzenkov wrote: > > 23.08.2020 15:34, Andrii Zymohliad wrote: > >> Here is the log after authentication attempt: > >> https://gitlab.com/-/snippets/2007113 > >> And just in case here is the full log since boot: > >> https://gitlab.com/-/snippets/2007112 > > > > Sorry, links are broken, re-uploaded: > > > > Authentication part: https://gitlab.com/-/snippets/2007123 > > Full log: https://gitlab.com/-/snippets/2007124 > > > > Yes, as suspected: > > > сер 23 14:12:48 az-wolf-pc systemd-homed[917]: Not enough disk space > to fully allocate home. > > This comes from > > if (fallocate(backing_fd, FALLOC_FL_KEEP_SIZE, 0, st->st_size) < > 0) { > > ... > if (ERRNO_IS_DISK_SPACE(errno)) { > log_debug_errno(errno, "Not enough disk space to > fully allocate home."); > return -ENOSPC; /* make recognizable */ > } > > return log_error_errno(errno, "Failed to allocate > backing file blocks: %m"); > } > > So fallocate syscall failed. Try manually > > fallocate -l 403G -n /home/azymohliad.home > > if it fails too, the question is better asked on btrfs list. User reports this from 'homectl inspect' LUKS Discard: online=no offline=yes Does this mean 'fstrim' is issued before luksClose? And 'fallocate' is issued before luksOpen? If so, it seems it'd be trivial to run into a fatal attempt to activate, just by deactivating a user home of default size, and then consuming free space on the underlying volume, such that there isn't enough free space to fallocate the home file before opening it again. What am I missing? What it seems homed needs to do if fallocate fails for whatever reason is to have some kind of fallback. Otherwise the user is stuck being unable to open their user home. I suspect that the workaround, until it's figured out why the fallocate fails (I suspect shared extents; there's evidence this home file has been snapshot and I don't know what effect that has on fallocating the file), is to use --luks-discard=true? 
That should avoid the need to fallocate when opening. And the user will have to be vigilant about the space usage of user home because it's now possible to overcommit. -- Chris Murphy
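The fallocate(FALLOC_FL_KEEP_SIZE, ...) call quoted earlier in the thread can be demonstrated in isolation. A Linux-only sketch using ctypes; the temp file and the 1 MiB length are arbitrary:

```python
# Reserve blocks for a file without changing its apparent size, the same
# mode the quoted homed code uses when reopening a LUKS home file.
import ctypes
import ctypes.util
import os
import tempfile

FALLOC_FL_KEEP_SIZE = 0x01     # from linux/falloc.h

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                           ctypes.c_long, ctypes.c_long]
libc.fallocate.restype = ctypes.c_int

fd, path = tempfile.mkstemp()
try:
    if libc.fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 1 << 20) != 0:
        status, apparent_size = "unsupported", None   # fs lacks fallocate
    else:
        # st_size stays 0 even though ~1 MiB of blocks is now reserved
        status, apparent_size = "ok", os.stat(path).st_size
finally:
    os.close(fd)
    os.remove(path)

print(status, apparent_size)
```

This is also why the call can fail with ENOSPC on a nearly full underlying filesystem even though the file's apparent size never changes.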
Re: [systemd-devel] kernel messages not making it to journal
On Thu, Jun 4, 2020 at 5:30 AM Michal Koutný wrote: > > Hi. > > On Mon, Jun 01, 2020 at 07:11:15PM -0600, Chris Murphy > wrote: > > But journalctl does not show it at all. Seems like it might be a bug, > > I expect it to be recorded in the journal, not only found in dmesg. > Journald fetches dmesg messages too (see journald.conf:ReadKMsg=). It's > not clear whether you run journalctl as root or non-privileged user that > may not have access to the system-wide kernel messages. > > If you don't see the messages in journal as root and you can reproduce > it, I suggest you file an issue on Github [1]. It's 100% reproducible. https://github.com/systemd/systemd/issues/16173 -- Chris Murphy
[systemd-devel] kernel messages not making it to journal
dmesg shows this: [ 22.947118] systemd-journald[629]: File /var/log/journal/26336922e1044e80ae4bd42e1d6b9099/user-1000.journal corrupted or uncleanly shut down, renaming and replacing. [ 22.953883] systemd-journald[629]: Creating journal file /var/log/journal/26336922e1044e80ae4bd42e1d6b9099/user-1000.journal on a btrfs file system, and copy-on-write is e But journalctl does not show it at all. Seems like it might be a bug, I expect it to be recorded in the journal, not only found in dmesg. systemd-245.4-1.fc32.x86_64 Thanks, -- Chris Murphy
Re: [systemd-devel] How to use systemd-repart partitions?
On Wed, May 20, 2020 at 4:01 AM Lennart Poettering wrote: > > On Mi, 20.05.20 00:12, Tobias Hunger (tobias.hun...@gmail.com) wrote: > > > > > The one thing that is frustrating is to get a machine image generated > > by my build server onto a new piece of hardware. So I wanted to see > > how far I can get with systemd-repart and co. to get this initial > > deployment to new hardware more automated after booting with the > > machine image from an USB stick. > > So I eventually want to cover three usecases with systemd-repart: > > 1. building OS images > 2. installing an OS from an installer medium onto a host medium > 3. adapting an OS images to the medium it has been dd'ed to on first >boot > > I think the 3rd usecase we already cover quite OK. > > To deliver the others, I want to add Encrypt= and Format= as > mentioned. To cover the 2nd usecase I then also want to provide > CopyBlocks= and CopyFiles=. The former would stream data into a > partition that is created on the block level. Primary usecase would be > to copy a partition 1:1 from the installer medium onto the host > medium. The latter would copy in a file tree on the fs level. Future feature for the former case: - Btrfs seed/sprout feature expressly supports this use case for replicating a seed image when destination is also Btrfs. # mount /dev/seed /mnt # btrfs device add /dev/sprout /mnt # mount -o remount,rw /mnt # btrfs device remove /dev/seed /mnt This results in replication happening, from seed to sprout device. Future feature to consider for the latter case, maybe it's more of an optimization: - ability to create btrfs subvolumes - ability to 'cp -a --reflink' Possibly an unaware installer just copies files over blindly. But then the "repart" task is to create a snapshottable layout after the fact without having to recopy everything. > On first boot, "systemd-repart" would run to add in swap or so, maybe > scaled by RAM size or so, and maybe format and encrypt /home. 
Or add in zram-generator configuration if the generator is available, and not worry about creating a persistent device. -- Chris Murphy
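For what it's worth, the Format=/Encrypt= ideas discussed in this thread later landed as repart.d options. A hypothetical drop-in for the "add swap on first boot" case (names and sizes illustrative; consult repart.d(5) for the version you run):

```
# /usr/lib/repart.d/50-swap.conf
[Partition]
Type=swap
Format=swap
SizeMinBytes=1G
SizeMaxBytes=4G
```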
[systemd-devel] location of user-1000.journal
Hi, I'm wondering if user journals are better being located in ~/.var by default? In particular in a systemd-homed context when ~/ is encrypted. -- Chris Murphy
Re: [systemd-devel] homed, LUKS2 passphrase encoding, and recovery key
Thanks for the answer, it's very useful. When I asked the question, I didn't fully appreciate the cryptographic and anti-forensic capabilities in LUKS that almost certainly should not be re-implemented elsewhere. I'd like to better understand what it would take to support UTF-8 passphrases for LUKS (luksFormat, luksOpen). Consistently and reliably, in a portable user home context. Of course the keyboard could change. The locale could too, and thus the default language of the host system could be different. That's the short version. Everything below this line is a super verbose explanation of how I'm arriving at the above. I assume users want their login passphrase to use local characters. Germans should of course be allowed to create a login passphrase using characters with umlauts; Japanese should of course be allowed to create passphrases using kanji. And so on. I further assume that this same login passphrase is what should be used for 'cryptsetup luksFormat/luksOpen' in order to *avoid* more indirection, and being forced to invent new crypto, which entails a lot of work and risk: security and interoperability. Many users are conditioned to accept a restriction to the 95 printable characters of 7-bit ASCII for a LUKS passphrase. That's because the typical workflow demands volume unlocking in an environment with a limited input stack (initramfs and plymouth). But I assume a global user isn't prepared for, and shouldn't have to accept, such a limitation for their login password. And in a systemd-homed context, if the login password is what's handed off to cryptsetup, then the LUKS passphrase is the same string and likewise cannot be limited to ASCII. So the question comes full circle. What are all of the things that can affect the encoding between the user's passphrase as it exists in their mind, and as handed off to cryptsetup? How to store that metadata? That way it can be tested against, and used to provide the user with helpful hints about why authentication isn't succeeding. 
Right now it's one size fits all. There's no difference in the error between a wrong passphrase (this user is not authentic) and an encoding failure due to a keyboard change, or a keymapping change, or whatever else can affect encoding. Small problem? :D -- Chris Murphy
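One concrete way the "same passphrase, different bytes" failure happens is Unicode normalization. A small self-contained illustration (the example word is arbitrary):

```python
# Two strings that both render as "glück" but differ in code points, and
# therefore in the bytes that would be handed to cryptsetup for hashing.
import unicodedata

composed = "gl\u00fcck"        # U+00FC: precomposed u-with-diaeresis (NFC)
decomposed = "glu\u0308ck"     # 'u' + U+0308 combining diaeresis (NFD)

print(composed == decomposed)                                  # False
print(composed.encode("utf-8") == decomposed.encode("utf-8"))  # False
print(unicodedata.normalize("NFC", decomposed) == composed)    # True
```

An input stack that emits NFC at enrollment and NFD at unlock (or vice versa) produces exactly the indistinguishable "wrong passphrase" error described above.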
[systemd-devel] homed, LUKS2 passphrase encoding, and recovery key
I stumbled onto a LUKS2 keymapping story on the dm-crypt list [1] that nearly ended in user data loss. The two suggestions for how to avoid such problems are to use either ASCII or modhex based passphrases. [2] I'm curious about whether this is something homed can help deal with: users who want to use a single login+encryption passphrase in their native language, keyboard mapping, and character set (likely UTF-8). Or otherwise, enforce limits on the passphrase characters so that the user doesn't unwittingly get stuck unable to access their data or even log in. The implementation details don't really concern me. But I am interested in whether there's a role for homed to play; or some other component; and maybe even time frame for a near term policy vs down the road enhancement. Maybe near term just enforce ASCII or modhex (or make it configurable)? That's very biased against non-Latin languages, which I don't like, but I'd rather see that restriction enforced than users getting dumped on an island where they may very well just give up and lose data over it. A longer term strategy is for homed to add another layer of indirection where the LUKS2 passphrase is random and modhex based, and is itself encrypted, protected by a KEK based on the user's passphrase, where all the things that can affect the encoding of that passphrase are included in the identity metadata. That way if that state differs, the user can be informed or even given a hint what aspect of the system has changed compared to when the passphrase was originally set. Also, somewhat unrelated, is if homed can provide a mechanism for setting up a recovery key for LUKS2 images? I'm not fussy on whether this would or should be something a desktop first boot program (or create new user panel) would opt into, or opt out of. 
I think it'd be sane if homed just always created one, reporting the random passphrase by varlink to the desktop's setup/create user program, and if that program wants to drop this information - well whatever. But the DE could offer to save it to a USB stick, display it for manual recording or screenshotting, or display it as a printable+scannable code. Or perhaps a variation on this for yubikey setup is the option to set up two keys. [1] the setup https://www.saout.de/pipermail/dm-crypt/2019-December/006279.html the cause https://www.saout.de/pipermail/dm-crypt/2019-December/006283.html [2] modhex https://www.saout.de/pipermail/dm-crypt/2019-December/006285.html ASCII, although actually personally excludes upper case https://www.saout.de/pipermail/dm-crypt/2019-December/006287.html -- Chris Murphy
Re: [systemd-devel] perform fsck on everyt boot
On Wed, Nov 20, 2019, 11:58 PM Belisko Marek wrote: > On Thu, Nov 21, 2019 at 7:25 AM Chris Murphy > wrote: > > > > On Tue, Nov 12, 2019 at 3:52 AM Belisko Marek > wrote: > > > > > > On Mon, Nov 11, 2019 at 4:47 PM Lennart Poettering > > > wrote: > > > > > > > > On Mo, 11.11.19 13:33, Belisko Marek (marek.beli...@gmail.com) > wrote: > > > > > Hi, > > > > > > > > > > I'm using systemd 234 (build by yocto) and I've setup automount of > > > > > sdcard in fstab. This works perfectly fine. But I have seen from time > > > > > to time when system goes to emergency mode because sdcard > filesystem > > > > > (ext4) has an issue and cannot be mounted. I was thinking about > > > > > forcing fsck for every boot. Reading manual it should be enough to > set > > > > > passno (6th column in fstab) to anything higher than 0. I set it > to 2 > > > > > but inspecting logs it doesn't seem fsck is performed. Am I still > > > > > missing something? Thanks. > > > > > > > > Well, note that ext4's fsck only does an actual file system check > > > > every now and then. Hence: how did you determine fsck wasn't started? > > > > > > > > Do you see the relevant fsck in "systemctl -t service | grep > > > > systemd-fsck@"? > > > I just saw in log: > > > [ OK ] Found device /dev/mmcblk1p1. > > > Mounting /mnt/sdcard... > > > [8.339072] EXT4-fs (mmcblk1p1): VFS: Found ext4 filesystem with > > > invalid superblock checksum. Run e2fsck? > > > [FAILED] Failed to mount /mnt/sdcard. > > > > This isn't normal. Your effort should be on finding out why this > > problem is happening in the first place. This doesn't strike me as the > > (somewhat) ordinary case of unclean unmount, which results in journal > > replay at next mount attempt. But something considerably more serious. > Problem is it's very hard to reproduce, and this is not rootfs, just an > external SDcard for storing some data. 
> If I hit this system goes to emergency mode and device is dead and I > would like to prevent that in the first place. > IMO fsck should help to recover this issue and should continue without > issues. Thanks. > Possibly adding the nofail option in fstab will prevent startup from going into rescue.target. I'm skeptical of unattended use of fsck. That's what journal replay is for, and if replay can't fix the problem, then the underlying problem needs to be fixed rather than papered over with fsck. You might consider testing this SDCard with f3, which will check for corruption and fake flash. Reformat, mount, f3write /mountpoint, f3read /mountpoint. I don't trust consumer SDCards for anything. I've had name brand stuff fail. The systemd journal should show evidence of either umount success or failure for this SDCard on restart or shutdown. Do the corruptions only happen on shutdown, or on both shutdown and restart? SDCards can get really fussy, exhibiting corruptions, or just brick themselves, when power is removed while writes are still happening internally. Cheaper flash may be slower to flush to stable media. You can give it more time by manually unmounting this SDCard before reboot or shutdown. Chris Murphy
Re: [systemd-devel] perform fsck on everyt boot
On Tue, Nov 12, 2019 at 3:52 AM Belisko Marek wrote: > > On Mon, Nov 11, 2019 at 4:47 PM Lennart Poettering > wrote: > > > > On Mo, 11.11.19 13:33, Belisko Marek (marek.beli...@gmail.com) wrote: > > > Hi, > > > > > > I'm using systemd 234 (build by yocto) and I've setup automount of > > > sdcard in fstab. This works perfectly fine. But I have seen from time > > > to time when system goes to emergency mode because sdcard filesystem > > > (ext4) has an issue and cannot be mounted. I was thinking about > > > forcing fsck for every boot. Reading manual it should be enough to set > > > passno (6th column in fstab) to anything higher than 0. I set it to 2 > > > but inspecting logs it doesn't seem fsck is performed. Am I still > > > missing something? Thanks. > > > > Well, note that ext4's fsck only does an actual file system check > > every now and then. Hence: how did you determine fsck wasn't started? > > > > Do you see the relevant fsck in "systemctl -t service | grep > > systemd-fsck@"? > I just saw in log: > [ OK ] Found device /dev/mmcblk1p1. > Mounting /mnt/sdcard... > [8.339072] EXT4-fs (mmcblk1p1): VFS: Found ext4 filesystem with > invalid superblock checksum. Run e2fsck? > [FAILED] Failed to mount /mnt/sdcard. This isn't normal. Your effort should be on finding out why this problem is happening in the first place. This doesn't strike me as the (somewhat) ordinary case of unclean unmount, which results in journal replay at next mount attempt. But something considerably more serious. -- Chris Murphy
Re: [systemd-devel] RFC: luksSuspend support in sleep/sleep.c
On Fri, Nov 1, 2019 at 2:31 PM Matthew Garrett wrote: > > The initrd already contains a UI stack in order to permit disk unlock > at boot time, so this really doesn't seem like a problem? It's a very small and limited UI stack. According to the GNOME developers I've discussed it with, this environment has all kinds of a11y, i18n, and keymapping problems. Solving them means either baking a significant portion of the GNOME stack into the initramfs, or some kind of magic, because the resources don't exist to create a minimized GNOME stack that could achieve this. And so far the effort has been to make the initramfs smaller and more generic. I have no idea how either Apple or Microsoft solves this problem. -- Chris Murphy
Re: [systemd-devel] RFC: luksSuspend support in sleep/sleep.c
On Thu, Oct 31, 2019 at 4:55 PM Lennart Poettering wrote: > Hmm, so far this all just worked for me, I didn't run into any trouble > with suspending just $HOME? What about /var and /home sharing the same volume? I'm pretty sure the default layout for Fedora Silverblue is a separate var volume, mounted at /var, with /var/home bind mounted to /home. -- Chris Murphy
[systemd-devel] systemd-analyze shows long firmware time with sd-boot
Hi, systemd-243-2.gitfab6f01.fc31.x86_64 systemd-analyze reports very high firmware times, seemingly related to uptime since the last cold boot. That is, if I poweroff then boot, the time for firmware seems reasonable; whereas after reboots, the firmware value alone appears to accumulate the elapsed uptime. This is in a qemu-kvm VM, using edk2-ovmf-20190501stable-4.fc31.noarch reboot (warm boot) $ systemd-analyze Startup finished in 8min 2.506s (firmware) + 12ms (loader) + 1.414s (kernel) + 1.383s (initrd) + 8.962s (userspace) = 8min 14.278s graphical.target reached after 8.946s in userspace [chris@localhost-live ~]$ # systemd-analyze Startup finished in 18min 13.786s (firmware) + 11ms (loader) + 1.420s (kernel) + 1.361s (initrd) + 8.855s (userspace) = 18min 25.434s graphical.target reached after 8.840s in userspace [root@localhost-live ~]# # systemd-analyze Startup finished in 20min 22.976s (firmware) + 11ms (loader) + 1.410s (kernel) + 1.360s (initrd) + 8.801s (userspace) = 20min 34.560s graphical.target reached after 8.790s in userspace [root@localhost-live ~]# # systemd-analyze Startup finished in 51min 8.018s (firmware) + 12ms (loader) + 1.415s (kernel) + 1.370s (initrd) + 8.836s (userspace) = 51min 19.653s graphical.target reached after 8.821s in userspace [root@localhost-live ~]# poweroff+boot (cold boot) # systemd-analyze Startup finished in 2.402s (firmware) + 15ms (loader) + 1.498s (kernel) + 1.358s (initrd) + 8.723s (userspace) = 13.998s graphical.target reached after 8.709s in userspace [root@localhost-live ~]# -- Chris Murphy
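Sanity-checking the report format: the printed total is just the sum of the stages, so the inflated totals come entirely from the firmware term. A quick arithmetic check of the first warm-boot line (plain arithmetic, nothing systemd-specific):

```python
# stage durations from the first warm-boot report above, in seconds
stages = {
    "firmware": 8 * 60 + 2.506,
    "loader": 0.012,
    "kernel": 1.414,
    "initrd": 1.383,
    "userspace": 8.962,
}

total = sum(stages.values())
reported_total = 8 * 60 + 14.278  # "= 8min 14.278s"

# the stage sum matches the reported total to within rounding
print(abs(total - reported_total) < 0.01)  # True
```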
Re: [systemd-devel] sd-boot kickstart
On Tue, Oct 1, 2019 at 1:05 AM Damian Ivanov wrote: > > Hello, > > I watched the video and presentation > https://cfp.all-systems-go.io/media/sdboot-asg2019.pdf > I could not agree more! Anaconda/Kickstart install grub as the > bootloader. Is there some hidden option to use sd-boot instead or is > it necessary to install sd-boot manually after the OS is deployed? I only see extlinux as an alternative to grub: https://pykickstart.readthedocs.io/en/latest/ I think it's a question for Anaconda developers how to support sd-boot: https://www.redhat.com/mailman/listinfo/anaconda-devel-list -- Chris Murphy
[systemd-devel] systemd243rc2, sysd-coredump is not triggered on segfaults
Maybe it's something unique to gnome-shell segfaults; that's the only thing I have crashing right now. But I've got a pretty good reproducer that gets it to crash, and I never have any listings with coredumpctl. process segfaults but systemd-coredump does not capture it https://bugzilla.redhat.com/show_bug.cgi?id=1748145 -- Chris Murphy
Re: [systemd-devel] systemd backlight:acpi_video0 fails, no such device
On Mon, Sep 2, 2019 at 1:56 AM Hans de Goede wrote: > > Hi, > > On 02-09-19 07:17, Mantas Mikulėnas wrote: > > On Mon, Sep 2, 2019 at 7:34 AM Chris Murphy > <mailto:li...@colorremedies.com>> wrote: > > > > systemd-243~rc2-2.fc31.x86_64 > > kernel-5.3.0-0.rc6.git1.1.fc32.x86_64 > > > > This might be a regression, at least I don't remember this happening > > before. I can use the expected keys for built-in display brightness, > > and built-in keyboard brightness. But the service unit fails with an > > out of the box installation. > > > > > > [chris@fmac ~]$ sudo systemctl status > > systemd-backlight@backlight:acpi_video0.service > > ● systemd-backlight@backlight:acpi_video0.service - Load/Save Screen > > Backlight Brightness of backlight:acpi_video0 > > Loaded: loaded (/usr/lib/systemd/system/systemd-backlight@.service; > > static; vendor preset: disabled) > > Active: failed (Result: exit-code) since Sun 2019-09-01 19:57:37 > > MDT; 8min ago > > Docs: man:systemd-backlight@.service(8) > >Process: 667 ExecStart=/usr/lib/systemd/systemd-backlight load > > backlight:acpi_video0 (code=exited, status=1/FAILURE) > > Main PID: 667 (code=exited, status=1/FAILURE) > > > > Sep 01 19:57:37 fmac.local systemd[1]: Starting Load/Save Screen > > Backlight Brightness of backlight:acpi_video0... > > Sep 01 19:57:37 fmac.local systemd-backlight[667]: Failed to get > > backlight or LED device 'backlight:acpi_video0': No such device > > Sep 01 19:57:37 fmac.local systemd[1]: > > systemd-backlight@backlight:acpi_video0.service: Main process exited, > > code=exited, status=1/FAILURE > > Sep 01 19:57:37 fmac.local systemd[1]: > > systemd-backlight@backlight:acpi_video0.service: Failed with result > > 'exit-code'. > > Sep 01 19:57:37 fmac.local systemd[1]: Failed to start Load/Save > > Screen Backlight Brightness of backlight:acpi_video0. 
> > [chris@fmac ~]$ > > > > # find /sys -name "*video0*" > > /sys/class/video4linux/video0 > > /sys/devices/pci:00/:00:1a.7/usb1/1-2/1-2:1.0/video4linux/video0 > > # ls -l /sys/class/backlight/ > > total 0 > > lrwxrwxrwx. 1 root root 0 Sep 1 19:57 gmux_backlight -> > > ../../devices/pnp0/00:03/backlight/gmux_backlight > > lrwxrwxrwx. 1 root root 0 Sep 1 19:57 intel_backlight -> > > > > ../../devices/pci:00/:00:02.0/drm/card0/card0-LVDS-1/intel_backlight > > > > > > Could it be that acpi_backlight is loaded at first, but gets replaced by > > intel_backlight before systemd could react? > > Maybe, the gmux_backlight suggests that this is a macbook. It is a 2011 MacBook Pro. -- Chris Murphy
[systemd-devel] systemd backlight:acpi_video0 fails, no such device
systemd-243~rc2-2.fc31.x86_64 kernel-5.3.0-0.rc6.git1.1.fc32.x86_64 This might be a regression, at least I don't remember this happening before. I can use the expected keys for built-in display brightness, and built-in keyboard brightness. But the service unit fails with an out of the box installation. [chris@fmac ~]$ sudo systemctl status systemd-backlight@backlight:acpi_video0.service ● systemd-backlight@backlight:acpi_video0.service - Load/Save Screen Backlight Brightness of backlight:acpi_video0 Loaded: loaded (/usr/lib/systemd/system/systemd-backlight@.service; static; vendor preset: disabled) Active: failed (Result: exit-code) since Sun 2019-09-01 19:57:37 MDT; 8min ago Docs: man:systemd-backlight@.service(8) Process: 667 ExecStart=/usr/lib/systemd/systemd-backlight load backlight:acpi_video0 (code=exited, status=1/FAILURE) Main PID: 667 (code=exited, status=1/FAILURE) Sep 01 19:57:37 fmac.local systemd[1]: Starting Load/Save Screen Backlight Brightness of backlight:acpi_video0... Sep 01 19:57:37 fmac.local systemd-backlight[667]: Failed to get backlight or LED device 'backlight:acpi_video0': No such device Sep 01 19:57:37 fmac.local systemd[1]: systemd-backlight@backlight:acpi_video0.service: Main process exited, code=exited, status=1/FAILURE Sep 01 19:57:37 fmac.local systemd[1]: systemd-backlight@backlight:acpi_video0.service: Failed with result 'exit-code'. Sep 01 19:57:37 fmac.local systemd[1]: Failed to start Load/Save Screen Backlight Brightness of backlight:acpi_video0. [chris@fmac ~]$ # find /sys -name "*video0*" /sys/class/video4linux/video0 /sys/devices/pci:00/:00:1a.7/usb1/1-2/1-2:1.0/video4linux/video0 # ls -l /sys/class/backlight/ total 0 lrwxrwxrwx. 1 root root 0 Sep 1 19:57 gmux_backlight -> ../../devices/pnp0/00:03/backlight/gmux_backlight lrwxrwxrwx. 
1 root root 0 Sep 1 19:57 intel_backlight -> ../../devices/pci:00/:00:02.0/drm/card0/card0-LVDS-1/intel_backlight # find /sys -name "*acpi*" /sys/kernel/debug/acpi /sys/bus/platform/drivers/acpi-fan /sys/bus/platform/drivers/axp288_pmic_acpi /sys/bus/acpi /sys/bus/acpi/drivers/acpi_als /sys/firmware/acpi /sys/module/rtc_cmos/parameters/use_acpi_alarm /sys/module/acpi_als /sys/module/industrialio/holders/acpi_als /sys/module/pci_hotplug/parameters/debug_acpi /sys/module/kfifo_buf/holders/acpi_als /sys/module/acpiphp /sys/module/libata/parameters/noacpi /sys/module/libata/parameters/acpi_gtf_filter /sys/module/acpi /sys/module/acpi/parameters/acpica_version OK, so maybe the expected hook for discovering the brightness, in order to save or load it, simply isn't there? *shrug* -- Chris Murphy
[systemd-devel] shutdown on service unit timeout?
Hi, Is it possible for a systemd service file to ask for a poweroff upon service timeout? If not, could it be done, or is there an alternative? Here's the use case: No Screensaver/Powerdown after Inactivity at LUKS Password Prompt https://bugzilla.redhat.com/show_bug.cgi?id=1742953 The summary is: plymouth waits indefinitely with a prompt for a passphrase, which leads to excessive power consumption, including battery drain if it's a laptop (it'll wait until the battery dies), and screen burn-in. This can happen unattended if e.g. Fedora is the default boot, but the user dual boots Windows, which has a tendency to wake up, do updates at "offline" times, and reboot... into Fedora, where it waits indefinitely for a LUKS passphrase. I'm sure there are other examples. Plausibly anything that hangs during startup would have this behavior; only once we're at gdm (or the equivalent on other DEs) is there a timer that will at least blank the screen, and possibly also optionally trigger suspend to RAM. Alternatively, instead of a per-service opt-in method, systemd itself could opt into a "power off after X minutes unless Y process reports it started successfully" type of behavior. In any case, it's up to the distro to decide the policy, with a way for the user to opt out by setting the applicable timeout value to something like 0, to indicate they really want an indefinite wait. Thanks, -- Chris Murphy
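systemd already has a per-unit mechanism that comes close to this: JobTimeoutSec= and JobTimeoutAction= in systemd.unit(5), where the action can be poweroff. Whether it cleanly covers the plymouth/cryptsetup case I don't know, but a distro-policy sketch might look like the following drop-in (the path and the 30min value are illustrative, not a tested configuration):

```
# hypothetical drop-in: /etc/systemd/system/systemd-cryptsetup@.service.d/timeout-poweroff.conf
[Unit]
# If the unlock job has not completed after 30 minutes, power off.
JobTimeoutSec=30min
JobTimeoutAction=poweroff
# A user who wants the indefinite wait back would set JobTimeoutSec=infinity.
```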
Re: [systemd-devel] startup hang at 'load/save random seed'
[ 10.281769] fmac.local systemd[1]: Starting Update UTMP about System Boot/Shutdown... [ 10.295504] fmac.local audit[806]: SYSTEM_BOOT pid=806 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg=' comm="systemd-update-utmp" exe="/usr/lib/systemd/systemd-update-utmp" hostname=? addr=? terminal=? res=success' [ 10.305289] fmac.local systemd[1]: Started Update UTMP about System Boot/Shutdown. [ 10.305527] fmac.local audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-update-utmp comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success' [ 15.264423] fmac.local systemd[1]: systemd-rfkill.service: Succeeded. [ 15.268231] fmac.local audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-rfkill comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success' [ 286.296649] fmac.local kernel: random: crng init done [ 286.301223] fmac.local kernel: random: 7 urandom warning(s) missed due to ratelimiting [ 286.319857] fmac.local systemd[1]: Started Load/Save Random Seed. [ 286.322850] fmac.local audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-random-seed comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success' [ 286.323576] fmac.local systemd[1]: Reached target System Initialization. I don't know why there's ratelimiting on urandom warnings, I have printk.devkmsg=on This also seems relevant. [chris@fmac ~]$ sudo journalctl -b -o short-monotonic | grep -i seed [8.870985] fmac.local systemd[1]: Starting Load/Save Random Seed... [9.021818] fmac.local systemd-random-seed[619]: Kernel entropy pool is not initialized yet, waiting until it is. [ 286.319857] fmac.local systemd[1]: Started Load/Save Random Seed. 
[ 286.322850] fmac.local audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-random-seed comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success' [chris@fmac ~]$ -- Chris Murphy
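From the short-monotonic timestamps in the excerpt above, the wait for the entropy pool works out to about 4.6 minutes, which is consistent with the 4-5 minute hang described in this thread (trivial arithmetic on the two log lines):

```python
# monotonic timestamps (seconds) copied from the journal excerpt above
waiting_since = 9.021818     # "Kernel entropy pool is not initialized yet, waiting..."
crng_init_done = 286.296649  # "random: crng init done"

stall = crng_init_done - waiting_since
print(f"{stall:.1f} s = {stall / 60:.1f} min blocked on entropy")  # 277.3 s = 4.6 min
```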
[systemd-devel] startup hang at 'load/save random seed'
This is a new problem I'm seeing just today on Fedora Rawhide 5.3.0-0.rc3.git0.1.fc31.x86_64+debug systemd-243~rc1-1.fc31.x86_64 The problem doesn't happen when reverting to systemd-242-6.git9d34e79.fc31.x86_64 The hang lasts about 4-5 minutes, then boot proceeds. Or if I head to the early-debug shell and start typing, it almost immediately clears up and boot proceeds. -- Chris Murphy
Re: [systemd-devel] systemd.journald.forward_to doesn't forward all journal messages
On Thu, Aug 1, 2019 at 12:43 AM Stefan Tatschner wrote: > > On Wed, 2019-07-31 at 16:27 -0600, Chris Murphy wrote: > > The Obi Wan quote seems to apply here. "Who's the more foolish, the > > fool, or the fool who follows him?" > > > > You're invited to stop following me. > > Calm down folks. I am certain he wanted to be sure that's not an xy > problem: http://xyproblem.info/ > > Sometimes it makes sense to ask these questions in order to potentially > save everyone's time. Maybe without the "ass" word, but anyway… Could be. X = systemd-journald captured and recorded messages in system.journal; for any number of reasons this might not be available if early boot/startup crashes. Y = what I'd accept as a fallback to 'journalctl' is the complete "pretty" version of what systemd records, obtained by forwarding the journal to either kmsg or the console. The problem I describe is that the contents of the forwarding don't match up with what's in the journal, and it seems like a bug or missing functionality that so many messages (roughly 1/3) just don't get forwarded. For example, set rd.debug, and then boot. Only a few early dracut debug messages are forwarded to console or kmsg; the overwhelming bulk of rd.debug messages are simply dropped. But they're in the journal. So if I don't have the journal, and they're not forwarded, it's a problem to debug an early startup failure. And I freely admit that it's rare indeed that both the journal is unavailable and there are no useful hints at all forwarded to console or kmsg. But I'm not even sure whether it's expected that most of the journal messages should be forwarded with the forward parameter. -- Chris Murphy
Re: [systemd-devel] systemd.journald.forward_to doesn't forward all journal messages
On Wed, Jul 31, 2019 at 12:30 PM Greg Oliver wrote: > I do not mean to sound like an ass here - especially since you have spent > hours of time ripping this stuff apart on Fedora live media (I actually > thought initially since you were digging so deep, you worked there), but I > mean really - what is the point? That you do not understand the point or lack the imagination for a point doesn't mean there isn't a point. > Are you planning on using a Fedora Live Media image as production? If so, > you clearly have not encountered all the stuff that usually does not work on > live media and should create your own anyhow. The forwarding problem is reproducible on installed systems. > I apologize, but I just do not get the effort spent here by anyone. The Obi Wan quote seems to apply here. "Who's the more foolish, the fool, or the fool who follows him?" You're invited to stop following me. -- Chris Murphy
Re: [systemd-devel] systemd.journald.forward_to doesn't forward all journal messages
On Mon, Jul 29, 2019 at 1:26 AM Lennart Poettering wrote: > > On So, 28.07.19 22:11, Chris Murphy (li...@colorremedies.com) wrote: > > > Using either of the following: > > > > systemd.log_level=debug systemd.journald.forward_to_kmsg log_buf_len=8M > > > > systemd.log_level=debug systemd.log_target=kmsg log_buf_len=8M > > Note that this is not sufficient. You also have to pass > "printk.devkmsg=on" too, otherwise the kernel ratelimits log output > from usperspace ridiculously a lot, and you will see lots of dropped > messages. > > I have documented this now here: > > https://github.com/systemd/systemd/pull/13208 BOOT_IMAGE=/images/pxeboot/vmlinuz root=live:CDLABEL=Fedora-WS-Live-rawh-20190728-n-1 rd.live.image systemd.wants=zram-swap.service systemd.log_level=debug systemd.journald.forward_to_kmsg log_buf_len=8M printk.devkmsg=on Many messages I see in the journal still do not appear in kmsg. For example from /dev/kmsg 6,20619,201107529,-;zram: Cannot change disksize for initialized device 12,23154,208596765,-;org.fedoraproject.Anaconda.Modules.Network[2498]: DEBUG:anaconda.modules.network.network:Applying boot options KernelArguments([('BOOT_IMAGE', '/images/pxeboot/vmlinuz'), ('root', 'live:CDLABEL=Fedora-WS-Live-rawh-20190728-n-1'), ('rd.live.image', None), ('systemd.wants', 'zram-swap.service'), ('systemd.log_level', 'debug'), ('systemd.journald.forward_to_kmsg', None), ('log_buf_len', '8M'), ('printk.devkmsg', 'on')]) 12,25049,210822858,-;org.fedoraproject.Anaconda.Modules.Storage[2498]: DEBUG:anaconda.modules.storage.disk_selection.selection:Protected devices are set to '['/dev/zram0']'. ^C [root@localhost-live liveuser]# journalctl -o short-monotonic | grep zram [ 203.224915] localhost-live systemd[1477]: Added job dev-zram0.device/nop to transaction. 
[ 203.225017] localhost-live systemd[1477]: dev-zram0.device: Installed new job dev-zram0.device/nop as 295 [ 203.225143] localhost-live systemd[1477]: Added job sys-devices-virtual-block-zram0.device/nop to transaction. [ 203.225245] localhost-live systemd[1477]: sys-devices-virtual-block-zram0.device: Installed new job sys-devices-virtual-block-zram0.device/nop as 296 [ 203.225355] localhost-live systemd[1477]: sys-devices-virtual-block-zram0.device: Job 296 sys-devices-virtual-block-zram0.device/nop finished, result=done [ 203.225570] localhost-live systemd[1477]: dev-zram0.device: Job 295 dev-zram0.device/nop finished, result=done [ 208.959944] localhost-live systemd[1477]: Added job dev-zram0.device/nop to transaction. [ 208.961015] localhost-live systemd[1477]: dev-zram0.device: Installed new job dev-zram0.device/nop as 340 [ 208.961324] localhost-live systemd[1477]: Added job sys-devices-virtual-block-zram0.device/nop to transaction. [ 208.961508] localhost-live systemd[1477]: sys-devices-virtual-block-zram0.device: Installed new job sys-devices-virtual-block-zram0.device/nop as 341 [ 208.961789] localhost-live systemd[1477]: sys-devices-virtual-block-zram0.device: Job 341 sys-devices-virtual-block-zram0.device/nop finished, result=done [ 208.962021] localhost-live systemd[1477]: dev-zram0.device: Job 340 dev-zram0.device/nop finished, result=done [ 209.822448] localhost-live systemd[1477]: Added job dev-zram0.device/nop to transaction. [ 209.822625] localhost-live systemd[1477]: dev-zram0.device: Installed new job dev-zram0.device/nop as 377 [ 209.822757] localhost-live systemd[1477]: Added job sys-devices-virtual-block-zram0.device/nop to transaction. 
[ 209.822861] localhost-live systemd[1477]: sys-devices-virtual-block-zram0.device: Installed new job sys-devices-virtual-block-zram0.device/nop as 378 [ 209.822983] localhost-live systemd[1477]: sys-devices-virtual-block-zram0.device: Job 378 sys-devices-virtual-block-zram0.device/nop finished, result=done [ 209.823106] localhost-live systemd[1477]: dev-zram0.device: Job 377 dev-zram0.device/nop finished, result=done [ 213.866820] localhost-live anaconda[2490]: blivet: DeviceTree.get_device_by_path: path: /dev/zram0 ; incomplete: False ; hidden: False ; [ 213.868392] localhost-live anaconda[2490]: blivet: failed to resolve '/dev/zram0' [root@localhost-live liveuser]# Literally zero of those lines appear in kmsg: 6,20619,201107529,-;zram: Cannot change disksize for initialized device 12,23154,208596765,-;org.fedoraproject.Anaconda.Modules.Network[2498]: DEBUG:anaconda.modules.network.network:Applying boot options KernelArguments([('BOOT_IMAGE', '/images/pxeboot/vmlinuz'), ('root', 'live:CDLABEL=Fedora-WS-Live-rawh-20190728-n-1'), ('rd.live.image', None), ('systemd.wants', 'zram-swap.service'), ('systemd.log_level', 'debug'), ('systemd.journald.forward_to_kmsg', None), ('log_buf_len', '8M'), ('printk.devkmsg', 'on')]) 12,25049,210822858,-;org.fedoraproject.Anaconda.Modules.Storage[2498]: DEBUG:anaconda.modules.storage.disk_selection.selection:Protected devices are set to '['/dev/zram0']'.
[systemd-devel] systemd.journald.forward_to doesn't forward all journal messages
Using either of the following: systemd.log_level=debug systemd.journald.forward_to_kmsg log_buf_len=8M systemd.log_level=debug systemd.log_target=kmsg log_buf_len=8M There's quite a lot of messages in the journal, but not in kmsg. As in, so many missing messages that the feature is nearly useless for debugging. Is it expected, or should I file a bug? In fact, when I use systemd.log_target=kmsg there are messages missing from both the journal and kmsg; when I do not use that parameter, the expected messages are in the journal. So it's like something is trying to forward a subset of messages to kmsg, but then they get dropped? I don't know how to debug this... -- Chris Murphy
Re: [systemd-devel] journald deleting logs on LiveOS boots
On Sun, Jul 21, 2019 at 11:48 PM Ulrich Windl wrote: > > >>> Chris Murphy schrieb am 18.07.2019 um 17:55 in > Nachricht > : > > On Thu, Jul 18, 2019 at 4:50 AM Uoti Urpala wrote: > >> > >> On Mon, 2019-07-15 at 14:32 -0600, Chris Murphy wrote: > >> > So far nothing I've tried gets me access to information that would > >> > give a hint why systemd-journald thinks there's no free space and yet > >> > it still decides to create a single 8MB system journal, which then > >> > almost immediately gets deleted, including all the evidence up to that > >> > point. > >> > >> Run journald under strace and check the results of the system calls > >> used to query space? (One way to run it under strace would be to change > >> the unit file to use "strace -D -o /run/output systemd-journald" as the > >> process to start.) > > > > It's a good idea but strace isn't available on Fedora live media. So I > > either have to learn how to create a custom live media locally (it's a > > really complicated process) or convince Fedora to add strace to live > > media... > > Wouldn't it be easier to scp the binary from a compatible system? What binary? The problem with the strace idea is that /run/output is 0 length. There's nothing to scp. For whatever reason strace creates /run/output but isn't writing anything to it. But based on this PR it looks like some aspect of this problem is understood by systemd developers. https://github.com/systemd/systemd/pull/13120 -- Chris Murphy
Re: [systemd-devel] journald deleting logs on LiveOS boots
On Fri, Jul 19, 2019 at 8:45 AM Uoti Urpala wrote: > > On Thu, 2019-07-18 at 21:52 -0600, Chris Murphy wrote: > > # df -h > > ... > > /dev/mapper/live-rw 6.4G 5.7G 648M 91% / > > > > And in the log: > > 47,19636,16754831,-;systemd-journald[905]: Fixed min_use=1.0M > > max_use=648.7M max_size=81.0M min_size=512.0K keep_free=973.1M > > n_max_files=100 > > > > Why is keep_free bigger than available free space? Is that the cause > > of the vacuuming? > > The default value for keep_free is the smaller of 4 GiB or 15% of total > filesystem size. Since the filesystem is small and has less than 15% > free, it's already over the default limit. Those defaults are defined > in src/journal/journal_file.c. When over the limit, journald still uses > at least DEFAULT_MIN_USE (increased to the initial size of journals on > disk if any). But it looks suspicious that this is 1 MiB while > FILE_SIZE_INCREASE is 8 MiB - doesn't this imply that any use at all > immediately goes over 1 MiB? > > You can probably work around the issue by setting a smaller > SystemKeepFree in journald.conf. I don't really need a workaround; this is live media and that file will be read-only. What I need is to understand exactly why the journal vacuuming is being triggered. Either that's simply not being recorded in the journal in the first place, and must be inferred from esoteric knowledge, or the reason is in the journal that has been vacuumed, and is thus lost. -- Chris Murphy
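Uoti's description of the defaults does reproduce the numbers in the log line. A small sketch (the exact filesystem size here is back-computed from the log's keep_free value, since df's "6.4G" is rounded):

```python
GIB = 1 << 30
MIB = 1 << 20

def default_keep_free(fs_size_bytes):
    # journald default per the description above: the smaller of
    # 4 GiB or 15% of the total filesystem size
    return min(4 * GIB, fs_size_bytes * 15 // 100)

fs_size = int(6487.3 * MIB)  # back-computed: keep_free / 0.15
available = 648 * MIB        # "648M" available per df

keep_free = default_keep_free(fs_size)
print(round(keep_free / MIB, 1))  # ~973.1, matching "keep_free=973.1M"
print(available < keep_free)      # True: free space is already below keep_free
```

So on this 91%-full live filesystem, journald starts out over its keep-free limit, which is consistent with the immediate vacuuming.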
Re: [systemd-devel] journald deleting logs on LiveOS boots
This is suspicious: # df -h ... /dev/mapper/live-rw 6.4G 5.7G 648M 91% / And in the log: 47,19636,16754831,-;systemd-journald[905]: Fixed min_use=1.0M max_use=648.7M max_size=81.0M min_size=512.0K keep_free=973.1M n_max_files=100 Why is keep_free bigger than available free space? Is that the cause of the vacuuming? 47,19867,16908013,-;systemd-journald[905]: /var/log/journal/05d0a9c86a0e4bbcb36c5e0082b987ee/system.journal: Allocation limit reached, rotating. 47,19868,16908029,-;systemd-journald[905]: Rotating... And then 47,27860,22417049,-;systemd-journald[905]: Vacuuming... 47,27861,22427712,-;systemd-journald[905]: Deleted archived journal /var/log/journal/05d0a9c86a0e4bbcb36c5e0082b987ee/system@daa6e38474b84afc8404527c7b204c24-0001-00058dff129d34ae.journal (8.0M). 47,27862,22427724,-;systemd-journald[905]: Vacuuming done, freed 8.0M of archived journals from /var/log/journal/05d0a9c86a0e4bbcb36c5e0082b987ee. That vacuuming event is the direct cause of the data loss. But does it happen because keep_free is greater than free space, and if so then why is keep_free greater than free space? Slightly off topic but why are there over 18000 (that's not a typo) of these identical lines? It seems excessive. 47,27869,22428300,-;systemd-journald[905]: Journal effective settings seal=no compress=yes compress_threshold_bytes=512B I've updated the bug report to include an attachment of the journal I've captured by forwarding to kmsg. https://bugzilla.redhat.com/show_bug.cgi?id=1715699#c17 Chris Murphy
Re: [systemd-devel] journald deleting logs on LiveOS boots
On Thu, Jul 18, 2019 at 4:36 PM Greg Oliver wrote: > I am assuming you have ripped apart the initramfs to see exactly how RedHat > is invoking systemd in the live images? I have not. No idea what I'd be looking for. # ps aux | grep systemd Installed system (Fedora 30): root 1 0.0 0.1 171444 14696 ?Ss 12:05 0:09 /usr/lib/systemd/systemd --switched-root --system --deserialize 18 Live system (Rawhide): root 1 0.8 0.4 171516 14708 ?Ss 19:57 0:07 /usr/lib/systemd/systemd --switched-root --system --deserialize 32 -- Chris Murphy
Re: [systemd-devel] journald deleting logs on LiveOS boots
On Thu, Jul 18, 2019 at 4:02 PM Chris Murphy wrote: > > On Thu, Jul 18, 2019 at 10:18 AM Dave Howorth wrote: > > > > On Thu, 18 Jul 2019 09:55:51 -0600 > > Chris Murphy wrote: > > > On Thu, Jul 18, 2019 at 4:50 AM Uoti Urpala > > > wrote: > > > > > > > > On Mon, 2019-07-15 at 14:32 -0600, Chris Murphy wrote: > > > > > So far nothing I've tried gets me access to information that would > > > > > give a hint why systemd-journald thinks there's no free space and > > > > > yet it still decides to create a single 8MB system journal, which > > > > > then almost immediately gets deleted, including all the evidence > > > > > up to that point. > > > > > > > > Run journald under strace and check the results of the system calls > > > > used to query space? (One way to run it under strace would be to > > > > change the unit file to use "strace -D -o /run/output > > > > systemd-journald" as the process to start.) > > > > > > It's a good idea but strace isn't available on Fedora live media. So I > > > either have to learn how to create a custom live media locally (it's a > > > really complicated process) or convince Fedora to add strace to live > > > media... > > > > I'm not a fedora user, but I don't think it's that difficult to run > > strace. > > > > To run it once, start your live image and type: > > > > # yum install strace > > > > You will need to reinstall it if you reboot. > > > > To permanently install it apparently you need to configure your USB > > with persistent storage. I haven't looked up how to do that. > > I thought about that, but this is a substantial alteration from the > original ISO in terms of the storage layout and how everything gets > assembled. But it's worth a shot. If it is a systemd bug, then it > should still reproduce. If it doesn't reproduce, then chances are it's > some kind of assembly related problem. > > Still seems like a systemd-journald bug that neither forward to > console nor to kmsg includes any useful systemd or dracut debugging. 
I used livecd-iso-to-disk to create the persistent boot media, and the problem does reproduce. But I needed a custom initramfs that has the modified systemd-journald.service unit file in it, as well as strace, because this problem happens that early in the startup process. Thing is, that custom initramfs blows up trying to assemble the persistent overlay. Looking at the compose logs for this image, I see a rather complex dracut command is needed: dracut --nomdadmconf --nolvmconf --xz --add livenet dmsquash-live convertfs pollcdrom qemu qemu-net --omit plymouth --no-hostonly --debug --no-early-microcode --force boot/initramfs-5.3.0-0.rc0.git4.1.fc31.x86_64.img 5.3.0-0.rc0.git4.1.fc31.x86_64 Fine. Now I can boot. The next gotcha is that /run/output is a 0-byte file. I don't know why. I've booted with enforcing=0 in case selinux doesn't like strace writing to /run/, but that hasn't made a difference. The change I made in /usr/lib/systemd/system/systemd-journald.service is ExecStart=/usr/bin/strace -D -o /run/output /usr/lib/systemd/systemd-journald I've also tried booting with the original initrd, which doesn't have the systemd-journald modification, relying only on the one on persistent storage. I still get /run/output as a 0-length file. I've tried booting with enforcing=0 in case selinux is denying the write, but that doesn't solve the problem. So I'm still stuck. -- Chris Murphy
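For what it's worth, instead of editing the unit file under /usr/lib inside the initramfs, the same wrapper can be expressed as a drop-in, which survives package updates and which dracut can pick up from /etc (the drop-in file name is arbitrary; whether this helps with the empty /run/output is a separate question):

```
# /etc/systemd/system/systemd-journald.service.d/strace.conf
[Service]
# Clear the unit's original ExecStart=, then re-set it wrapped in strace,
# using the same strace invocation as in the thread above.
ExecStart=
ExecStart=/usr/bin/strace -D -o /run/output /usr/lib/systemd/systemd-journald
```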
Re: [systemd-devel] journald deleting logs on LiveOS boots
On Thu, Jul 18, 2019 at 10:18 AM Dave Howorth wrote:
> > ...snip...
>
> I'm not a fedora user, but I don't think it's that difficult to run
> strace.
>
> To run it once, start your live image and type:
>
> # yum install strace
>
> You will need to reinstall it if you reboot.
>
> To permanently install it apparently you need to configure your USB
> with persistent storage. I haven't looked up how to do that.

I thought about that, but this is a substantial alteration from the original ISO in terms of the storage layout and how everything gets assembled. But it's worth a shot. If it is a systemd bug, then it should still reproduce. If it doesn't reproduce, then chances are it's some kind of assembly-related problem.

Still seems like a systemd-journald bug that neither forward to console nor to kmsg includes any useful systemd or dracut debugging.

-- Chris Murphy
Re: [systemd-devel] journald deleting logs on LiveOS boots
On Thu, Jul 18, 2019 at 4:50 AM Uoti Urpala wrote:
> On Mon, 2019-07-15 at 14:32 -0600, Chris Murphy wrote:
> > ...snip...
>
> Run journald under strace and check the results of the system calls
> used to query space? (One way to run it under strace would be to change
> the unit file to use "strace -D -o /run/output systemd-journald" as the
> process to start.)

It's a good idea, but strace isn't available on Fedora live media. So I either have to learn how to create a custom live medium locally (it's a really complicated process) or convince Fedora to add strace to live media...

-- Chris Murphy
Re: [systemd-devel] journald deleting logs on LiveOS boots
This problem of missing early boot messages is now happening on a default boot of Fedora Live images. I don't have to change any boot parameters to trigger it. This image:

https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20190715.n.1/compose/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-Rawhide-20190715.n.1.iso

Boot it in a VM, launch Terminal and:

[liveuser@localhost-live ~]$ sudo journalctl -o short-monotonic
[sudo] password for liveuser:
-- Logs begin at Tue 2019-07-16 11:58:09 EDT, end at Tue 2019-07-16 15:58:09 EDT. --
[ 24.017711] localhost-live systemd[1472]: Startup finished in 293ms.
[ 24.447054] localhost-live dbus-broker-launch[1496]: Service file '/usr/share/dbus-1/servi...

And neither in what's forwarded to console, nor in what's directed to kmsg, do I see /run/log or /var/log anywhere, so quite a lot of information is not being forwarded and is just lost because of this bug.

Further, I discovered a note in an earlier bug I filed in February 2019 about this problem: "OK I'm not seeing this with systemd-241~rc2-2.fc30.x86_64 in the 20190211 Live media". That leads me to believe this is a systemd regression, but I can't prove it because of this bug. I have no systemd debug messages available to look at...

Fedora Basic Release Criteria: "A system logging infrastructure must be available, enabled by default, and working."

I really think logging is not working, and this should block. But without a really specific cause + solution it's questionable to block.

--- Chris Murphy
Re: [systemd-devel] journald deleting logs on LiveOS boots
On Mon, Jul 15, 2019 at 2:32 PM Chris Murphy wrote:
>
> This is still a problem with systemd-242-5.git7a6d834.fc31.x86_64
>
> If I boot using 'systemd.journald.forward_to_console=1
> console=ttyS0,38400 console=tty1 systemd.log_level=debug rd.debug
> rd.udev.debug'
>
> There is no debug output forwarded to console, only kernel messages
> and normal systemd logging is forwarded.

Going to a working/installed system so I don't have to deal with this bug: I'm still finding all kinds of messages that appear on virt-manager's console, despite the forward-to-console boot param, that do not get forwarded to ttyS0 when I'm connected using 'virsh console $(vmname)'.

So?? Is that a bug? It seems like a bug. Example from "virsh console" output:

[ 25.009292] systemd[1]: Started NTP client/server.
[ 40.797229] input: spice vdagent tablet as /devices/virtual/input/input5

But on the virt-manager console, there's a pile of messages from 26 seconds through 40 seconds. Why are they missing?

-- Chris Murphy
Re: [systemd-devel] journald deleting logs on LiveOS boots
On Mon, Jul 15, 2019 at 2:32 PM Chris Murphy wrote:
>
> If I boot using 'systemd.log_level=debug rd.debug rd.udev.debug
> systemd.log_target=kmsg log_buf_len=64M printk.devkmsg=on'

Another data point: the kmsg dumped into a file is 5MiB, captured after the point in time when journald had done vacuuming on /var/log/journal, which was already 8+ MiB in size. So at least 3MiB of journal messages were not being sent to kmsg.

-- Chris Murphy
Re: [systemd-devel] journald deleting logs on LiveOS boots
This is still a problem with systemd-242-5.git7a6d834.fc31.x86_64

If I boot using 'systemd.journald.forward_to_console=1 console=ttyS0,38400 console=tty1 systemd.log_level=debug rd.debug rd.udev.debug'

There is no debug output forwarded to console; only kernel messages and normal systemd logging are forwarded. And of course this bug means that those debug messages are lost once the vacuuming happens.

If I boot using 'systemd.log_level=debug rd.debug rd.udev.debug systemd.log_target=kmsg log_buf_len=64M printk.devkmsg=on'

For sure a bunch of dracut messages are not being forwarded to kmsg; none of the rd.live.image debug stuff is listed, so I can't even see how things are being assembled for live boots, with time stamps, to see if that stuff might not be ready at the time journald switches from using /run/log to /var/log.

I can't even see in kmsg the journald switch from /run/log to /var/log. That itself seems like a bug: given systemd.log_target=kmsg, I'd like to think that should cause an exact copy of what's going to system.journal to be dumped to kmsg, but apparently that's not the case.

So far nothing I've tried gets me access to information that would give a hint why systemd-journald thinks there's no free space and yet it still decides to create a single 8MB system journal, which then almost immediately gets deleted, including all the evidence up to that point.

For sure sysroot and / are available rw by these points:

<31>[ 10.898648] systemd[1]: sysroot.mount: About to execute: /usr/bin/mount /dev/mapper/live-rw /sysroot
...
<31>[ 12.061370] systemctl[879]: Switching root - root: /sysroot; init: n/a

This is the loss of the journal up to this point:

<47>[ 24.318297] systemd-journald[905]: /var/log/journal/05d0a9c86a0e4bbcb36c5e0082b987ee/system.journal: Allocation limit reached, rotating.
<47>[ 24.318315] systemd-journald[905]: Rotating...
<47>[ 24.332853] systemd-journald[905]: Reserving 147626 entries in hash table.
<47>[ 24.367396] systemd-journald[905]: Vacuuming...
<47>[ 24.389952] systemd-journald[905]: Deleted archived journal /var/log/journal/05d0a9c86a0e4bbcb36c5e0082b987ee/system@2f2d06548b5f4c259693b56558cc89c6-0001-00058dbdb33d1f5e.journal (8.0M).
<47>[ 24.389965] systemd-journald[905]: Vacuuming done, freed 8.0M of archived journals from /var/log/journal/05d0a9c86a0e4bbcb36c5e0082b987ee.
<47>[ 24.390015] systemd-journald[905]: Journal effective settings seal=no compress=yes compress_threshold_bytes=512B
<47>[ 24.390126] systemd-journald[905]: Retrying write.

Retrying what write, and why does it need to retry? What failed?

-- Chris Murphy
Re: [systemd-devel] swap on zram service unit, using Conflicts=umount
OK, so with systemd.log_level=debug I'm seeing at shutdown that there is a swapoff:

[ 406.997210] fmac.local systemd[1]: dev-zram0.swap: About to execute: /sbin/swapoff /dev/zram0
[ 406.997844] fmac.local systemd[1]: dev-zram0.swap: Forked /sbin/swapoff as 1966
[ 406.998308] fmac.local systemd[1]: dev-zram0.swap: Changed active -> deactivating
[ 406.998332] fmac.local systemd[1]: Deactivating swap /dev/zram0...
[ 406.999489] fmac.local systemd[1966]: dev-zram0.swap: Executing: /sbin/swapoff /dev/zram0

This happens after /home and /boot/efi are umounted, but before /boot and / are umounted. This is not unexpected, but I like the idea of no swapoff at reboot if that's possible. Any system under sufficient memory and swap pressure could very well see an at least delayed shutdown if systemd shutdown/reboot actually waits for swapoff to exit successfully.

Is there any interest in a systemd-included swap-on-zram service, to obviate the need for others basically reinventing this wheel? I kinda put it in the category of fstrim.service and the existing systemd cryptswap logic. I expect it wouldn't be enabled by default, but distros could enable or start it as their use cases dictate.

Chris Murphy
Re: [systemd-devel] swap on zram service unit, using Conflicts=umount
On Tue, Jun 25, 2019 at 3:30 AM Zbigniew Jędrzejewski-Szmek wrote: > > On Tue, Jun 25, 2019 at 10:55:27AM +0200, Lennart Poettering wrote: > > On Mo, 24.06.19 13:16, Zbigniew Jędrzejewski-Szmek (zbys...@in.waw.pl) > > wrote: > > > > > > So for tmpfs mounts that don't turn off DefaultDependencies= we > > > > implicit add in an After=swap.target ordering dep. The thinking was > > > > that there's no point in swapping in all data of a tmpfs because we > > > > want to detach the swap device when we are going to flush it all out > > > > right after anyway. This made quite a difference to some folks. > > > > > > But we add Conflicts=umount.target, Before=umount.target, so we do > > > swapoff on all swap devices, which means that swap in the data after all. > > > Maybe that's an error, and we should remove this, at least for > > > normal swap partitions (not files)? > > > > We never know what kind of weird storage swap might be on, I'd > > probably leave that in, as it's really hard to figure out correctly > > when leaving swap on would be safe and when not. > > > > Or to say this differently: if people want to micro-optimize that, > > they by all means should, but in that case they should probably drop > > in their manually crafted .swap unit with DefaultDependencies=no and > > all the ordering in place they need, and nothing else. i.e. I believe > > this kind of optimization is nothing we need to cater for in the > > generic case when swap is configured with /etc/fstab or through GPT > > enumeration. > > Not swapping off would make a nice optimization. Maybe we should > invert this, and "drive" this from the other side: if we get a stop > job for the storage device, then do the swapoff. Then if there are > devices which don't need to stop, we wouldn't swapoff. This would cover > the common case of swap on partition. > > I haven't really thought about the details, but in principle this > should already work, if all the dependencies are declared correctly. 
I like the sound of this. The gotcha with current swap-on-zram units (there are a few floating out there, including this one) is that they conflate two different things: setup and teardown of the zram device, and swapon/swapoff. What's probably better and more maintainable would be a way to set up the zram device with a service unit, and then specify it in /etc/fstab as a swap device so that the usual systemd swapon/off behavior is used. Any opinion here?

(Somewhat off topic: I wish zswap was not still experimental. By enabling zswap with an ordinary swap device, it creates a memory pool (which you can define as a percentage of total RAM) that's compressed for swap before it hits the backing device. Basically, it's like a RAM cache for swap: it'll swap to memory first, and then overflow to a swap partition or file. It also lacks all the weird interfaces of zram.)

https://wiki.archlinux.org/index.php/Improving_performance#Zram_or_zswap

> > zswap is different: we know exactly that the swap data is located in
> > RAM, not on complex storage, hence it's entirely safe to not
> > disassemble it at all, iiuc.
>
> Agreed. It seems that any Conflicts= (including the one I proposed) are
> unnecessary/harmful.

OK, I'll revert the commit that inserts it.

-- Chris Murphy
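The setup/teardown split described above might look something like this; a rough sketch, not a tested recipe, where the unit name, the 2G size, and the reliance on /etc/fstab for activation are all assumptions:

```ini
# zram-setup.service (hypothetical name): only creates and formats the
# device; swapon/swapoff is left to systemd's normal .swap handling.
[Unit]
Description=Create zram device for swap

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/sbin/modprobe zram num_devices=1
ExecStart=/bin/sh -c 'echo 2G > /sys/block/zram0/disksize'
ExecStart=/sbin/mkswap /dev/zram0
# Teardown on stop: return the device to its initial state.
ExecStop=/bin/sh -c 'echo 1 > /sys/block/zram0/reset'
```

With that in place, /etc/fstab would carry an ordinary entry such as '/dev/zram0 none swap defaults 0 0', so the generated dev-zram0.swap unit handles swapon/swapoff with the usual ordering.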
Re: [systemd-devel] swap on zram service unit, using Conflicts=umount
On Mon, Jun 24, 2019 at 6:11 AM Lennart Poettering wrote:
> That said, I don't really grok zram, and not sure why there's any need
> to detach it at all. I mean, if at shutdown we lose compressed RAM
> or lose uncompressed RAM shouldn't really matter. Hence from my
> perspective there's no need for Conflicts= at all, but maybe I am
> missing something?

Huh, yeah. If anything, there could possibly be low-memory systems just barely getting by with swap on zram, where even swapoff at reboot time would cause them to get stuck. It might just be better to clobber it at reboot time?

I'd like to allow a user to 'systemctl stop zram', which does swapoff and removes the zram device. But is there something that could go into the unit file that says "don't wait for swapoff; if everything else is ready for shutdown, go ahead and reboot now"?

-- Chris Murphy
[systemd-devel] swap on zram service unit, using Conflicts=umount
Hi,

I've got a commit to add 'Conflicts=umount.target' to this zram service, based on a bug comment I cited in the commit message. But I'm not certain I understand if it's a good idea or necessary.

https://src.fedoraproject.org/fork/chrismurphy/rpms/zram/c/63900c455e8a53827aed697b9f602709b7897eb2?branch=devel

I figure it's plausible at shutdown time that something is swapped out, and a umount before swapoff could hang (briefly or indefinitely, I don't know), and therefore it's probably better to cause swapoff to happen before umount?

Thanks,

-- Chris Murphy
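For reference, the ordering in question comes down to two lines in the [Unit] section; a sketch of the mechanism rather than a recommendation either way. Conflicts= queues a stop job for the unit when umount.target starts at shutdown, and because stop jobs run in reverse Before=/After= order, Before=umount.target makes the stop (the swapoff) complete before the umounts run:

```ini
[Unit]
# Stop this unit (i.e. swapoff) when umount.target is queued at shutdown...
Conflicts=umount.target
# ...and ensure that stop finishes before umount.target's umounts run.
Before=umount.target
```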
Re: [systemd-devel] journald deleting logs on LiveOS boots
On Wed, Jun 19, 2019 at 6:15 AM Lennart Poettering wrote: > > On Di, 18.06.19 20:34, Chris Murphy (li...@colorremedies.com) wrote: > > > When I boot Fedora 30/Rawhide Workstation (LiveOS install media) in a > > VM with ~2GiB memory, using any combination of systemd, dracut, or > > udev debugging, events are missing from the journal. > > > > systemd-241-7.gita2eaa1c.fc30.x86_64 > > systemd-242-3.git7a6d834.fc31.x86_64 > > > > 'journalctl -b -o short-monotonic' starts at ~22s monotonic time. i.e. > > all the messages before that are not in the journal. This problem > > doesn't happen with netinstalls. Fedora LiveOS boots setup a memory > > based persistent overlay, and /var/log/journal exists and > > what do you mean by "memory based persistent overlay"? if its in > memory it's not persistant, is it? /me is confused... I agree it is an oxymoron. The device-mapper rw ext4 volume where all writes go is memory based, as the install media and the base image are read-only. Since this volume is rw, and it's mounted at /, that makes /var/log/journal rw, and appears to be persistent from the systemd-journald point of view, so it uses it instead of /run/log/journal [9.127691] localhost systemd[1]: Starting Flush Journal to Persistent Storage... > > is this LVM stuff or overlayfs? Neither, only device mapper. [root@localhost-live liveuser]# lvs [root@localhost-live liveuser]# vgs [root@localhost-live liveuser]# pvs [root@localhost-live liveuser]# dmsetup status live-base: 0 13635584 linear live-rw: 0 13635584 snapshot 163328/67108864 648 [root@localhost-live liveuser]# All of this is setup by dracut with the boot parameter 'rd.live.image' > > > systemd-journald tries to flush there, whereas on netinstalls > > /var/log/journal does not exist. > > > > Using systemd.log_target=kmsg I discovered that systemd-journald is > > deleting logs in the LiveOS case, but I don't know why, / has ~750M > > free > > > > [ 24.910792] systemd-journald[922]: Vacuuming... 
> > [ 24.921802] systemd-journald[922]: Deleted archived journal > > /var/log/journal/05d0a9c86a0e4bbcb36c5e0082b987ee/system@818f3f40f19849e08a1b37b9c1e304f1-0001-00058ba31bec725e.journal > > (8.0M). > > [ 24.921808] systemd-journald[922]: Vacuuming done, freed 8.0M of > > archived journals from > > /var/log/journal/05d0a9c86a0e4bbcb36c5e0082b987ee. > > > > I filed a bug here and have nominated it as a blocker bug for the > > F31 cycle. > > Note that journald logs early on how much disk space it assumes to > have free, you might see that in dmesg? (if not, boot with > systemd.journald.forward_to_kmsg=1 on the kernel cmdline) Using systemd.log_level=debug and systemd.journald.forward_to_kmsg=1 I only see: [4.224599] systemd-journald[327]: Fixed min_use=1.0M max_use=145.8M max_size=18.2M min_size=512.0K keep_free=218.7M n_max_files=100 If I boot with defaults (no debug options), I now see the following, which is otherwise lost (not in the journal and not in dmesg, despite systemd.journald.forward_to_kmsg=1) [9.124045] localhost systemd-journald[933]: Runtime journal (/run/log/journal/05d0a9c86a0e4bbcb36c5e0082b987ee) is 8.0M, max 146.6M, 138.6M free. ...snip... [9.179050] localhost systemd-journald[933]: Time spent on flushing to /var is 17.824ms for 936 entries. [9.179050] localhost systemd-journald[933]: System journal (/var/log/journal/05d0a9c86a0e4bbcb36c5e0082b987ee) is 8.0M, max 8.0M, 0B free. Why does it say 0B free? 
[root@localhost-live liveuser]# df -h
Filesystem           Size  Used Avail Use% Mounted on
devtmpfs             1.5G     0  1.5G   0% /dev
tmpfs                1.5G  4.0K  1.5G   1% /dev/shm
tmpfs                1.5G  1.1M  1.5G   1% /run
tmpfs                1.5G     0  1.5G   0% /sys/fs/cgroup
/dev/sr0             1.9G  1.9G     0 100% /run/initramfs/live
/dev/mapper/live-rw  6.4G  5.5G  879M  87% /
tmpfs                1.5G  4.0K  1.5G   1% /tmp
vartmp               1.5G     0  1.5G   0% /var/tmp
tmpfs                294M     0  294M   0% /run/user/0
tmpfs                294M  4.0K  294M   1% /run/user/1000
[root@localhost-live liveuser]# free -m
              total        used        free      shared  buff/cache   available
Mem:           2932         223        1754           1         954        2435
Swap:             0           0           0
[root@localhost-live liveuser]# cat /etc/systemd/journald.conf
# This file is part of systemd.
#
# systemd is free software; you can redistribute it and/or modify it
# under the terms of the GNU Lesser General Public License as published by
# the Free Software Foundation; either version 2.1 of the License, or
# (at your option) any later version.
#
# Entries in this file show the compile time defaults.
# You can change settings by ed
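As an aside, the limits journald computed in the debug output above (max_use, keep_free, max_size) correspond to journald.conf settings; an illustrative fragment, with example values rather than recommendations:

```ini
[Journal]
SystemMaxUse=100M       # overall cap for /var/log/journal ("max_use")
SystemKeepFree=200M     # space journald tries to leave free ("keep_free")
SystemMaxFileSize=16M   # per-journal-file cap ("max_size")
```

When unset, journald derives these from the size of the filesystem it writes to, which is what makes the "0B free" computation on the live overlay interesting.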
[systemd-devel] journald deleting logs on LiveOS boots
When I boot Fedora 30/Rawhide Workstation (LiveOS install media) in a VM with ~2GiB memory, using any combination of systemd, dracut, or udev debugging, events are missing from the journal.

systemd-241-7.gita2eaa1c.fc30.x86_64
systemd-242-3.git7a6d834.fc31.x86_64

'journalctl -b -o short-monotonic' starts at ~22s monotonic time, i.e. all the messages before that are not in the journal. This problem doesn't happen with netinstalls. Fedora LiveOS boots set up a memory-based persistent overlay, and /var/log/journal exists and systemd-journald tries to flush there, whereas on netinstalls /var/log/journal does not exist.

Using systemd.log_target=kmsg I discovered that systemd-journald is deleting logs in the LiveOS case, but I don't know why; / has ~750M free:

[ 24.910792] systemd-journald[922]: Vacuuming...
[ 24.921802] systemd-journald[922]: Deleted archived journal /var/log/journal/05d0a9c86a0e4bbcb36c5e0082b987ee/system@818f3f40f19849e08a1b37b9c1e304f1-0001-00058ba31bec725e.journal (8.0M).
[ 24.921808] systemd-journald[922]: Vacuuming done, freed 8.0M of archived journals from /var/log/journal/05d0a9c86a0e4bbcb36c5e0082b987ee.

I filed a bug and have nominated it as a blocker bug for the F31 cycle:

LiveOS boot, journalctl is missing many early messages
https://bugzilla.redhat.com/show_bug.cgi?id=1715699

Comments 8 and 9 are most relevant at this point (rather noisy troubleshooting before that).

A related question is whether it's even appropriate for /var/log/journal to exist on LiveOS boots, rather than just having journals go to /run/log/journal. Obviously we must have the entire journal for LiveOS boots, from monotonic time 0.0, no matter the debug options chosen; I'd say for at least 5 minutes. Almost immediately dropping the first 30s just because I've got debug options set is a big problem for troubleshooting other problems with early startup of install images.
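If the answer is that journals on live boots should just stay in /run/log/journal, journald's Storage= setting can express that; a sketch, assuming the live image could ship a configuration drop-in (the path is illustrative):

```ini
# e.g. /etc/systemd/journald.conf.d/liveos.conf (hypothetical path)
[Journal]
# volatile: keep journal data in /run/log/journal only; never flush to /var.
Storage=volatile
```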
Thanks,

-- Chris Murphy
Re: [systemd-devel] transient hang when starting cryptography setup for swap
This is still a bug on Fedora 30. I can only reproduce it on one computer; I'm not sure why.

The working case:

[ 23.310555] flap.local systemd[1]: dev-disk-by\x2duuid-dcae3053\x2d1cc2\x2d4890\x2da33b\x2d6d71b3dc97df.device: Changed dead -> plugged
[ 23.310639] flap.local systemd[1]: dev-disk-by\x2did-dm\x2dname\x2dcryptswap.device: Changed dead -> plugged
[ 23.310681] flap.local systemd[1]: dev-mapper-cryptswap.device: Changed dead -> plugged
[ 23.310724] flap.local systemd[1]: dev-mapper-cryptswap.device: Job 165 dev-mapper-cryptswap.device/start finished, result=done
[ 23.310765] flap.local systemd[1]: Found device /dev/mapper/cryptswap.

In the non-working case, that never happens. systemd just doesn't see the swap device appear. But in the early debug shell, 'blkid' sees it.

Chris Murphy
[systemd-devel] 5.2rc2, circular lock warning systemd-journal and btrfs_page_mkwrite
] fmac.local kernel: ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
[7.887408] fmac.local kernel: __mutex_lock+0x92/0x930
[7.887422] fmac.local kernel: ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
[7.887425] fmac.local kernel: ? rcu_read_lock_sched_held+0x6b/0x80
[7.887427] fmac.local kernel: ? module_assert_mutex_or_preempt+0x14/0x40
[7.887441] fmac.local kernel: ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
[7.887443] fmac.local kernel: ? sched_clock_cpu+0xc/0xc0
[7.887458] fmac.local kernel: ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
[7.887471] fmac.local kernel: btrfs_record_root_in_trans+0x44/0x70 [btrfs]
[7.887486] fmac.local kernel: start_transaction+0x95/0x4f0 [btrfs]
[7.887501] fmac.local kernel: btrfs_dirty_inode+0x44/0xd0 [btrfs]
[7.887503] fmac.local kernel: file_update_time+0xeb/0x140
[7.887518] fmac.local kernel: btrfs_page_mkwrite+0xfe/0x570 [btrfs]
[7.887520] fmac.local kernel: ? find_held_lock+0x32/0x90
[7.887522] fmac.local kernel: ? sched_clock+0x5/0x10
[7.887524] fmac.local kernel: do_page_mkwrite+0x2f/0x100
[7.887526] fmac.local kernel: do_wp_page+0x306/0x570
[7.887529] fmac.local kernel: __handle_mm_fault+0xce8/0x1730
[7.887532] fmac.local kernel: handle_mm_fault+0x16e/0x370
[7.887534] fmac.local kernel: do_user_addr_fault+0x1f9/0x480
[7.887536] fmac.local kernel: do_page_fault+0x33/0x210
[7.887538] fmac.local kernel: ? page_fault+0x8/0x30
[7.887540] fmac.local kernel: page_fault+0x1e/0x30
[7.887541] fmac.local kernel: RIP: 0033:0x7f97107ea383
[7.887544] fmac.local kernel: Code: ec 08 89 ee 49 89 d8 31 d2 6a 00 48 8b 4c 24 18 4c 89 f7 4c 8d 4c 24 30 e8 1a d8 ff ff 59 5e 85 c0 78 49 48 8b 44 24 20 31 d2 <48> 89 58 08 48 8b 5c 24 08 c7 40 01 00 00 00 00 66 89 50 05 c6 40
[7.887545] fmac.local kernel: RSP: 002b:7ffc79ed77a0 EFLAGS: 00010246
[7.887546] fmac.local kernel: RAX: 7f970eca84a8 RBX: 005d RCX:
[7.887548] fmac.local kernel: RDX: RSI: 7f97107ea408 RDI: 55d0300e8160
[7.887549] fmac.local kernel: RBP: 0001 R08: 0001 R09: 55d0300e8160
[7.887550] fmac.local kernel: R10: 7ffc79f6a080 R11: 35fc R12: 7ffc79ed78c8
[7.887551] fmac.local kernel: R13: 7ffc79ed78c0 R14: 55d0300efa60 R15: 00c3d9ef

-- Chris Murphy
Re: [systemd-devel] EFI loader partition unknown
On Wed, May 8, 2019 at 3:52 AM Lennart Poettering wrote:
> On Mo, 06.05.19 10:26, Chris Murphy (li...@colorremedies.com) wrote:
>
> > ...snip...
> >
> > Looks like it wants to mount root, but it's already mounted and hence
> > busy. Btrfs lets you do that, ext4 and XFS don't, they need to be bind
> > mounted instead. Just a guess.
>
> OK, so this is misleading. "systemd-dissect" does two things: first it
> tries to make sense of the partition table and what to do with
> it. Then it tries to extract OS metadata from the file systems itself
> (i.e. read /etc/machine-id + /etc/os-release). The latter part fails
> for some reason (probably because the mount options are different than
> what is already mounted, some file systems are more allergic to that
> than others), but that shouldn't really matter, as that is
> not used by systemd-gpt-generator.
>
> Hmm, one question, which boot loader are you using? Note that the ESP

GRUB, which I think acts mainly as a boot manager and to support Secure Boot, with EFI STUB as the actual bootloader.
> mounting logic only works if the boot loader tells us which partition
> it was booted from. This is an extra check to ensure that we only
> mount the correct ESP, the one that was actually used. In other words,
> this only works with a boot loader that implements the relevant part
> of https://systemd.io/BOOT_LOADER_INTERFACE.html i.e. the

I'm willing to bet 99% of the world's computers have one ESP. In fact I'd be surprised if it's even 1% that have 2 or more. I'm not convinced the UEFI spec really sanctions 2 or more ESPs, even if it's not outright proscribed. The language of the spec consistently says there's one.

> LoaderDevicePartUUID efi var. We probably should document that better
> in systemd-gpt-generator(8) though, could you please file a bug about
> that?
>
> In other words: use sd-boot, and all that stuff just works. With grub
> it doesn't, it doesn't let us know any of the bits we want to know.

OK, requiring a specific bootloader really isn't consistent with the language used in the discoverable partitions specification. If in reality what's needed to automatically mount to /efi is not only a partition type GUID but some bootloader-specific metadata inserted into memory at boot time, that's not a generic solution like the other discoverable partition types.

https://www.freedesktop.org/wiki/Specifications/DiscoverablePartitionsSpec/

-- Chris Murphy
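The interface Lennart describes comes down to the loader publishing the ESP's partition UUID in a single EFI variable under systemd's boot-loader-interface vendor GUID. A rough way to check from a booted system whether the boot loader set it; the efivarfs decoding details here are my assumption, not from the thread:

```shell
# LoaderDevicePartUUID under the systemd boot-loader-interface vendor GUID.
var=/sys/firmware/efi/efivars/LoaderDevicePartUUID-4a67b082-0a4c-41cf-b6c7-440b29bb8c4f
if [ -r "$var" ]; then
    # efivarfs prefixes a 4-byte attribute word; the payload is UTF-16LE,
    # so dropping NUL bytes recovers the ASCII UUID string.
    tail -c +5 "$var" | tr -d '\0'; echo
else
    echo "LoaderDevicePartUUID not set (loader does not implement the interface)"
fi
```

On systemd versions that have it, 'bootctl status' reports the same thing as part of its boot loader feature list.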
Re: [systemd-devel] EFI loader partition unknown
On Mon, May 6, 2019 at 10:26 AM Chris Murphy wrote:
> Looks like it wants to mount root, but it's already mounted and hence
> busy. Btrfs lets you do that, ext4 and XFS don't, they need to be bind
> mounted instead. Just a guess.

Nope, that's not correct. I can mount /dev/vda2 on /mnt just fine.

/dev/vda2 on / type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
[snip]
/dev/vda2 on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)

-- Chris Murphy
Re: [systemd-devel] EFI loader partition unknown
Waiting for device (parent + 2 partitions) to appear...
Found writable 'root' partition (UUID 87d5a92987174be9ad216482074d1409) of type xfs without verity on partition #2 (/dev/vda2)
Found writable 'esp' partition (UUID b5aa8c29b4ab4021b2b22326860bda97) of type vfat on partition #1 (/dev/vda1)
[Detaching after fork from child process 8612]
Successfully forked off '(sd-dissect)' as PID 8612.
Mounting xfs on /tmp/dissect-h21Wp5 (MS_RDONLY|MS_NODEV "")...
Failed to mount /dev/vda2 (type xfs) on /tmp/dissect-h21Wp5 (MS_RDONLY|MS_NODEV ""): Device or resource busy
Failed to mount dissected image: Device or resource busy
Failed to read /etc/hostname: No such file or directory
/etc/machine-id file is empty.
(sd-dissect) failed with exit status 1.
Failed to acquire image metadata: Protocol error
[Inferior 1 (process 8608) exited with code 01]
(gdb) quit

Looks like it wants to mount root, but it's already mounted and hence busy. Btrfs lets you do that, ext4 and XFS don't, they need to be bind mounted instead. Just a guess.

-- Chris Murphy
Re: [systemd-devel] EFI loader partition unknown
On Fri, May 3, 2019 at 2:23 AM Lennart Poettering wrote:
> On Fr, 03.05.19 00:37, Chris Murphy (li...@colorremedies.com) wrote:
>
> > ...snip...
>
> What does '/usr/lib/systemd/systemd-dissect /dev/vda' say?

[chris@localhost ~]$ sudo /usr/lib/systemd/systemd-dissect /dev/vda
Found writable 'root' partition (UUID 87d5a92987174be9ad216482074d1409) of type xfs without verity on partition #2 (/dev/vda2)
Found writable 'esp' partition (UUID b5aa8c29b4ab4021b2b22326860bda97) of type vfat on partition #1 (/dev/vda1)
Failed to acquire image metadata: Protocol error
[chris@localhost ~]$

-- Chris Murphy
[systemd-devel] EFI loader partition unknown
systemd-242-3.git7a6d834.fc31.x86_64

With the Fedora /etc/fstab entry for /boot/efi commented out, systemd isn't discovering the EFI system partition and mounting it. The /efi directory exists, and I've tried to boot with both enforcing=0 and selinux=0 due to an SELinux bug [1], but systemd doesn't even attempt to mount it. Should I file an upstream bug report?

[6.275104] frawvm.local systemd-gpt-auto-generator[636]: Waiting for device (parent + 3 partitions) to appear...
[6.281265] frawvm.local systemd-gpt-auto-generator[636]: EFI loader partition unknown.

blkid reports:
/dev/vda1: SEC_TYPE="msdos" UUID="927C-932C" TYPE="vfat" PARTLABEL="EFI System Partition" PARTUUID="0e3a48c0-3f1b-4ca7-99f4-32fd1d831cdc"

gdisk reports:
Partition number (1-3): 1
Partition GUID code: C12A7328-F81F-11D2-BA4B-00A0C93EC93B (EFI System)
Partition unique GUID: 0E3A48C0-3F1B-4CA7-99F4-32FD1D831CDC
First sector: 2048 (at 1024.0 KiB)
Last sector: 411647 (at 201.0 MiB)
Partition size: 409600 sectors (200.0 MiB)
Attribute flags:
Partition name: 'EFI System Partition'

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1293725

--
Chris Murphy
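For context, systemd-gpt-auto-generator identifies the ESP by its GPT partition *type* GUID rather than the filesystem type blkid reports. A trivial sanity check against the gdisk output above (a comparison sketch only; the generator's real matching logic lives in systemd's C code):

```python
# Well-known GPT type GUID for an EFI System Partition.
ESP_TYPE_GUID = "c12a7328-f81f-11d2-ba4b-00a0c93ec93b"

def is_esp(type_guid: str) -> bool:
    """Case-insensitive comparison of a GPT partition type GUID to the ESP GUID."""
    return type_guid.strip().lower() == ESP_TYPE_GUID

# The "Partition GUID code" gdisk reports for partition #1 above:
print(is_esp("C12A7328-F81F-11D2-BA4B-00A0C93EC93B"))  # True
```

Since the type GUID here is correct, the generator's "EFI loader partition unknown" message presumably comes from a different condition than a GUID mismatch.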
Re: [systemd-devel] transient hang when starting cryptography setup for swap
I wonder if this is related to the late random seed loading, or possibly a race with it? I'd expect only cryptsetup, needing a key derived from /dev/urandom, to want random data; yet I always see the hang after cryptsetup and mkswap succeed, with no corresponding swapon. I'm beginning to wonder if it's a kernel bug, but even with Fedora debug kernels I'm not seeing any information that explains these long gaps in the journal.

[ 14.623373] flap.local systemd[1]: var-lib-systemd-random\x2dseed.mount: Collecting.
...
[ 25.319006] flap.local audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-random-seed comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
...
[ 25.336505] flap.local systemd[582]: systemd-random-seed.service: Executing: /usr/lib/systemd/systemd-random-seed load

The first entry already seems late in startup, and then it's not until ~10 seconds later that the load is actually executed.

--
Chris Murphy
Re: [systemd-devel] transient hang when starting cryptography setup for swap
OK, the only thing running is `dev-mapper-eswap.device`; everything else is waiting. But even looking at its status doesn't reveal why it's waiting. I've updated the bug.

--
Chris Murphy
Re: [systemd-devel] transient hang when starting cryptography setup for swap
On Mon, Mar 25, 2019 at 3:18 AM Lennart Poettering wrote:
>
> On Do, 21.03.19 19:36, Chris Murphy (li...@colorremedies.com) wrote:
>
> > Hi,
> >
> > Problem Summary (which I forgot to put in the bug):
> > /etc/crypttab configured to use /dev/urandom to generate a new key
> > each boot for encrypted device to be used as swap. The dmcrypt device
> > is successfully created, and mkswap succeeds, but somewhere just
> > before (?) swapon the job gets stuck and boot hangs indefinitely.
> > There is no stuck swapon process listed by ps.
> >
> > This happens maybe 1 in 10 boots with Fedora 29. And maybe 1 in 2
> > boots with Fedora 30. I guess it could be a race of some kind? I'm not
> > really sure.
> >
> > I filed a bug with attachments and details here:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1691589
> >
> > But it's not a great bug report yet because I don't have enough
> > information why it's hanging. Ordinarily I'd use
> > `systemd.log_level=debug` but the problem never happens so far if I
> > use it. So I'm looking for advice on getting more information why it's
> > stuck.
>
> See:
>
> https://freedesktop.org/wiki/Software/systemd/Debugging/#index1h1
>
> In particular, you might want to do the "systemctl enable
> debug-shell.service" thing, so that you can do "systemctl list-jobs"
> and similar when it hangs to figure out what's going on.

Drat. I didn't see this until just now, so I didn't do list-jobs. However, I finally got a boot with systemd.log_level=debug to hang, so maybe that's useful? I attached it to the bug report:
https://bugzilla.redhat.com/show_bug.cgi?id=1691589

Next time it happens I'll do list-jobs.

--
Chris Murphy
[systemd-devel] transient hang when starting cryptography setup for swap
Hi,

Problem Summary (which I forgot to put in the bug): /etc/crypttab is configured to use /dev/urandom to generate a new key each boot for an encrypted device to be used as swap. The dmcrypt device is successfully created, and mkswap succeeds, but somewhere just before (?) swapon the job gets stuck and boot hangs indefinitely. There is no stuck swapon process listed by ps.

This happens maybe 1 in 10 boots with Fedora 29, and maybe 1 in 2 boots with Fedora 30. I guess it could be a race of some kind? I'm not really sure.

I filed a bug with attachments and details here:
https://bugzilla.redhat.com/show_bug.cgi?id=1691589

But it's not a great bug report yet because I don't have enough information about why it's hanging. Ordinarily I'd use `systemd.log_level=debug`, but so far the problem never happens if I use it. So I'm looking for advice on getting more information about why it's stuck.

--
Chris Murphy
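For reference, the setup described above looks roughly like the following fragment (the device path is a placeholder, not taken from the actual bug report; the mapping name "eswap" matches the `dev-mapper-eswap.device` unit mentioned later in the thread):

```
# /etc/crypttab: derive a fresh key from /dev/urandom every boot;
# the "swap" option makes the device be formatted as swap after setup
eswap  /dev/disk/by-partlabel/swap  /dev/urandom  swap

# /etc/fstab: use the resulting dm-crypt device as swap
/dev/mapper/eswap  none  swap  defaults  0 0
```

With this arrangement the key is never stored anywhere, so the swap contents are unrecoverable after poweroff, which is the point of the per-boot random key.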
Re: [systemd-devel] /var/log/journal full, journald is not removing journal files
On Thu, Jul 12, 2018 at 8:28 AM, Michal Koutný wrote:
> Hi Chris.
>
> On 07/11/2018 09:44 PM, Chris Murphy wrote:
>> Somehow journald would not
>> delete its own files until I had deleted a few manually.
> Indeed, see man page update [1] added recently for more details. I
> assume your space was occupied by active journal files. Do you have any
> detailed break down of /var/log/journal contents?
>
> Michal
>
> [1] https://github.com/systemd/systemd/commit/1a0d353b44e

It seems to be working automatically now that it's been cleaned out manually, with SystemMaxUse=1200M (which journald translates to 1.1G) and ~400MB of free space.

Jul 17 06:42:38 f28h.local systemd-journald[472]: System journal (/var/log/journal/bbe68372db9f4c589a1f67f008e70864) is 1.1G, max 1.1G, 0B free.

My original expectation was that SystemMaxUse=2G, the same size as the file system, would be limited by the default behavior of SystemKeepFree=, which the man page says is 15% of the file system size. In theory that would ensure 300MB free at all times. But that turns out not to be the case: it was definitely not deleting archived files in that configuration.

--
Chris Murphy
Re: [systemd-devel] /var/log/journal full, journald is not removing journal files
So I went back to the default journald.conf and rebooted, and /var/log/journal is still 100% full. So I manually deleted a bunch of files, getting to 60% full. Reboot again. And now journald does some kind of cleanup, and /var/log/journal is 28% full. Somehow journald would not delete its own files until I had deleted a few manually.

The default is too small, so for now I've gone back to SystemMaxUse=1200M, which causes systemd-journald at next reboot:

Jul 11 13:34:30 f28h.local systemd-journald[482]: System journal (/var/log/journal/bbe68372db9f4c589a1f67f008e70864) is 216.2M, max 1.1G, 983.7M free.

Not much to go on.

Chris Murphy
[systemd-devel] /var/log/journal full, journald is not removing journal files
systemd-238-8.git0e0aa59.fc28.x86_64

I'm really confused by what I'm seeing.

Jul 10 09:13:40 f28h.local systemd-journald[493]: System journal (/var/log/journal/bbe68372db9f4c589a1f67f008e70864) is 1.2G, max 1.3G, 90.0M free.

[chris@f28h ~]$ du -sh /var/log/journal/bbe68372db9f4c589a1f67f008e70864/
1.6G    /var/log/journal/bbe68372db9f4c589a1f67f008e70864/
[chris@f28h ~]$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p7  1.9G  1.9G   73M  97% /var/log

journald.conf is Fedora default except SystemMaxUse=2G

1. Somehow systemd-journald is deciding to max at 1.3G instead of the specified 2G, which is fine, but I don't know how it arrives at this.

2. Clearly "max 1.3G" is being busted: the contents of /var/log/journal are greater than the max. Non-journal files total ~83M and have not grown at all in the last week (and multiple reboots).

3. If I change to SystemMaxUse=1300M there is no change. No attempt by journald to clean up /var/log/journal, and no errors; it uses /run/log instead and never switches to persistent logging.

4. My reading of man journald.conf is that SystemKeepFree= defaults to 15% of the file system space, so even with SystemMaxUse=2G journald should have deleted journal files before getting to 100% full.

5. If I boot with systemd.log_level=debug, there are no journald entries that help explain why there's no transition from volatile to persistent storage, i.e. "hey, /var/log/journal is full, and I can't delete files because $reasons", or whatever.

Extra info: this is a 2G f2fs file system mounted at /var/log. It seems to be working well except for this little problem, and I don't see it being the cause; journald isn't even attempting to delete its own journal files to free up space. Anyway, the main thing that has me confused is the "max 1.3G" statement, which has been the same since the file system was 5% full upon creation, and then journals grew all the way to ENOSPC without any of them being deleted.
--
Chris Murphy
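The expectation in point 4 can be written down as a back-of-the-envelope calculation. This is a sketch of how I read the documented limits as combining (an assumption based on man journald.conf, not journald's actual code, which evidently arrives at different numbers):

```python
# Expected interaction of SystemMaxUse= and SystemKeepFree= per my reading of
# the man page: SystemKeepFree= defaults to 15% of the file system size, and
# the effective cap should be whichever limit bites first.
def expected_cap(system_max_use, fs_size, keep_free=None):
    if keep_free is None:
        keep_free = fs_size * 15 // 100  # documented default: 15% of fs size
    return min(system_max_use, fs_size - keep_free)

GiB = 1 << 30
# 2G file system with SystemMaxUse=2G: by this reading the cap should be
# roughly 1.7G, yet journald reports "max 1.3G" and then blows past it anyway.
print(expected_cap(2 * GiB, 2 * GiB))  # 1825361101 bytes, ~1.7G
```

Whatever clamping journald actually applies, neither the reported 1.3G cap nor this expected 1.7G figure matches the observed behavior of filling the file system to ENOSPC.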
Re: [systemd-devel] journal always corrupt
On Thu, Jun 7, 2018 at 10:32 AM, Mantas Mikulėnas wrote:
> On Thu, Jun 7, 2018 at 4:21 AM Chris Murphy wrote:
>>
>> [chris@f28h ~]$ sudo journalctl --verify
>> 15f1c8: Data object references invalid entry at 4855f8
>> File corruption detected at
>> /run/log/journal/bbe68372db9f4c589a1f67f008e70864/system.journal:4854c0
>> (of 8388608 bytes, 56%).
>> FAIL: /run/log/journal/bbe68372db9f4c589a1f67f008e70864/system.journal
>> (Bad message)
>> PASS: /var/log/journal/bbe68372db9f4c589a1f67f008e70864/system.journal
>> PASS: /var/log/journal/bbe68372db9f4c589a1f67f008e70864/user-1000.journal
>> [chris@f28h ~]$ ls -l /run/log/journal/bbe68372db9f4c589a1f67f008e70864/
>> total 8192
>> -rw-r-+ 1 root systemd-journal 8388608 Jun 6 14:28 system.journal
>> [chris@f28h ~]$
>>
>> systemd-238-8.git0e0aa59.fc28.x86_64
>>
>> It doesn't seem to matter whether this is on volatile or persistent
>> media, the very first journal file has corruption, subsequent ones
>> don't. I'm not sure how to troubleshoot this.
>
> More precisely, it's the *active* journal file, the one that journald is
> currently writing to. If it has been just a few seconds since the last
> write, you can probably safely assume that it's not fully flushed to disk
> yet. (This can apply to user-* journals as well, but they're relatively low
> traffic and so less likely to be online at the moment.)

This is a recent behavior.

--
Chris Murphy
[systemd-devel] journal always corrupt
[chris@f28h ~]$ sudo journalctl --verify
15f1c8: Data object references invalid entry at 4855f8
File corruption detected at /run/log/journal/bbe68372db9f4c589a1f67f008e70864/system.journal:4854c0 (of 8388608 bytes, 56%).
FAIL: /run/log/journal/bbe68372db9f4c589a1f67f008e70864/system.journal (Bad message)
PASS: /var/log/journal/bbe68372db9f4c589a1f67f008e70864/system.journal
PASS: /var/log/journal/bbe68372db9f4c589a1f67f008e70864/user-1000.journal
[chris@f28h ~]$ ls -l /run/log/journal/bbe68372db9f4c589a1f67f008e70864/
total 8192
-rw-r-+ 1 root systemd-journal 8388608 Jun 6 14:28 system.journal
[chris@f28h ~]$

systemd-238-8.git0e0aa59.fc28.x86_64

It doesn't seem to matter whether this is on volatile or persistent media: the very first journal file has corruption, subsequent ones don't. I'm not sure how to troubleshoot this.

[chris@f28h ~]$ sudo journalctl --verify
32ccc28: Invalid data object at hash entry 5179 of 233016
File corruption detected at /var/log/journal/bbe68372db9f4c589a1f67f008e70864/system.journal:32cca20 (of 58720256 bytes, 90%).
FAIL: /var/log/journal/bbe68372db9f4c589a1f67f008e70864/system.journal (Bad message)
PASS: /var/log/journal/bbe68372db9f4c589a1f67f008e70864/user-1000.journal

--
Chris Murphy
[systemd-devel] checking for resume= cmdline with hybrid suspend hibernate
I'm seeing this on a Fedora 28 system with systemd-238-8.git0e0aa59.fc28.x86_64

Jun 02 15:11:33 f28h.local systemd[1]: Starting Hybrid Suspend+Hibernate...

[chris@f28h ~]$ sudo cat /proc/cmdline
BOOT_IMAGE=/root28/boot/vmlinuz-4.17.0-0.rc7.git1.1.fc29.x86_64 root=UUID=2662057f-e6c7-47fa-8af9-ad933a22f6ec ro rootflags=subvol=root28 rhgb quiet rd.luks=0 rd.lvm=0 rd.md=0 rd.dm=0 enable_mtrr_cleanup=1 zswap.enabled=1 zswap.max_pool_percent=25 zswap.compressor=lz4 LANG=en_US.UTF-8 no_console_suspend

There is no resume= for the kernel to find the resulting hibernation image (and I'm going to set aside that I'm pretty sure Fedora's kernels don't even support restoring hibernation images when Secure Boot is enabled). At one time I thought systemd had a check for whether the cmdline contained a resume hint for the kernel to find the hibernation image, and wouldn't do any variant of hibernation if that hint is not present. But I guess not?

I'm pretty sure it's the DE that's asking for hybrid suspend+hibernate, but I don't know whose domain it is to check whether hibernate is actually supported. I'm vaguely aware of CanHibernate() but I don't know what all is included in that test.

Thanks,

--
Chris Murphy
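The check I had in mind is simple enough to sketch. This is my guess at what such a gate would look like (pure command-line parsing; it is not what systemd or CanHibernate() actually does, and CanHibernate() presumably weighs more than just the cmdline):

```python
# Sketch: refuse hibernation when the kernel command line carries no hint
# telling the next kernel where to find the hibernation image.
def has_resume_hint(cmdline: str) -> bool:
    """True if any cmdline token looks like a resume= or resume_offset= hint."""
    return any(tok.startswith(("resume=", "resume_offset="))
               for tok in cmdline.split())

# Abbreviated version of the cmdline quoted above: no resume= present.
print(has_resume_hint("root=UUID=2662057f ro rhgb quiet zswap.enabled=1"))  # False
```

On a real system the input would come from reading /proc/cmdline, as in the transcript above.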