Re: [systemd-devel] default journal retention policy
On Thu, Dec 22, 2022, at 11:00 AM, Lennart Poettering wrote:
> On Do, 22.12.22 10:56, Chris Murphy (li...@colorremedies.com) wrote:
>> Still another idea, we could add a new setting MinRetentionSec=90day
>> which would translate into "not less than 90 days" and would only
>> delete journal files once all the entries in a journal file are at
>> least 90 days old.
>
> Well, that naming would suggest it would override the size
> constraints, and it shouldn't. But yeah, ignoring the choice of name I
> think it would make sense to add that. Add an RFE.

Yeah, I'm not sure what to call it. LessRetentionSec or PrefRetentionSec add jargon, but maybe that's adequately dealt with by updating the journald.conf man page?

-- Chris Murphy
[systemd-devel] default journal retention policy
Hi,

The Fedora Workstation working group is considering reducing the journal retention policy from the upstream default.

This is the tracking issue:
https://pagure.io/fedora-workstation/issue/213

This is the Fedora development list discussion thread:
https://lists.fedoraproject.org/archives/list/de...@lists.fedoraproject.org/thread/NDO5S2KUUDO5G6JLKZGQNFBXOW5KHPR5/#XATT3XYFV2UALPTJTL5RSQD3D4IVNSVO

As Lennart mentions in that devel thread, it's preferred that the change be upstreamable, and the Fedora Workstation working group agrees.
https://lists.fedoraproject.org/archives/list/de...@lists.fedoraproject.org/message/XATT3XYFV2UALPTJTL5RSQD3D4IVNSVO/

The consensus of the discussion is that there should be less retention. The suggested retention period varies quite a bit, but I think 3-6 months is OK.

In practice, most configurations eventually end up with 4G of journals, since that's the cap. That is typically over a year of journals, but of course it really depends on additional configuration. E.g. in my case I do a lot of debugging, so I'm often enabling debug logging; therefore 4G worth of journal files accumulates pretty quickly, maybe in 3 months. As I understand it, rsyslog has a two-week retention policy by default.

journald supports quite a lot of knobs related to journal total size, free space, file sizes, and retention time. My favorite simple idea of the moment is to set a default MaxRetentionSec=100day, which translates to "probably not less than 90 days, but not more than 100 days" of retention. The policy looks at entry age to determine whether the retention threshold is met, but the garbage collection operates on whole journal files. So if a single entry in a file reaches 100 days, the whole file is deleted, which could plausibly remove a week or two of entries.

Still another idea: we could add a new setting MinRetentionSec=90day, which would translate into "not less than 90 days" and would only delete journal files once all the entries in a journal file are at least 90 days old.
By leaving all the other settings alone, the 4G cap (or, if less, the 10% of file system size rule) still applies. So in no case would any configuration end up using more space for logs.

Any thoughts?

-- Chris Murphy
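For reference, the combined policy under discussion might look like this in journald.conf. Note this is a sketch: SystemMaxUse and MaxRetentionSec are real settings, but MinRetentionSec is only the proposal from this thread and does not exist upstream, and the comments paraphrase the semantics as described above:

```ini
[Journal]
# Existing settings:
SystemMaxUse=4G          # the size cap; always wins over time-based policy
MaxRetentionSec=100day   # once an entry in a file reaches 100 days, the
                         # whole file is garbage collected

# Proposed in this thread (hypothetical, not a real option today):
#MinRetentionSec=90day   # never delete a file until all of its entries
                         # are at least 90 days old
```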
Re: [systemd-devel] How can we debug systemd-gpt-auto-generator failures?
On Thu, Jul 28, 2022, at 6:50 AM, Kevin P. Fleming wrote:
> I've got two systems that report a failure (exit code 1) every time
> systemd-gpt-auto-generator is run. There are a small number of reports
> of this affecting other users too:
>
> https://bugs.archlinux.org/task/73168
>
> This *may* be related to the use of ZFS, although I've got a
> half-dozen systems using ZFS and only two of them have this issue.

Are the two with the problem multiple-device ZFS? And the rest are single-device ZFS?

-- Chris Murphy
Re: [systemd-devel] No space left errors on shutdown with systemd-homed /home dir
On Mon, Jan 31, 2022, at 11:26 PM, Zygo Blaxell wrote:
> On Sat, Jan 29, 2022 at 10:53:00AM +0100, Goffredo Baroncelli wrote:
> It does suck that the kernel handles resizing below the minimum size of
> the filesystem so badly; however, even if it rejected the resize request
> cleanly with an error, it's not necessarily a good idea to attempt it.
> Pushing the lower limits of what is possible in resize to save a handful
> of GB is asking for trouble. It's far better to overestimate generously
> than to underestimate the minimum size.

Yeah, there's an inherent conflict with online shrink: the longer the time needed to relocate block groups, the more unpredictable operations can occur during that time to thwart any original estimates made about the shrink operation.

I wondered a while ago about a shrink API that takes the shrink size as a suggestion rather than as a definite target, and then the file system does the best job it can. Either this API reports the actual shrink size once it completes, or the requesting program needs to know to call BTRFS_IOC_FS_INFO and BTRFS_IOC_DEV_INFO to learn the actual size. This hypothetical API could have boundaries outside of which, if the kernel code estimates it's going to fall short, it could trigger a cancel of the shrink. This could be size or time based. e.g. BTRFS_IOC_RESIZE_BEST (effort).

-- Chris Murphy
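A best-effort shrink API of the kind described might behave like the following sketch. This is pure illustration: the function name and semantics are hypothetical, and the "filesystem" is just a simulated minimum achievable size standing in for what block-group relocation can actually reach:

```python
def best_effort_shrink(current_size, requested_size, achievable_min,
                       lower_bound=None):
    """Shrink toward requested_size, reporting what was actually achieved.

    achievable_min models the smallest size relocation can reach (not
    knowable in advance on a real filesystem).  lower_bound is the
    caller's "cancel if we can't even get this small" threshold,
    mirroring the cancel boundary described in the thread.
    Returns (resulting_size, completed).
    """
    achieved = max(requested_size, achievable_min)
    if lower_bound is not None and achieved > lower_bound:
        # The kernel estimates it will fall short of the caller's bound:
        # cancel, leaving the filesystem at its original size.
        return current_size, False
    return achieved, True

# The caller learns the actual size from the return value rather than
# assuming the request was honored exactly.
size, ok = best_effort_shrink(100, 40, achievable_min=55, lower_bound=60)
```

The key design point is that the requester treats the size as advisory and reads back the real outcome, instead of the current all-or-nothing behavior.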
Re: [systemd-devel] No space left errors on shutdown with systemd-homed /home dir
On Wed, Jun 1, 2022, at 5:36 AM, Colin Guthrie wrote:
> Goffredo Baroncelli wrote on 31/05/2022 19:12:
>
>> I suppose that colin.home is a sparse file, so even it has a length of
>> 394GB, it consumes only 184GB. So to me these are valid values. It
>> doesn't matter the length of the files. What does matter is the value
>> returned by "du -sh".
>>
>> Below I create a file with a length of 1000GB. However being a sparse
>> file, it doesn't consume any space and "du -sh" returns 0
>>
>> $ truncate -s 1000GB foo
>> $ du -sh foo
>> 0       foo
>> $ ls -l foo
>> -rw-r--r-- 1 ghigo ghigo 1 May 31 19:29 foo
>
> Yeah the file will be sparse.
>
> That's not really an issue, I'm not worried about the fact it's not
> consuming as much as it reports as that's all expected.
>
> The issue is that systemd-homed (or btrfs's fallocate) can't handle this
> situation and that user is effectively bricked unless migrated to a host
> with more storage space!

Hopefully there's time for systemd-252 for a change still? That version is what I expect to ship in Fedora 37 [1]. There's merit to sd-homed and I want it to be safe and reliable for users to keep using, in order to build momentum.

I really think sd-homed must move the shrink from logout to login. When the user logs out, they are decently likely to immediately close the laptop lid (thus suspend-to-RAM) or shut down. I don't know if the shrink can be cancelled. But regardless, there's going to be a period of time where the file system and storage stacks are busy, right at the time the user is expecting *imminent* suspend or shutdown, which no matter what has to be inhibited until the shrink is cancelled or completed, and all pending writes are flushed to stable media.

Next, consider the low battery situation. Upon notification, anyone with an 18+ month old battery knows there may be no additional warnings, and you could in fact get a power failure next.
In this scenario we have to depend on all storage stack layers, and the drive firmware, doing exactly the correct thing in order for the file system to be in a consistent state, mountable at next boot. I just think this is too much risk, and since sd-homed is targeted primarily at laptop users, all the more reason the fs resize operation should happen at login time, not logout.

In fact, sd-homed might want to inhibit a resize shrink operation if (a) AC power is not plugged in and (b) remaining battery is less than 30%, or some other reasonable value. The resize grow operation is sufficiently cheap and fast that I don't think it needs inhibiting.

Thoughts?

I also just found a few bug reports with a non-exhaustive search that also make me nervous about fs shrink at logout (also implying restart and shutdown) time:

On shutdown, homed resizes until it gets killed
https://github.com/systemd/systemd/issues/22901

Getting "New partition doesn't fit into backing storage, refusing"
https://github.com/systemd/systemd/issues/22255

fails to resize
https://github.com/systemd/systemd/issues/22124

[1] Branching from Rawhide August 9; the earliest release date would be October 18.

-- Chris Murphy
Re: [systemd-devel] No space left errors on shutdown with systemd-homed /home dir
On Sat, Jan 29, 2022 at 2:53 AM Goffredo Baroncelli wrote:
>
> I think that for the systemd use cases (single device FS), a simpler
> approach would be:
>
> fstatfs(fd, )
> needed = sfs.f_blocks - sfs.f_bavail;
> needed *= sfs.f_bsize
>
> needed = roundup_64(needed, 3*(1024*1024*1024))
>
> Compared with the original systemd-homed code, I made the following changes:
> 1) f_bfree is replaced by f_bavail (which seems to be more consistent with
>    the disk usage; to me it seems to also consider the metadata chunk
>    allocation)
> 2) the needed value is rounded up by 3GB in order to account for a further
>    1 data chunk and 2 metadata chunks (DUP)
>
> Comments ?

I'm still wondering if such a significant shrink is even indicated, in lieu of trim. Isn't it sufficient to just trim on logout, thus returning unused blocks to the underlying file system? And then do an fs resize (shrink or grow) as needed on login, so that the user home shows ~80% of the free space of the underlying file system?

homework-luks.c:3407:
/* Before we shrink, let's trim the file system, so that we need less space on disk during the shrinking */

-- Chris Murphy
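Goffredo's estimate above, expressed as a runnable sketch (using Python's statvfs here purely for illustration; the f_bavail choice and the 3 GiB round-up for one data chunk plus two DUP metadata chunks are taken straight from his proposal):

```python
import os

GIB = 1024 * 1024 * 1024

def roundup(value, granularity):
    """Round value up to the next multiple of granularity."""
    return (value + granularity - 1) // granularity * granularity

def estimate_needed_bytes(path):
    """Estimate the space a shrunk filesystem would need, per the proposal."""
    sfs = os.statvfs(path)
    # f_bavail rather than f_bfree, per the proposal: it appears to be more
    # consistent with disk usage, accounting for metadata chunk allocation.
    needed = (sfs.f_blocks - sfs.f_bavail) * sfs.f_frsize
    # Round up by 3 GiB: headroom for 1 data chunk + 2 metadata chunks (DUP).
    return roundup(needed, 3 * GIB)
```

Whether this headroom is generous enough in all btrfs profiles is exactly the open question in this subthread.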
Re: [systemd-devel] No space left errors on shutdown with systemd-homed /home dir
On Wed, Jan 26, 2022 at 4:19 PM Boris Burkov wrote:
>
> On Thu, Jan 27, 2022 at 12:07:53AM +0200, Apostolos B. wrote:
>> This is what homectl inspect user reports:
>>
>>   Disk Size: 128.0G
>>   Disk Usage: 3.8G (= 3.1%)
>>   Disk Free: 124.0G (= 96.9%)
>>
>> and this is what btrfs usage reports:
>>
>> sudo btrfs filesystem usage /home/toliz
>>
>> Overall:
>>     Device size:          127.98GiB
>>     Device allocated:       4.02GiB
>>     Device unallocated:   123.96GiB
>>     Device missing:           0.00B
>>     Used:                   1.89GiB
>>     Free (estimated):     124.10GiB  (min: 62.12GiB)
>>     Free (statfs, df):    124.10GiB
>>     Data ratio:                1.00
>>     Metadata ratio:            2.00
>>     Global reserve:         5.14MiB  (used: 0.00B)
>>     Multiple profiles:           no
>>
>> Data,single: Size:2.01GiB, Used:1.86GiB (92.73%)
>>    /dev/mapper/home-toliz    2.01GiB
>>
>> Metadata,DUP: Size:1.00GiB, Used:12.47MiB (1.22%)
>>    /dev/mapper/home-toliz    2.00GiB
>>
>> System,DUP: Size:8.00MiB, Used:16.00KiB (0.20%)
>>    /dev/mapper/home-toliz   16.00MiB
>>
>> Unallocated:
>>    /dev/mapper/home-toliz  123.96GiB
>
> OK, there is plenty of unallocated space, thanks for confirming.
>
> Looking at the stack trace a bit more, the only thing that really sticks
> out as suspicious to me is btrfs_shrink_device, I'm not sure who would
> want to do that or why.

systemd-homed by default uses btrfs on LUKS on a loop mount, with a backing file. On login, it grows the user home file system by some percentage (I think 80%) of the free space of the underlying file system. And on logout, it does both fstrim and shrinks the fs. I don't know why it does both; it seems adequate to do only fstrim on logout to return unused blocks to the underlying file system, and to do an fs resize on login to either grow or shrink the user home file system.

But also, we don't really have a great estimator of the minimum size a file system can be. `btrfs inspect-internal min-dev-size` is pretty broken right now:
https://github.com/kdave/btrfs-progs/issues/271

I'm not sure if systemd folks would use the libbtrfsutil facility to determine the minimum device shrink size? But also, even the kernel doesn't have a very good idea of how small a file system can be shrunk. Right now it basically has to just start trying, and does it one block group at a time.

Adding systemd-devel@

-- Chris Murphy
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On Thu, Dec 30, 2021 at 3:59 PM Chris Murphy wrote:
>
> ZFS uses volume and user properties which we could probably mimic with
> xattr. I thought I asked about xattr instead of subvolume names at one
> point in the thread but I don't see it. So instead of using subvolume
> names, what about stuffing this information in xattr? My gut instinct
> is this is less transparent and user friendly, it requires more tools
> to know how to use to troubleshoot and fix, etc.

Separate from whether the obscurity of an xattr is a good idea or not, read-only snapshots can't have xattrs added, removed, or modified. We can rename read-only snapshots, however.

While we could unset the ro property, that also wipes the received UUID used by send/receive. And while we could make an rw snapshot of the ro snapshot, modify the xattr, then make an ro snapshot of the rw snapshot, this alters the parent UUID also used by send/receive. So it complicates send/receive workflows as a potential update mechanism, or for backup/restore, for anything that tracks these UUIDs, e.g. btrbk.

-- Chris Murphy
Re: [systemd-devel] the need for a discoverable sub-volumes specification
(I'm sort of not doing a great job of using "sub-volume" to mean generically any of a Btrfs subvolume, a directory, or a logical volume, so hopefully anyone still following can make the leap that I don't intend this spec to be Btrfs specific. I like it being general purpose.)

On Tue, Dec 21, 2021 at 6:57 AM Ludwig Nussel wrote:
>
> The way btrfs is used in openSUSE is based on systems from ten years
> ago. A lot has changed since then. Now with the idea to have /usr on a
> separate read-only subvolume the current model doesn't really work very
> well anymore IMO. So I think there's a window of opportunity to change
> the way openSUSE does things :-)

ZFS uses volume and user properties which we could probably mimic with xattr. I thought I asked about xattr instead of subvolume names at one point in the thread but I don't see it. So instead of using subvolume names, what about stuffing this information in xattr? My gut instinct is this is less transparent and user friendly; it requires more tools to know how to use to troubleshoot and fix, etc.

-- Chris Murphy
Re: [systemd-devel] Antw: [EXT] Re: the need for a discoverable sub-volumes specification
On Mon, Dec 27, 2021 at 3:40 AM Ulrich Windl wrote:
>
>>>> Ludwig Nussel schrieb am 21.12.2021 um 14:57 in
> Nachricht <662e1a92-beb4-e1f1-05c9-e0b38e40e...@suse.de>:
> ...
>> The way btrfs is used in openSUSE is based on systems from ten years
>> ago. A lot has changed since then. Now with the idea to have /usr on a
>> separate read-only subvolume the current model doesn't really work very
>> well anymore IMO. So I think there's a window of opportunity to change
>> the way openSUSE does things :-)
>
> Oh well, while you are doing so: Also improve support for a separate /boot
> volume when snapshotting.

Yeah, how to handle /boot gives me headaches. We have a kind of rollback in the possibility of choosing among kernels. But which kernels are bootable depends on the /usr each is paired with. We need a mechanism to match /boot and /usr together, so that the user doesn't get stuck choosing a kernel version for which the modules don't exist in an older-generation /usr. And then, does this imply some additional functionality in the bootloader to achieve it, or should this information be fully encapsulated in Boot Loader Spec compliant snippets?

-- Chris Murphy
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On Tue, Dec 21, 2021 at 6:57 AM Ludwig Nussel wrote:
>
> Chris Murphy wrote:
>> The part I'm having a hard time separating is the implicit case (use
>> some logic to assemble the correct objects), versus explicit (the
>> bootloader snippet points to a root and the root contains an fstab -
>> nothing about assembly is assumed). And should both paradigms exist
>> concurrently in an installed system, and how to deconflict?
>
> Not sure there is a conflict. The discovery logic is well defined after
> all. Also I assume normal operation wouldn't mix the two. Package
> management or whatever installs updates would automatically do the right
> thing suitable for the system at hand.

rootflags=subvol=/subvolid= should override the discoverable sub-volumes generator. I don't expect rootflags is normally used in a discoverable sub-volumes workflow, but if the user were to add it for some reason, we'd want it to be favored.

>> Further, (open)SUSE tends to define the root to boot via `btrfs
>> subvolume set-default` which is information in the file system itself,
>> neither in the bootloader snippet nor in the naming convention. It's
>> neat, but also not discoverable. If users are trying to
>
> The way btrfs is used in openSUSE is based on systems from ten years
> ago. A lot has changed since then. Now with the idea to have /usr on a
> separate read-only subvolume the current model doesn't really work very
> well anymore IMO. So I think there's a window of opportunity to change
> the way openSUSE does things :-)

I think the transactional model can accommodate this better anyway, and it's the direction I'd like to go in with Fedora: make updates/upgrades happen out of band (in a container, on a snapshot). We can apply resource control limits so that the upgrade process doesn't negatively impact the user's higher-priority workload. If the update fails to complete, or fails a set of simple tests, the snapshot is simply discarded. No harm done to the running system.
If it passes checks, then its name is changed to indicate it's the favored "next root" following reboot. And we don't have to keep a database to snapshot, assemble, and discard things; it can all be done by naming scheme.

I think the naming scheme should include some sort of "in-progress" tag, so it's discoverable that such a sub-volume is (a) not active, (b) in some state of flux that potentially was interrupted, and (c) not critical to the system. Such a sub-volume should either be destroyed (failed update) or renamed (update succeeds). If the owning process were to fail (crash, power failure), the next time it runs to check for updates it would discover this "in-progress" sub-volume and remove it (assuming it's in a failed state).

-- Chris Murphy
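The name-driven lifecycle could be sketched like this. To be clear, the '+inprogress' and '+next' suffixes here are invented for illustration only; the thread hasn't settled on an actual naming convention:

```python
def next_actions(subvols, current_root):
    """Decide what to do with each sub-volume based on its name alone.

    Purely illustrative naming convention:
      '+inprogress' - an update snapshot still (or formerly) being modified
      '+next'       - passed checks, favored root after the next reboot
    """
    actions = {}
    for name in subvols:
        if name == current_root:
            actions[name] = "keep"       # the active root
        elif name.endswith("+inprogress"):
            # Found on a later run: assume the update was interrupted
            # and is in a failed state; remove it.
            actions[name] = "discard"
        elif name.endswith("+next"):
            actions[name] = "boot-next"  # rename/flagged for next boot
        else:
            actions[name] = "keep"
    return actions
```

The point is that no side database is consulted: state lives entirely in the names, so a crashed updater can recover just by listing sub-volumes.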
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On Thu, Nov 4, 2021 at 9:39 AM Lennart Poettering wrote:
> 3. Inside the "@auto" dir of the "super-root" fs, have dirs named
>    <type>[:<instance>]. The type should have a similar vocabulary
>    as the GPT spec type UUIDs, but probably use textual identifiers
>    rather than UUIDs, simply because naming dirs by uuids is
>    weird. Examples:
>
>    /@auto/root-x86-64:fedora_36.0/
>    /@auto/root-x86-64:fedora_36.1/
>    /@auto/root-x86-64:fedora_37.1/
>    /@auto/home/
>    /@auto/srv/
>    /@auto/tmp/
>
>    Which would be assembled by the initrd into the following via bind
>    mounts:
>
>    /         → /@auto/root-x86-64:fedora_37.1/
>    /home/    → /@auto/home/
>    /srv/     → /@auto/srv/
>    /var/tmp/ → /@auto/tmp/

What about arbitrary mount points and their subvolumes? Things we can't predict in advance for all use cases? For example, on my non-ephemeral systems:

* /var/log is a directory contained in subvolume "varlog-x86-64:fedora.35"
* /var/lib/libvirt/images is a directory contained in subvolume "varlibvirtimages-x86-64:fedora.35"
* /var/lib/flatpak is a directory contained in subvolume "varlibflatpak-x86-64:any" - as it isn't Fedora specific and uses its own versioning, in this case I'd expect it gets mounted with any distribution.

These exist so they are excluded from a snapshot and rollback regime that applies to "root-x86-64:fedora.35", which contains usr/ var/ etc/. A rollback of root does not roll back the systemd journal, VM images, or flatpaks.

Is space a valid separator in the name of the subvolume? Or underscore? This would become / to define the path to the mount point.

Additionally, I'm noticing that none of 'journalctl -o verbose', json, or export shows what subvolume was mounted at each mount point. I need to use systemd debug for this information to be included in the journal. Assembly of versioned roots is probably useful logging information by default. e.g.

Dec 10 10:45:00 fovo.local systemd[1]: Mounting '@auto/root-x86-64:fedora.35' at /sysroot...
Dec 10 10:45:11 fovo.local systemd[1]: Mounting '@auto/home' at /home...
Dec 10 10:45:11 fovo.local systemd[1]: Mounting '@auto/varlibflatpak' at /var/lib/flatpak...
Dec 10 10:45:11 fovo.local systemd[1]: Mounting '@auto/varlibvirtimages-x86-64:fedora.35' at /var/lib/libvirt/images...
Dec 10 10:45:11 fovo.local systemd[1]: Mounting '@auto/varlog-x86-64:fedora.35' at /var/log...
Dec 10 10:45:11 fovo.local systemd[1]: Mounting '@auto/swap' at /var/swap...

-- Chris Murphy
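The separator question above could be answered mechanically. As a sketch (and only a sketch: the '<path-with-separator>-<arch>:<os>' layout and the choice of underscore as the path separator are hypothetical, not anything agreed in this thread), an underscore standing in for '/' would make the mount point derivable from the subvolume name alone:

```python
def mount_point_from_name(name, sep="_"):
    """Derive a mount point from a sub-volume name.

    Hypothetical convention: '<path-with-sep>-<arch>:<os>', where sep
    (underscore here, one of the separators asked about above) stands
    in for '/' in the mount path.  So 'var_log-x86-64:fedora.35'
    would mount at /var/log.
    """
    # Strip the '-<arch>...' qualifier, if present.
    path_part = name.split("-", 1)[0] if "-" in name else name
    # Strip a bare ':<os>' qualifier, if no arch part preceded it.
    path_part = path_part.split(":", 1)[0]
    return "/" + path_part.replace(sep, "/")
```

The trade-off is that the path is fully discoverable from the name, at the cost of banning the separator character from path components themselves.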
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On Thu, Nov 11, 2021 at 12:28 PM Lennart Poettering wrote:
> That said: naked squashfs sucks. Always wrap your squashfs in a GPT
> wrapper to make things self-descriptive.

Do you mean the image file contains a GPT, and the squashfs is a partition within the image? Does this recommendation apply to any image? Let's say it's a Btrfs image. And in the context of this thread, the GPT partition type GUID would be the "super-root" GUID?

-- Chris Murphy
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On Tue, Nov 9, 2021 at 8:48 AM Ludwig Nussel wrote:
>
> Lennart Poettering wrote:
>> Or to say this explicitly: we could define the spec to say that if
>> we encounter:
>>
>>    /@auto/root-x86-64:fedora_36.0+3-0
>>
>> on first boot attempt we'd rename it:
>>
>>    /@auto/root-x86-64:fedora_36.0+2-1
>>
>> and so on. Until boot succeeds in which case we'd rename it:
>>
>>    /@auto/root-x86-64:fedora_36.0
>>
>> i.e. we'd drop the counting suffix.
>
> Thanks for the explanation and pointer!
>
> Need to think aloud a bit :-)
>
> That method basically works for systems with read-only root. Ie where
> the next OS to boot is in a separate snapshot, eg MicroOS.
> A traditional system with rw / on btrfs would stay on the same subvolume
> though. Ie the "root-x86-64:fedora_36.0" volume in the example. In
> openSUSE package installation automatically leads to ro snapshot
> creation. In order to fit in I suppose those could then be named eg.
> "root-x86-64:fedora_36.N+0" with increasing N. Due to the +0 the
> subvolume would never be booted.

Yeah, the N+0 subvolumes could be read-only snapshots; their purpose is only to be used as an immutable checkpoint from which to produce derivative, read-write subvolumes.

But what about the case of being in a preboot environment, having no way (yet) to rename or create a new snapshot to boot, and needing to boot one of these read-only snapshots? What if the bootloader were smart enough to add the proper volatile overlay arrangement any time an N+0 subvolume is chosen for boot? Is that plausible and useful?

> Anyway, let's assume the ro case and both efi partition and btrfs volume
> use this scheme. That means each time some packages are updated we get a
> new subvolume. After reboot the initrd in the efi partition would try to
> boot that new subvolume. If it reaches systemd-bless-boot.service the
> new subvolume becomes the default for the future.
>
> So far so good. What if I discover later that something went wrong
> though? Some convenience tooling to mark the current version bad again
> would be needed.
>
> But then having Tumbleweed in mind it needs some capability to boot any
> old snapshot anyway. I guess the solution here would be to just always
> generate a bootloader entry, independent of whether a kernel was
> included in an update. Each entry would then have to specify kernel,
> initrd and the root subvolume to use.

The part I'm having a hard time separating is the implicit case (use some logic to assemble the correct objects), versus explicit (the bootloader snippet points to a root and the root contains an fstab - nothing about assembly is assumed). And should both paradigms exist concurrently in an installed system, and how to deconflict?

Further, (open)SUSE tends to define the root to boot via `btrfs subvolume set-default`, which is information in the file system itself, neither in the bootloader snippet nor in the naming convention. It's neat, but also not discoverable. If users are trying to learn, understand, and troubleshoot how systems boot and assemble themselves, to what degree are they owed transparency without needing extra tools or decoder rings to reveal settings?

The default subvolume is uniquely btrfs, and without an equivalent anywhere else (so far as I'm aware) I'm reluctant to use it for day-to-day boots. I can see the advantage of this for btrfs for some sort of rescue/emergency boot subvolume, however: where the entry doesn't contain the parameter "rootflags=subvol=$root" (which acts as an override for the default subvolume set in the fs itself), the btrfs default subvolume would be used. I'm struggling with its role in all of this though.

-- Chris Murphy
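The rename scheme Lennart quotes above, which mirrors the '+<tries-left>-<tries-done>' counting used by systemd's automatic boot assessment, could be sketched as follows (an illustration of the counting logic only, operating on names as strings rather than on real subvolumes):

```python
import re

_COUNTER = re.compile(r"^(?P<base>.*)\+(?P<left>\d+)(?:-(?P<done>\d+))?$")

def next_name_after_boot_attempt(name):
    """One boot attempt: decrement tries-left, increment tries-done."""
    m = _COUNTER.match(name)
    if not m:
        return name  # no counting suffix: already blessed
    left = int(m.group("left"))
    done = int(m.group("done") or 0)
    if left == 0:
        return name  # out of tries; left for "mark bad" tooling to handle
    return f"{m.group('base')}+{left - 1}-{done + 1}"

def bless(name):
    """Boot succeeded: drop the counting suffix entirely."""
    return re.sub(r"\+\d+(?:-\d+)?$", "", name)
```

A "mark the current version bad again" tool would then just be another rename, e.g. restoring a counting suffix with zero tries left.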
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On Fri, Nov 19, 2021 at 4:17 AM Lennart Poettering wrote: > > On Do, 18.11.21 14:51, Chris Murphy (li...@colorremedies.com) wrote: > > > How to do swapfiles? > > Is this really a concept that deserves too much attention? *shrug* Only insofar as I like order, and like the idea of agreeing on where things belong if there's going to appear somewhere. > I mean, I > have the suspicion that half the benefit of swap space is that it can > act as backing store for hibernation. Yes and that's a terrible conflation. The swapfile/device is for anonymous pages. And hibernation images are not anon pages, and even have special rules like must be contained in contiguous physical device blocks. It may turn out that 'swsusp' (Swap Suspend) in the kernel shouldn't be deprecated, and instead focus future effort on 'uswsusp'. But discussions around signed and authenticated hibernation images for UEFI Secure Boot and kernel lockdown compatibility, have all been around the kernel implementation. https://www.kernel.org/doc/Documentation/power/swsusp.rst https://www.kernel.org/doc/Documentation/power/userland-swsusp.rst > But swap files are icky for that > since that means the resume code has to mount the fs first, but given > the fs is dirty during the hibernation state this is highly problematic. It's sufficiently complicated and non-fail-safe (it's fail danger) that it's broken. On btrfs, it's more tedious but less broken because you must use both resume=UUID=$uuid resume_offset=$physicaloffsethibernationimage In effect the kernel does not need to mount ro the btrfs file system at all, it gets the hint for the physical location of the hibernation image from kernel boot parameter. Other file systems support discovery of the physical offset once the file system is mounted ro. On Btrfs you can see the swapfile as having a punch through mechanism. It's a reservation of blocks, and page outs happen directly to that reservation of blocks, not via the file system itself. 
This is why there are all these limitations: balance doesn't touch block groups containing any swapfile blocks, you can't do any kind of multiple-device stuff, you can't snapshot/reflink the swapfile, etc. Which is why I'm in favor of just ceding this entire territory over to systemd to manage correctly.

But as a prerequisite, the hibernation image should be separate from the swapfile. And it should have a metadata format so we can pair file system state to hibernation image state, so that for sure we aren't running into catastrophic nonsense like this right at the top of https://www.kernel.org/doc/Documentation/power/swsusp.rst

**BIG FAT WARNING** If you touch anything on disk between suspend and resume... ...kiss your data goodbye. If you do resume from initrd after your filesystems are mounted... ...bye bye root partition.

Horrible.

> Hence, I have the suspicion that if you do swap you should probably do swap partitions, not swap files, because it can cover all usecases: paging *and* hibernation.

I agree only insofar as it's the most reliable thing we have right now. Not that it's an efficient or safe design: you can still have problems if you rw mount a file system and then resume from a hibernation image. The kernel has no concept of matching a file system state to that of a hibernation image, so that the hibernation image can be invalidated, thus avoiding subsequent corruption.

> > Currently I'm creating a "swap" subvolume in the top-level of the file system and /etc/fstab looks like this
> >
> > UUID=$FSUUID /var/swap btrfs noatime,subvol=swap 0 0
> > /var/swap/swapfile1 none swap defaults 0 0
> >
> > This seems to work reliably after hundreds of boots.
> >
> > a. Is this naming convention for the subvolume adequate? Seems like it can just be "swap" because the GPT method is just a single partition type GUID that's shared by multiboot Linux setups, i.e. not arch or distro specific
>
> I'd still put it one level down, and mark it with some non-typical character so that it is less likely to clash with anything else.

I'm not sure I understand "one level down". The "swap" subvolume would be in the top level of the Btrfs file system, just like Fedora's existing "root" and "home" subvolumes are in the top level.

> > b. Is the mount point, /var/swap, OK?
>
> I see no reason why not.

OK, super.

> > c. What should the additional naming convention be for the swapfile itself so swapon happens automatically?
>
> To me it appears these things should be distinct: if automatic activation of swap files is desirable, then there should probably be a systemd generator that finds all suitable files in /var/swap/ and generates .swap units for them. This would then w
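A generator along the lines described would scan /var/swap at boot and drop .swap units into the generator directory. A minimal sketch, hypothetical: real systemd generators are C binaries, and the full systemd-escape unit-name escaping is simplified here; the /var/swap location and swapfile* naming come from the thread:

```python
import os

def escape_path(path: str) -> str:
    """Simplified stand-in for `systemd-escape -p`: real escaping also
    encodes unusual characters as \\xNN, which is omitted here."""
    return path.strip("/").replace("/", "-")

def generate_swap_units(swap_dir: str = "/var/swap",
                        unit_dir: str = "/run/systemd/generator") -> list:
    """Emit a .swap unit for every swapfile found in swap_dir, the way
    a boot-time generator would. Returns the unit file names written."""
    os.makedirs(unit_dir, exist_ok=True)
    written = []
    for name in sorted(os.listdir(swap_dir)):
        path = os.path.join(swap_dir, name)
        if not (name.startswith("swapfile") and os.path.isfile(path)):
            continue
        unit = escape_path(path) + ".swap"
        with open(os.path.join(unit_dir, unit), "w") as f:
            f.write(f"[Unit]\nDescription=Swap file {path}\n\n"
                    f"[Swap]\nWhat={path}\n")
        written.append(unit)
    return written
```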
Re: [systemd-devel] Antw: [EXT] Re: [systemd-devel] the need for a discoverable sub-volumes specification
On Mon, Nov 22, 2021 at 3:02 AM Ulrich Windl wrote:

> >>> Lennart Poettering wrote on 19.11.2021 at 10:17:
> > On Do, 18.11.21 14:51, Chris Murphy (li...@colorremedies.com) wrote:
> >
> >> How to do swapfiles?
> >
> > Is this really a concept that deserves too much attention? I mean, I have the suspicion that half the benefit of swap space is that it can act as backing store for hibernation. But swap files are icky for that since that means the resume code has to mount the fs first, but given the fs is dirty during the hibernation state this is highly problematic.
> >
> > Hence, I have the suspicion that if you do swap you should probably do swap partitions, not swap files, because it can cover all usecases: paging *and* hibernation.
>
> Out of curiosity: What about swap LVs, possibly thin-provisioned ones?

I don't think that's supported.
https://listman.redhat.com/archives/linux-lvm/2020-November/msg00039.html

-- Chris Murphy
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On Thu, Nov 18, 2021 at 2:51 PM Chris Murphy wrote:

> How to do swapfiles?
>
> Currently I'm creating a "swap" subvolume in the top-level of the file system and /etc/fstab looks like this
>
> UUID=$FSUUID /var/swap btrfs noatime,subvol=swap 0 0
> /var/swap/swapfile1 none swap defaults 0 0
>
> This seems to work reliably after hundreds of boots.
>
> a. Is this naming convention for the subvolume adequate? Seems like it can just be "swap" because the GPT method is just a single partition type GUID that's shared by multiboot Linux setups, i.e. not arch or distro specific
> b. Is the mount point, /var/swap, OK?
> c. What should the additional naming convention be for the swapfile itself so swapon happens automatically?

Actually I'm thinking of something different suddenly... because without user ownership of swapfiles, and instead systemd having domain over this, it's perhaps more like:

/x-systemd.auto/swap -> /run/systemd/swap

And then systemd just manages the files in that directory per policy, e.g. does on-demand creation of swapfiles with variable size increments, as well as cleanup.

-- Chris Murphy
Re: [systemd-devel] the need for a discoverable sub-volumes specification
How to do swapfiles?

Currently I'm creating a "swap" subvolume in the top-level of the file system and /etc/fstab looks like this

UUID=$FSUUID /var/swap btrfs noatime,subvol=swap 0 0
/var/swap/swapfile1 none swap defaults 0 0

This seems to work reliably after hundreds of boots.

a. Is this naming convention for the subvolume adequate? Seems like it can just be "swap" because the GPT method is just a single partition type GUID that's shared by multiboot Linux setups, i.e. not arch or distro specific
b. Is the mount point, /var/swap, OK?
c. What should the additional naming convention be for the swapfile itself so swapon happens automatically?

Also, instead of /@auto/ I'm wondering if we could have /x-systemd.auto/? This makes it more clearly systemd's namespace, and while I'm a big fan of the @ symbol for typographic history reasons, it's being used in the subvolume/snapshot regimes rather haphazardly for different purposes, which might be confusing. E.g. Timeshift expects subvolumes it manages to be prefixed with @. Meanwhile SUSE uses @ for its (visible) root subvolume in which everything else goes. And ZFS uses @ for their (read-only) snapshots.

-- Chris Murphy
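To make the discovery-by-name idea concrete, the analogue of the GPT type GUID could be a simple table keyed on a reserved name prefix. A hypothetical sketch; the x-systemd. prefix is the proposal above, but the name-to-mountpoint table is invented for illustration and not part of any spec:

```python
# Hypothetical name-to-mountpoint table for sub-volume discovery,
# playing the role the partition-type GUIDs play in the Discoverable
# Partitions Specification. None of these names are standardized.
CONVENTION = {
    "root": "/",
    "home": "/home",
    "srv": "/srv",
    "var": "/var",
}

def mountpoint_for(subvol: str, prefix: str = "x-systemd."):
    """Map a conventionally named sub-volume to its mount point, or
    None when the name falls outside the convention (e.g. Timeshift's
    @-prefixed subvolumes would simply be ignored)."""
    if not subvol.startswith(prefix):
        return None
    return CONVENTION.get(subvol[len(prefix):])
```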
Re: [systemd-devel] the need for a discoverable sub-volumes specification
Lennart most recently (about a year ago) wrote on this in a mostly unrelated Fedora devel@ thread. I've found the following relevant excerpts and provide the source URLs as well.

> BTW, we once upon a time added a TODO list item of adding a btrfs generator to systemd, similar to the existing GPT generator: it would look at the subvolumes of the root btrfs fs, and then try to mount stuff it finds if it follows a certain naming scheme.

https://lists.fedoraproject.org/archives/list/de...@lists.fedoraproject.org/message/M756KVDNY65VONU3GA5CSXB4LBJD3ZIW/

> All I am asking for is to make this simple and robust and forward looking enough so that we can later add something like the generator I proposed without having to rearrange anything. i.e. make the most basic stuff self-describing now, even if the automatic discovering/mounting of other subvols doesn't happen today, or even automatic snapshotting. By doing that correctly now, you can easily extend things later incrementally without breaking stuff, just by *adding* stuff. And you gain immediate compat with "systemd-nspawn --image=" right away as the basic minimum, which already is great.

https://lists.fedoraproject.org/archives/list/de...@lists.fedoraproject.org/message/JB2PMFPPRS4YII3Q4BMHW3V33DM2MT44/

> We manage to name RPMs with versions, epochs, archs and so on, I doubt we need much more for naming subvolumes to auto-assemble.

https://lists.fedoraproject.org/archives/list/de...@lists.fedoraproject.org/message/VBVFQOG5EYI73CGFVCLMGX72IZUCQEYG/

-- Chris Murphy
[systemd-devel] the need for a discoverable sub-volumes specification
There is a Discoverable Partitions Specification
http://systemd.io/DISCOVERABLE_PARTITIONS/

The problem with this for Btrfs, ZFS, and LVM is that a single volume can represent multiple use cases via multiple volumes: subvolumes (Btrfs), datasets (ZFS), and logical volumes (LVM). I'll just use the term sub-volume for all of these, but I'm open to some other generic term.

None of the above volume managers expose the equivalent of GPT's partition type GUID per sub-volume. One possibility that's available right now is the sub-volume's name. All we need is a spec for that naming convention.

An early prototype of this idea was posted by Lennart:
https://0pointer.net/blog/revisiting-how-we-put-together-linux-systems.html

Lennart previously mentioned elsewhere that this is probably outdated. So let's update it and bring it more in line with the purpose and goal set out in the discoverable partitions spec, which is to obviate the need for /etc/fstab.

-- Chris Murphy
Re: [systemd-devel] [EXT] Re: consider dropping defrag of journals on btrfs
On Tue, Feb 9, 2021 at 8:02 AM Phillip Susi wrote:

> Chris Murphy writes:
>
> > Basically correct. It will merge random writes such that they become sequential writes. But it means inserts/appends/overwrites for a file won't be located with the original extents.
>
> Wait, I thought that was only true for metadata, not normal file data blocks? Well, maybe it becomes true for normal data if you enable compression. Or small files that get leaf packed into the metadata chunk.

Both data and metadata.

> If it's really combining streaming writes from two different files into a single interleaved write to the disk, that would be really silly.

It's not interleaving. It uses delayed allocation to make random writes into sequential writes. It tries harder to keep file blocks together in the nossd case, and is a bit more opportunistic with the ssd mount option. And it also depends on the pattern of the writer. There's btrfs-heatmap to get a visual idea of these behaviors.
https://github.com/knorrie/btrfs-heatmap

-- Chris Murphy
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Mon, Feb 8, 2021 at 8:20 AM Phillip Susi wrote:

> Chris Murphy writes:
>
> > I showed that the archived journals have way more fragmentation than active journals. And the fragments in active journals are insignificant, and can even be reduced by fully allocating the journal
>
> Then clearly this is a problem with btrfs: it absolutely should not be making the files more fragmented when asked to defrag them.

I've asked. We'll see.

> > file to final size rather than appending - which has a good chance of fragmenting the file on any file system, not just Btrfs.
>
> And yet, you just said the active journal had minimal fragmentation.

Yes, the extents are consistently 8MB in the nodatacow case, old and new file system alike. Same as ext4 and XFS.

> That seems to mean that the 8mb fallocates that journald does is working well. Sure, you could probably get fewer fragments by fallocating the whole 128 mb at once, but there are tradeoffs to that that are not worth it. One fragment per 8 mb isn't a big deal. Ideally a filesystem will manage to do better than that ( didn't btrfs have a persistent reservation system for this purpose? ), but it certainly should not commonly do worse.

I don't think any of the file systems guarantee a contiguous block range upon fallocate; they only guarantee that writes to fallocated space will succeed, i.e. it's a space reservation. But yeah, in practice 8MB is small enough that chances are you'll see one 8MB extent. And I agree 8MB isn't a big deal.

Does anyone complain about journal fragmentation on ext4 or XFS? If not, then we come full circle to my second email in the thread, which is: don't defragment when nodatacow, only defragment when datacow. Or use BTRFS_IOC_DEFRAG_RANGE and specify an 8MB length. That does seem to consistently no-op on nodatacow journals, which have 8MB extents.
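Unlike the argument-less BTRFS_IOC_DEFRAG, BTRFS_IOC_DEFRAG_RANGE takes a 48-byte argument struct, which is what allows passing a length and an extent-size threshold like the 8MB suggested here. A sketch of the call from Python, assuming the struct btrfs_ioctl_defrag_range_args layout from linux/btrfs.h:

```python
import fcntl
import struct

# _IOW(0x94, 16, struct btrfs_ioctl_defrag_range_args); the struct is
# u64 start, u64 len, u64 flags, u32 extent_thresh, u32 compress_type,
# u32 unused[4] -- 48 bytes total.
BTRFS_IOC_DEFRAG_RANGE = 0x40309410

def defrag_range_args(start: int = 0, length: int = 8 * 1024 * 1024,
                      extent_thresh: int = 32 * 1024 * 1024) -> bytes:
    """Pack the argument struct. Extents already at least
    extent_thresh bytes long are skipped, which matches the no-op
    behaviour observed on nodatacow journals with 8MB extents."""
    return struct.pack("=QQQII4I", start, length, 0,
                       extent_thresh, 0, 0, 0, 0, 0)

def defrag_range(fd: int, **kwargs) -> None:
    """Issue the ioctl; only valid on a file descriptor on Btrfs."""
    fcntl.ioctl(fd, BTRFS_IOC_DEFRAG_RANGE, defrag_range_args(**kwargs))
```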
> > Further, even *despite* this worse fragmentation of the archived journals, bcc-tools fileslower shows no meaningful latency as a result. I wrote this in the previous email. I don't understand what you want me to show you.
>
> *Of course* it showed no meaningful latency because you did the test on an SSD, which has no meaningful latency penalty due to fragmentation. The question is how bad is it on HDD.

The reason I'm dismissive is because the nodatacow fragment case is the same as ext4 and XFS; the datacow fragment case is both spectacular and non-deterministic. The workload will matter for where these random 4KiB journal writes end up on an HDD. I've seen journals with hundreds to thousands of extents. I'm not sure what we learn from me doing a single isolated test on an HDD.

And also, only defragmenting on rotation strikes me as leaving performance on the table, right? If there is concern about fragmented archived journals, then isn't there concern about fragmented active journals?

But it sounds to me like you want to learn what the performance is of journals defragmented with BTRFS_IOC_DEFRAG specifically? I don't think it's interesting, because you're still better off leaving nodatacow journals alone, and something still has to be done in the datacow case. It's two extremes. What the performance is doesn't matter; it's not going to tell you anything you can't already infer from the two layouts.

> > And since journald offers no ability to disable the defragment on Btrfs, I can't really do a longer term A/B comparison can I?
>
> You proposed a patch to disable it. Test before and after the patch.

Is there a test mode for journald to just dump a bunch of random stuff into the journal to age it? I don't want to wait weeks to get a dozen journal files.

> > I did provide data. That you don't like what the data shows: archived journals have more fragments than active journals, is not my fault.
> > The existing "optimization" is making things worse, in addition to adding a pile of unnecessary writes upon journal rotation.
>
> If it is making things worse, that is definitely a bug in btrfs. It might be nice to avoid the writes on SSD though since there is no benefit there.

Agreed.

-- Chris Murphy
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Mon, Feb 8, 2021 at 7:56 AM Phillip Susi wrote:

> Chris Murphy writes:
>
> >> It sounds like you are arguing that it is better to do the wrong thing on all SSDs rather than do the right thing on ones that aren't broken.
> >
> > No I'm suggesting there isn't currently a way to isolate defragmentation to just HDDs.
>
> Yes, but it sounded like you were suggesting that we shouldn't even try, not just that it isn't 100% accurate. Sure, some SSDs will be stupid and report that they are rotational, but most aren't stupid, so it's a good idea to disable the defragmentation on drives that report that they are non rotational.

So far as I've seen, all USB devices report rotational: all USB flash drives, and any SSD in an enclosure. Maybe there's some way of estimating rotational based on latency standard deviation, and sticking that in sysfs, instead of trusting device reporting. But in the meantime, the imperfect rule could be: do not defragment unless it's SCSI/SATA/SAS and it reports that it's rotational.

-- Chris Murphy
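The imperfect sysfs check under discussion reads the kernel's per-device rotational attribute; a sketch, with the caveat from the thread that USB bridges routinely report 1 regardless of the actual medium:

```python
def is_rotational(device: str, sys_block: str = "/sys/block"):
    """Return the kernel's rotational flag for a block device, e.g.
    is_rotational("sda"); None when the attribute can't be read.
    USB enclosures often report 1 for SSDs, so a 1 is only a hint."""
    try:
        with open(f"{sys_block}/{device}/queue/rotational") as f:
            return f.read().strip() == "1"
    except OSError:
        return None
```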
Re: [systemd-devel] [EXT] Re: consider dropping defrag of journals on btrfs
On Mon, Feb 8, 2021 at 1:24 AM Ulrich Windl wrote:

> I didn't follow the thread tightly, but there was a happy mix of IOps, fragments (and no bandwidth), but I wonder here: Isn't it a concept of BtrFS that writes are fragmented if there is no contiguous free space? The idea was *not* to spend time trying to find a good space to write to, but use the next available one.

Basically correct. It will merge random writes such that they become sequential writes. But it means inserts/appends/overwrites for a file won't be located with the original extents.

> >> If you want an optimization that's actually useful on Btrfs, /var/log/journal/ could be a nested subvolume. That would prevent any
>
> Actually I still didn't get the benefit of a BtrFS subvolume, but that's a different topic: Don't all writes end up in a single storage pool?

Subvolumes/snapshots are file b-trees. It's the granularity of snapshots, send/receive, and the fsync log tree. And at least the user space tools don't do recursive snapshotting, so they stop at subvolume boundaries, which can be important in some cases if the intent is to use nodatacow: snapshots result in nodatacow files being subject to cow.

-- Chris Murphy
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Fri, Feb 5, 2021 at 8:23 AM Phillip Susi wrote:

> Chris Murphy writes:
>
> > But it gets worse. The way systemd-journald is submitting the journals for defragmentation is making them more fragmented than just leaving them alone.
>
> Wait, doesn't it just create a new file, fallocate the whole thing, copy the contents, and delete the original?

Same inode, so no. As to the logic, I don't know. I'll ask upstream to document it.

> How can that possibly make fragmentation *worse*?

I'm only seeing this pattern with journald journals, and BTRFS_IOC_DEFRAG. But I'm also seeing it with all archived journals. Meanwhile, active journals exhibit no different pattern from ext4 and XFS, no worse fragmentation.

Consider other storage technologies where COW and snapshots come into play. For example, anything based on device-mapper thin provisioning is going to run into these issues. How it allocates physical extents isn't up to the file system. Duplicate a file and delete the original, and you might get a more fragmented file as well. The physical layout is entirely decoupled from the file system: the file system could tell you "no fragmentation" and yet it is highly fragmented, or vice versa.

These problems are not unique to Btrfs. Is there a VFS API for handling these issues? Should there be? I really don't think any application, including journald, should have to micromanage these kinds of things on a case-by-case basis. General problems like this need general solutions.

> > All of those archived files have more fragments (post defrag) than they had when they were active. And here is the FIEMAP for the 96MB file which has 92 fragments.
>
> How the heck did you end up with nearly 1 frag per mb?

I didn't do anything special, it's a default configuration. I'll ask Btrfs developers about it. Maybe it's one of those artifacts of FIEMAP I mentioned previously.
Maybe it's not that badly fragmented for a drive that's going to reorder reads anyway, to be more efficient about it.

> > If you want an optimization that's actually useful on Btrfs, /var/log/journal/ could be a nested subvolume. That would prevent any snapshots above from turning the nodatacow journals into datacow journals, which does significantly increase fragmentation (it would in the exact same case if it were a reflink copy on XFS for that matter).
>
> Wouldn't that mean that when you take snapshots, they don't include the logs?

That's a snapshot/rollback regime design and policy question. If you snapshot the subvolume that contains the journals, the journals will be in the snapshot. The user space tools do not have an option for recursive snapshots, so snapshotting does end at subvolume boundaries. If you want the journals snapshot, then their enclosing subvolume would need to be snapshot.

> That seems like an anti feature that violates the principle of least surprise. If I make a snapshot of my root, I *expect* it to contain my logs.

You can only roll back that which you snapshot. If you snapshot a root without excluding journals, then if you roll back, you roll back the journals. That's data loss.

(open)SUSE has a snapshot/rollback regime configured and enabled by default out of the box. Logs are excluded from it, same as the bootloader. (Although I'll also note they default to volatile systemd journals, and use rsyslogd for persistent logs.) Fedora meanwhile does have persistent journald journals in the root subvolume, but there's no snapshot/rollback regime enabled out of the box. I'm inclined to have them excluded, not so much to avoid cow of the nodatacow journals, but to avoid discontinuity in the journals upon rollback.

> > I don't get the iops thing at all. What we care about in this case is latency. A least-noticeable latency of around 150ms seems reasonable as a starting point; that's where users notice a delay between a key press and a character appearing. However, if I check for 10ms latency (using bcc-tools fileslower) when reading all of the above journals at once:
> >
> > $ sudo journalctl -D /mnt/varlog33/journal/b51b4a725db84fd286dcf4a790a50a1d/ --no-pager
> >
> > Not a single report. None. Nothing took even 10ms. And those journals are more fragmented than your 20 in a 100MB file.
> >
> > I don't have any hard drives to test this on. This is what, 10% of the market at this point? The best you can do there is the same as on SSD.
>
> The above sounded like great data, but not if it was done on SSD.

Right. But also I can't disable the defragmentation in order to do a proper test on HDD.

> > You can't depend on sysfs to conditionally do defragmentation on only rotational media, too many fragile media claim to be rotating.
Re: [systemd-devel] consider dropping defrag of journals on btrfs
More data points.

1. An ext4 file system with a 112M system.journal; it has 15 extents. From FIEMAP we can pretty much see it's really made from 14 8MB extents, consistent with multiple appends. And it's the exact same behavior seen on Btrfs with nodatacow journals.
https://pastebin.com/6vuufwXt

2. A Btrfs file system with a 24MB system.journal, nodatacow, 4 extents. The fragments are consistent with #1 as a result of nodatacow journals.
https://pastebin.com/Y18B2m4h

3. Continuing from #2, 'journalctl --rotate' strace shows this results in:

ioctl(31, BTRFS_IOC_DEFRAG) = 0

filefrag shows the result, 17 extents. But this is misleading because 9 of them are in the same position as before, so it seems to be a minimalist defragment. Btrfs did what was requested but with both limited impact and efficacy, at least on nodatacow files having minimal fragmentation to begin with.
https://pastebin.com/1ufErVMs

4. Continuing from #3, 'btrfs fi defrag -l 32M' pointed at this same file results in a single-extent file. strace shows this uses

ioctl(3, BTRFS_IOC_DEFRAG_RANGE, {start=0, len=33554432, flags=0, extent_thresh=33554432, compress_type=BTRFS_COMPRESS_NONE}) = 0

and filefrag shows the single extent mapping:
https://pastebin.com/429fZmNB

While this is a numeric improvement (no fragmentation), again there's no proven advantage to defragmenting nodatacow journals on Btrfs. It's just needlessly contributing to write amplification.

--

The original commit description only mentions COW; it doesn't mention being predicated on nodatacow. In effect commit f27a386430cc7a27ebd06899d93310fb3bd4cee7 is obviated by commit 3a92e4ba470611ceec6693640b05eb248d62e32d four months later. I don't think they were ever intended to be used together, and combining them seems accidental. Defragmenting datacow files makes some sense on rotating media. But that's the exception, not the rule.
-- Chris Murphy
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Fri, Feb 5, 2021 at 3:55 PM Lennart Poettering wrote:

> On Fr, 05.02.21 20:58, Maksim Fomin (ma...@fomin.one) wrote:
>
> > > You know, we issue the btrfs ioctl, under the assumption that if the file is already perfectly defragmented it's a NOP. Are you suggesting it isn't a NOP in that case?
> >
> > So, what is the reason for defragmenting the journal if BTRFS is detected? This does not happen at other filesystems. I have read this thread but have not found a clear answer to this question.
>
> btrfs like any file system fragments files with nocow a bit. Without nocow (i.e. with cow) it fragments files horribly, given our write pattern (which is: append something to the end, and update a few pointers in the beginning). By upstream default we set nocow, some downstreams/users undo that however. (this is done via tmpfiles, i.e. journald doesn't actually set nocow ever).

I don't see why it's upstream's problem to solve downstream decisions. If they want to (re)enable datacow, then they can also set up some kind of service to defragment /var/log/journal/ on a schedule, or they can use autodefrag.

> When we archive a journal file (i.e. stop writing to it) we know it will never receive any further writes. It's a good time to undo the fragmentation (we make no distinction whether heavily fragmented, little fragmented or not at all fragmented on this) and thus for the future make access behaviour better, given that we'll still access the file regularly (because archiving in journald doesn't mean we stop reading it, it just means we stop writing it — journalctl always operates on the full data set). defragmentation happens in the bg once triggered, it's a simple ioctl you can invoke on a file. if the file is not fragmented it shouldn't do anything.

ioctl(3, BTRFS_IOC_DEFRAG_RANGE, {start=0, len=16777216, flags=0, extent_thresh=33554432, compress_type=BTRFS_COMPRESS_NONE}) = 0

What 'len' value does journald use?
> other file systems simply have no such ioctl, and they never fragment as terribly as btrfs can fragment. hence we don't call that ioctl.

I did explain how to avoid the fragmentation in the first place, to obviate the need to defragment:

1. nodatacow. journald does this already.
2. fallocate the intended final journal file size from the start, instead of growing the files in 8MB increments.
3. Don't reflink copy (including snapshot) the journals. This arguably is not journald's responsibility, but as it creates both the journal/ directory and the $MACHINEID directory, it could make one or both of them subvolumes instead, to ensure they're not subject to snapshotting from above.

> I'd even be fine dropping it entirely, if someone actually can show the benefits of having the files unfragmented when archived don't outweigh the downside of generating some iops when executing the defragmentation.

I showed that the archived journals have way more fragmentation than active journals. And the fragments in active journals are insignificant, and can even be reduced by fully allocating the journal file to its final size rather than appending - which has a good chance of fragmenting the file on any file system, not just Btrfs.

Further, even *despite* this worse fragmentation of the archived journals, bcc-tools fileslower shows no meaningful latency as a result. I wrote this in the previous email. I don't understand what you want me to show you. And since journald offers no ability to disable the defragment on Btrfs, I can't really do a longer-term A/B comparison, can I?

> i.e. someone does some profiling, on both ssd and rotating media. Apparently noone who cares about this apparently wants to do such research though, and hence I remain deeply unimpressed. Let's not try to do such optimizations without any data that actually shows it betters things.

I did provide data. That you don't like what the data shows (archived journals have more fragments than active journals) is not my fault. The existing "optimization" is making things worse, in addition to adding a pile of unnecessary writes upon journal rotation.

Conversely, you have not provided data proving that nodatacow fallocated files on Btrfs are any more fragmented than fallocated files on ext4 or XFS. 2-17 fragments on ext4:
https://pastebin.com/jiPhrDzG
https://pastebin.com/UggEiH2J

That behavior is no different for nodatacow fallocated journals on Btrfs. There's no point in defragmenting these no matter the file system. I don't have to profile this on HDD; I know that even in the best case you're not likely (certainly not guaranteed) to get fewer fragments than this. Defrag on Btrfs is for the thousands-of-fragments case, which is what you get with datacow journals.

-- Chris Murphy
Re: [systemd-devel] udev and btrfs multiple devices
On Thu, Feb 4, 2021 at 6:28 AM Lennart Poettering wrote:

> On Mi, 03.02.21 22:32, Chris Murphy (li...@colorremedies.com) wrote:
>
> > It doesn't. It waits indefinitely.
> >
> > [* ] A start job is running for /dev/disk/by-uuid/cf9c9518-45d4-43d6-8a0a-294994c383fa (12min 36s / no limit)
>
> Is this on encrypted media?

No. Plain partitions.

-- Chris Murphy
Re: [systemd-devel] consider dropping defrag of journals on btrfs
> ...bad access patterns because the archived files are fragmented.

Right. So pick a size for the journal file, I don't really care what it is, but they seem to get upwards of 128MB in size so just use that. Make a 128MB file from the very start, fallocate it, and then when full, rotate and create a new one. Stop the anti-pattern of tacking on in 8MB increments. And stop defragmenting them. That is the best scenario for HDD, USB sticks, and NVMe.

Looking at the two original commits, I think they were always in conflict with each other, happening within months of each other. They are independent ways of dealing with the same problem, where only one of them is needed. And the best of the two is fallocate+nodatacow, which makes the journals behave the same as on ext4, where you also don't do defragmentation.

-- Chris Murphy
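The difference between the two allocation strategies is just where the fallocate calls land; a sketch of both, using the 8MB-increment and 128MB-up-front figures from the thread:

```python
import os

MB = 1024 * 1024

def grow_incrementally(fd: int, final: int = 128 * MB, step: int = 8 * MB) -> None:
    """Append-style growth, as journald does today: each extension is
    a separate space reservation and may land in a separate extent."""
    size = 0
    while size < final:
        os.posix_fallocate(fd, size, step)
        size += step

def grow_upfront(fd: int, final: int = 128 * MB) -> None:
    """Reserve the final size in one call; the filesystem gets one
    chance to find contiguous space. Note fallocate only reserves
    space; no filesystem guarantees the range is contiguous."""
    os.posix_fallocate(fd, 0, final)
```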
Re: [systemd-devel] udev and btrfs multiple devices
On Wed, Feb 3, 2021 at 10:32 PM Chris Murphy wrote:

> On Thu, Jan 28, 2021 at 7:18 AM Lennart Poettering wrote:
> > On Mi, 27.01.21 17:19, Chris Murphy (li...@colorremedies.com) wrote:
> >
> > > Is it possible for a udev rule to have a timeout? For example: /usr/lib/udev/rules.d/64-btrfs.rules
> > >
> > > This udev rule will wait indefinitely for a missing device to appear.
> >
> > Hmm, no, that's a misunderstanding. "rules" can't "wait". The activation of the btrfs file system won't happen, but that should then be caught by systemd mount timeouts and put you into recovery mode.
>
> It doesn't. It waits indefinitely.
>
> [* ] A start job is running for /dev/disk/by-uuid/cf9c9518-45d4-43d6-8a0a-294994c383fa (12min 36s / no limit)

https://github.com/systemd/systemd/issues/18466

-- Chris Murphy
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Wed, Feb 3, 2021 at 9:46 AM Lennart Poettering wrote:

> Performance is terrible if cow is used on journal files while we write them.

I've done it for a year on NVMe. The latency is so low, it doesn't matter.

> It would be great if we could turn datacow back on once the files are archived, and then take benefit of compression/checksumming and stuff. not sure if there's any sane API for that in btrfs besides rewriting the whole file, though. Anyone knows?

A compressed file results in a completely different encoding and extent size, so it's a complete rewrite of the whole file, regardless of the cow/nocow status. Without compression it'd still be a rewrite, because in effect it's a different extent type that comes with checksums. i.e. a reflink copy of a nodatacow file can only be a nodatacow file; a reflink copy of a datacow file can only be a datacow file. The conversion between them is basically 'cp --reflink=never' and you get a complete rewrite. But you get a complete rewrite of extents by submitting for defragmentation too, depending on the target extent size.

It is possible to do what you want by no longer setting nodatacow on the enclosing dir. Create a 0-length journal file, set nodatacow on that file, then fallocate it. That gets you a nodatacow active journal. And then you can just duplicate it in place with a new name, and the result will be datacow and automatically compressed if compression is enabled. But the write hit has already happened by writing journal data into this journal file during its lifetime. Just rename it on rotate. That's the least IO impact possible at this point. Defragmenting it means even more writes, and not much of a gain if any, unless it's datacow, which isn't the journald default.

-- Chris Murphy
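The create-empty, mark-nodatacow, then-fallocate sequence described here maps to an FS_IOC_SETFLAGS ioctl between open and fallocate; the +C flag only sticks while the file is still zero-length. A sketch, assuming the flag constants from linux/fs.h, written best-effort so it degrades gracefully on filesystems without the flag:

```python
import fcntl
import os
import struct

FS_IOC_GETFLAGS = 0x80086601  # _IOR('f', 1, long)
FS_IOC_SETFLAGS = 0x40086602  # _IOW('f', 2, long)
FS_NOCOW_FL = 0x00800000      # the chattr +C bit

def create_nocow_journal(path: str, size: int) -> bool:
    """Create an empty file, try to mark it NOCOW while it is still
    zero-length (the flag has no effect once data blocks exist), then
    fallocate the full size. Returns whether NOCOW was applied; on
    filesystems that don't support the flag it is silently skipped."""
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o640)
    nocow = False
    try:
        try:
            flags = struct.unpack("i", fcntl.ioctl(
                fd, FS_IOC_GETFLAGS, struct.pack("i", 0)))[0]
            fcntl.ioctl(fd, FS_IOC_SETFLAGS,
                        struct.pack("i", flags | FS_NOCOW_FL))
            nocow = True
        except OSError:
            pass  # e.g. not Btrfs
        os.posix_fallocate(fd, 0, size)
    finally:
        os.close(fd)
    return nocow
```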
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Wed, Feb 3, 2021 at 9:41 AM Lennart Poettering wrote: > > On Di, 05.01.21 10:04, Chris Murphy (li...@colorremedies.com) wrote: > > > f27a386430cc7a27ebd06899d93310fb3bd4cee7 > > journald: whenever we rotate a file, btrfs defrag it > > > > Since systemd-journald sets nodatacow on /var/log/journal the journals > > don't really fragment much. I typically see 2-4 extents for the life > > of the journal, depending on how many times it's grown, in what looks > > like 8MiB increments. The defragment isn't really going to make any > > improvement on that, at least not worth submitting it for additional > > writes on SSD. While laptop and desktop SSD/NVMe can handle such a > > small amount of extra writes with no meaningful impact to wear, it > > probably does have an impact on much more low end flash like USB > > sticks, eMMC, and SD Cards. So I figure, let's just drop the > > defragmentation step entirely. > > Quite frankly, given how iops-expensive btrfs is, one probably > shouldn't choose btrfs for such small devices anyway. It's really not > where btrfs shines, last time I looked. Btrfs aggressively delays metadata and data allocation, so I don't agree that it's expensive. There is a wandering trees problem that can result in write amplification, that's a different problem. But via native compression overall writes are proven to significantly reduce overall writes. But in any case, reading a journal file and rewriting it out, which is what defragment does, doesn't really have any benefit given the file doesn't fragment much anyway due to (a) nodatacow and (b) fallocate, which is what systemd-journald does on Btrfs. It'd make more sense to defragment only if the file is datacow. At least then it also gets compressed, which isn't the case when it's nodatacow. > > > Further, since they are nodatacow, they can't be submitted for > > compression. There was a quasi-bug in Btrfs, now fixed, where > > nodatacow files submitted for decompression were compressed. 
So we no > > longer get that unintended benefit. This strengthens the case to just > > drop the defragment step upon rotation, no other changes. > > > > What do you think? > > Did you actually check the iops this generates? I don't understand the relevance. > > Not sure it's worth doing these kinds of optimizations without any hard > data how expensive this really is. It would be premature. Submitting the journal for defragment in effect duplicates the journal: read all extents, and rewrite those blocks to a new location. It doubles the writes for that journal file. It's not like the defragment is free. > That said, if there's actual reason to optimize the iops here then we > could make this smart: there's actually an API for querying > fragmentation: we could defrag only if we notice the fragmentation is > really too high. FIEMAP isn't going to work in the case the files are compressed. The Btrfs extent size becomes 128KiB in that case, and it looks like massive fragmentation. So that needs to be made smarter first. I don't have a problem submitting the journal for a one-time defragment upon rotation if it's datacow, if an empty journal-nocow.conf exists. But by default, the combination of fallocate and nodatacow already avoids all meaningful fragmentation, so long as the journals aren't being snapshot. If they are, well, that too is a different problem. If the user does that and we're still defragmenting the files, it'll explode their space consumption because defragment is not snapshot aware; it results in all shared extents becoming unshared. > But quite frankly, this sounds like polishing things after the horse > already left the stable: if you want to optimize iops, then don't use > btrfs. If you bought into btrfs, then apparently you are OK with the > extra iops it generates, hence also the defrag costs. Somehow I think you're missing what I'm asking for, which is to stop the unnecessary defragment step because it's not an optimization. 
It doesn't meaningfully reduce fragmentation at all, it just adds write amplification. -- Chris Murphy ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/systemd-devel
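The NOCOW claim above can be spot-checked without btrfs-progs. A Python sketch using the FS_IOC_GETFLAGS ioctl; the constants are transcribed from linux/fs.h, the path glob is illustrative, and an empty result just means no persistent journal files are present or readable:

```python
# Check whether journal files carry the NOCOW ('C') attribute that
# lsattr would print. Constants from linux/fs.h (64-bit layout).
import fcntl
import glob
import struct

FS_IOC_GETFLAGS = 0x80086601   # _IOR('f', 1, long) on 64-bit Linux
FS_NOCOW_FL = 0x00800000       # the 'C' attribute

results = []
for path in glob.glob("/var/log/journal/*/*.journal"):
    try:
        with open(path, "rb") as f:
            buf = bytearray(8)
            fcntl.ioctl(f.fileno(), FS_IOC_GETFLAGS, buf)
    except OSError:
        continue               # unreadable file, or fs without attr support
    flags = struct.unpack("l", buf)[0]
    results.append((path, "NOCOW" if flags & FS_NOCOW_FL else "datacow"))

for path, state in results:
    print(path, state)
```

On a stock setup every file should report NOCOW, consistent with the thread's premise; `filefrag` from e2fsprogs can then confirm the low extent counts mentioned above.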
Re: [systemd-devel] udev and btrfs multiple devices
On Thu, Jan 28, 2021 at 7:18 AM Lennart Poettering wrote: > > On Mi, 27.01.21 17:19, Chris Murphy (li...@colorremedies.com) wrote: > > > Is it possible for a udev rule to have a timeout? For example: > > /usr/lib/udev/rules.d/64-btrfs.rules > > > > This udev rule will wait indefinitely for a missing device to > > appear. > > Hmm, no, that's a misunderstanding. "rules" can't "wait". The > activation of the btrfs file system won't happen, but that should then > be caught by systemd mount timeouts and put you into recovery mode. It doesn't. It waits indefinitely. [* ] A start job is running for /dev/disk/by-uuid/cf9c9518-45d4-43d6-8a0a-294994c383fa (12min 36s / no limit) -- Chris Murphy
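For reference, that indefinite wait can be capped per device in /etc/fstab with x-systemd.device-timeout= (a documented systemd.mount option; the UUID and timeout value here are illustrative):

```
UUID=cf9c9518-45d4-43d6-8a0a-294994c383fa  /srv  btrfs  defaults,nofail,x-systemd.device-timeout=90s  0 0
```

With a finite device timeout plus nofail, boot proceeds after the timeout instead of sitting on a "no limit" start job.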
Re: [systemd-devel] udev and btrfs multiple devices
On Thu, Jan 28, 2021 at 1:03 AM Greg KH wrote: > > On Wed, Jan 27, 2021 at 05:19:38PM -0700, Chris Murphy wrote: > > > > Next, is it possible to enhance udev so that it can report the number > > of devices expected for a Btrfs file system? This information is > > currently in the Btrfs superblock found on each device in the > > num_devices field. > > https://github.com/storaged-project/udisks/pull/838#issuecomment-768372627 > > It's not up to udev to report that, but rather have either the kernel > export that, or have the tool that udev calls determine that. I mean expose in udevadm info, e.g. E: ID_BTRFS_NUM_DEVICES=4 -- Chris Murphy
[systemd-devel] udev and btrfs multiple devices
Is it possible for a udev rule to have a timeout? For example: /usr/lib/udev/rules.d/64-btrfs.rules This udev rule will wait indefinitely for a missing device to appear. It'd be better if it gave up at some point and dropped to a dracut shell. Is that possible? The only alternative right now is the user has to force power off, and boot with something like rd.break=pre-mount, although I'm not 100% certain that'll break soon enough to avoid the hang. Next, is it possible to enhance udev so that it can report the number of devices expected for a Btrfs file system? This information is currently in the Btrfs superblock found on each device in the num_devices field. https://github.com/storaged-project/udisks/pull/838#issuecomment-768372627 Thanks, -- Chris Murphy
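For context, the rule being referenced looks approximately like this (paraphrased from systemd's rules.d/64-btrfs.rules; exact contents vary by version). Note that the rule itself never waits; it only marks the device not ready for systemd:

```
SUBSYSTEM!="block", GOTO="btrfs_end"
ACTION=="remove", GOTO="btrfs_end"
ENV{ID_FS_TYPE}!="btrfs", GOTO="btrfs_end"

# register the device with the kernel and ask whether the fs is complete
IMPORT{builtin}="btrfs ready $devnode"

# if not all member devices have been seen yet, hide it from systemd
ENV{ID_BTRFS_READY}=="0", ENV{SYSTEMD_READY}="0"

LABEL="btrfs_end"
```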
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Tue, Jan 5, 2021 at 10:04 AM Chris Murphy wrote: > > f27a386430cc7a27ebd06899d93310fb3bd4cee7 > journald: whenever we rotate a file, btrfs defrag it > > Since systemd-journald sets nodatacow on /var/log/journal the journals > don't really fragment much. I typically see 2-4 extents for the life > of the journal, depending on how many times it's grown, in what looks > like 8MiB increments. The defragment isn't really going to make any > improvement on that, at least not worth submitting it for additional > writes on SSD. While laptop and desktop SSD/NVMe can handle such a > small amount of extra writes with no meaningful impact to wear, it > probably does have an impact on much more low end flash like USB > sticks, eMMC, and SD Cards. So I figure, let's just drop the > defragmentation step entirely. > > Further, since they are nodatacow, they can't be submitted for > compression. There was a quasi-bug in Btrfs, now fixed, where > nodatacow files submitted for defragmentation were compressed. So we no > longer get that unintended benefit. This strengthens the case to just > drop the defragment step upon rotation, no other changes. > > What do you think? A better idea. Default behavior: journals are nodatacow and are not defragmented. If '/etc/tmpfiles.d/journal-nocow.conf' exists, do the reverse. Journals are datacow, and files are defragmented (and compressed, if it's enabled). -- Chris Murphy
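As it happens, later systemd releases ship the NOCOW setup as a tmpfiles.d entry, which makes an override along the lines proposed here straightforward (contents approximate; check the file shipped with your version):

```
# /usr/lib/tmpfiles.d/journal-nocow.conf (as shipped): sets the NOCOW
# attribute on the journal directory so new journal files inherit it.
h /var/log/journal - - - - +C
```

Masking it with an empty file of the same name under /etc/tmpfiles.d/ leaves journals copy-on-write, and therefore compressible.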
Re: [systemd-devel] Antw: [EXT] emergency shutdown, don't wait for timeouts
On Mon, Jan 4, 2021 at 12:43 PM Phillip Susi wrote: > > > Reindl Harald writes: > > > i have seen "user manager" instances hanging for way too long and way > > more than 3 minutes over the last 10 years > > The default timeout is 3 minutes iirc, so at that point it should be > forcibly killed. Hi, This is too long for a desktop or laptop use case. It should be around 15-20 seconds. It's completely reasonable for users to reach for the power button and force it off by 30 seconds. Fedora Workstation Working Group is tracking an issue expressly to get to around 20 seconds (or better). https://pagure.io/fedora-workstation/issue/163 It is a given there will be some kind of state or data loss by just forcing a shutdown. I think what we need is for the console, revealed by ESC, to contain sufficient information on what is holding back the reboot/shutdown and why, so that we can figure out why those processes aren't terminating fast enough and get them fixed. A journaled file system should just do log replay at the next mount and the file system itself will be fine. Fine means consistent. But for overwriting file systems, files could be left in an in-between state. It just depends on what's being written to, and when and how. A COW file system can better survive an abrupt poweroff since nothing is being overwritten. But I'm skeptical that just virtually pulling the power cord is such a great idea to depend on. And for offline updates, we'd want to inhibit the aggressive reboot/shutdown, to ensure updating is complete and all writes are on stable media. But for the aggressive shutdown case, some way of forcing remount ro? Or possibly FIFREEZE/FITHAW? Some boot/bootloader folks have asked fs devs for an atomic freeze+thaw ioctl, i.e. one that is guaranteed to return to thaw. But this has been rebuffed so far. While thaw seems superfluous for the use case under discussion, it's possible the poweroff command will be blocked by the freeze. 
And the thaw itself can be blocked by the freeze, when sysroot is the file system being frozen. -- Chris Murphy
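The freeze/thaw hazard described above, sketched as pseudocode (FIFREEZE and FITHAW are the real ioctls from linux/fs.h; everything else here is illustrative):

```
fd = open("/sysroot", O_RDONLY)
ioctl(fd, FIFREEZE)    # all new writes to the fs now block in the kernel
run_shutdown_steps()   # must not touch the frozen fs...
ioctl(fd, FITHAW)      # ...because even reaching this call can deadlock if
                       # the caller (or poweroff itself) needs a write first
```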
[systemd-devel] consider dropping defrag of journals on btrfs
f27a386430cc7a27ebd06899d93310fb3bd4cee7 journald: whenever we rotate a file, btrfs defrag it Since systemd-journald sets nodatacow on /var/log/journal the journals don't really fragment much. I typically see 2-4 extents for the life of the journal, depending on how many times it's grown, in what looks like 8MiB increments. The defragment isn't really going to make any improvement on that, at least not worth submitting it for additional writes on SSD. While laptop and desktop SSD/NVMe can handle such a small amount of extra writes with no meaningful impact to wear, it probably does have an impact on much more low end flash like USB sticks, eMMC, and SD Cards. So I figure, let's just drop the defragmentation step entirely. Further, since they are nodatacow, they can't be submitted for compression. There was a quasi-bug in Btrfs, now fixed, where nodatacow files submitted for defragmentation were compressed. So we no longer get that unintended benefit. This strengthens the case to just drop the defragment step upon rotation, no other changes. What do you think? -- Chris Murphy
Re: [systemd-devel] btrfs raid not ready but systemd tries to mount it anyway
On Mon, Oct 12, 2020 at 1:33 AM Lennart Poettering wrote: > > On So, 11.10.20 14:57, Chris Murphy (li...@colorremedies.com) wrote: > > > Hi, > > > > A Fedora 32 (systemd-245.8-2.fc32) user has a 10-drive Btrfs raid1 set > > to mount in /etc/fstab: > > > > UUID=f89f0a16- /srv btrfs defaults,nofail,x-systemd.requires=/ > > 0 0 > > > > For some reason, systemd is trying to mount this file system before > > all ten devices are ready. Supposedly this rule applies: > > https://github.com/systemd/systemd/blob/master/rules.d/64-btrfs.rules.in > > udev calls the btrfs ready ioctl whenever a new btrfs fs block device > shows up. The ioctl will fail as long as not all devices that make up > the fs have shown up. It succeeds once all devices for the fs are > there. i.e. for n=10 devices it will return failure 9 times, and > success the 1 final time. > > When precisely it returns success or failure is entirely up to the btrfs > kernel > code. systemd/udev doesn't have any control on that. The udev btrfs > builtin is too trivial for that: it just calls the ioctl and that > pretty much is it. What does this line mean? Does it mean the 'btrfs ready' ioctl has been called at this moment and the device is ready? i.e. this specific device is ready now, but not before now? [ 30.923721] kernel: BTRFS: device label BTRFS_RAID1_srv devid 1 transid 60815 /dev/sdg scanned by systemd-udevd (710) Because I see six such lines for this file system before the mount attempt. And four such lines after the mount attempt. If "all devices ready" is not true until the last such line appears, then the mount is happening too soon for some reason. > For historical reasons udev log level is independent from the rest of > systemd log level. Thus use udev.log_priority=debug to turn on udev > debug logging. I'll have him retry with udev.log_priority=debug and if I get a moment I'll try to reproduce. 
The difficulty is that truly missing devices are easy to reproduce, and that case appears to work; here the devices are merely late to be scanned for whatever reason (maybe they take longer to spin up, maybe the HBA they're connected to is slow or has a later-loading driver, etc.). -- Chris Murphy
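The "btrfs ready" call described above can be sketched in Python for illustration (the real udev builtin is C inside systemd; the ioctl number is computed from its _IOR() components rather than hardcoded, and the fallback paths are assumptions for machines without a /dev/btrfs-control node):

```python
# Sketch of what udev's "btrfs ready" builtin amounts to: ask the kernel,
# via /dev/btrfs-control, whether all member devices of the fs that this
# device belongs to have been seen yet.
import fcntl
import os
import struct
import sys

BTRFS_IOCTL_MAGIC = 0x94
VOL_ARGS_SIZE = 8 + 4088   # struct btrfs_ioctl_vol_args: __s64 fd + name[4088]

def _IOR(magic: int, nr: int, size: int) -> int:
    # Linux _IOR(): direction=read (2) in the top bits, then size, type, number.
    return (2 << 30) | (size << 16) | (magic << 8) | nr

BTRFS_IOC_DEVICES_READY = _IOR(BTRFS_IOCTL_MAGIC, 39, VOL_ARGS_SIZE)

def btrfs_ready(devnode: str) -> str:
    args = bytearray(struct.pack("q4088s", 0, devnode.encode()))
    try:
        fd = os.open("/dev/btrfs-control", os.O_RDWR)
    except OSError:
        return "unavailable"   # no btrfs-control node, or no permission, here
    try:
        fcntl.ioctl(fd, BTRFS_IOC_DEVICES_READY, args)
        return "ready"         # kernel has seen every member device
    except OSError:
        return "not-ready"     # at least one member device still missing
    finally:
        os.close(fd)

result = btrfs_ready(sys.argv[1] if len(sys.argv) > 1 else "/dev/sda")
print(result)
```

Each time udev processes a btrfs device it makes essentially this one call; "not ready" just sets SYSTEMD_READY=0, which is why a rule cannot wait.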
Re: [systemd-devel] btrfs raid not ready but systemd tries to mount it anyway
On Sun, Oct 11, 2020 at 11:56 PM Andrei Borzenkov wrote: > > 11.10.2020 23:57, Chris Murphy wrote: > > Hi, > > > > A Fedora 32 (systemd-245.8-2.fc32) user has a 10-drive Btrfs raid1 set > > to mount in /etc/fstab: > > > > UUID=f89f0a16- /srv btrfs defaults,nofail,x-systemd.requires=/ > > 0 0 > > > > For some reason, systemd is trying to mount this file system before > > all ten devices are ready. Supposedly this rule applies: > > https://github.com/systemd/systemd/blob/master/rules.d/64-btrfs.rules.in > > > > Fedora does have /usr/lib/udev/rules.d/64-btrfs.rules but I find no > > reference at all to this rule when the user boots with 'rd.udev.debug > > systemd.log_level=debug'. The entire journal is here: > > > > https://drive.google.com/file/d/1jVHjAQ8CY9vABtM2giPTB6XeZCclm7R-/view > > > > Educated guess - rule is missing in initrd and you do not run udev > trigger after switch to root. I will ask the user to double check their initrd, but mine definitely has it without any initrd/dracut related customizations. $ sudo lsinitrd initramfs-5.8.8-200.fc32.x86_64.img | grep btrfs btrfs -rw-r--r-- 1 root root 616 May 29 12:35 usr/lib/udev/rules.d/64-btrfs.rules -- Chris Murphy
[systemd-devel] btrfs raid not ready but systemd tries to mount it anyway
Hi, A Fedora 32 (systemd-245.8-2.fc32) user has a 10-drive Btrfs raid1 set to mount in /etc/fstab: UUID=f89f0a16- /srv btrfs defaults,nofail,x-systemd.requires=/ 0 0 For some reason, systemd is trying to mount this file system before all ten devices are ready. Supposedly this rule applies: https://github.com/systemd/systemd/blob/master/rules.d/64-btrfs.rules.in Fedora does have /usr/lib/udev/rules.d/64-btrfs.rules but I find no reference at all to this rule when the user boots with 'rd.udev.debug systemd.log_level=debug'. The entire journal is here: https://drive.google.com/file/d/1jVHjAQ8CY9vABtM2giPTB6XeZCclm7R-/view I expect a workaround would be to use mount option: x-systemd.automount,noauto,nofail,x-systemd.requires=/ In fact, I'm not sure x-systemd.requires is needed, because / must be mounted successfully to read /etc/fstab in the first place, in order to know to mount this file system at /srv. Anyway, I'm mainly confused why the btrfs udev rule is seemingly not applied in this case. -- Chris Murphy
Re: [systemd-devel] [Help] Can't log in to homed user account: "No space left on device"
On Mon, Aug 24, 2020 at 2:44 AM Andrii Zymohliad wrote: > > > I suspect that the workaround until > > this is figured out why the fallocate fails (I suspect shared extents, > > there's evidence this home file has been snapshot and I don't know > > what effect that has on fallocating the file) is to use > > --luks-discard=true ? That should avoid the need to fallocate when > > opening. > > Just to confirm, "homectl update --luks-discard=on azymohliad" fixed the > issue for me. I can log in again. Thanks a lot to Chris and Andrei! > > > > And the user will have to be vigilant about the space usage > > of user home because it's now possible to overcommit. > > I guess it's better to reduce home size (to something like 300-350G in my > case) to decrease the probability of overcommit? > Yes. But mainly you'll just want to keep an eye on the underlying file system free space. If its free space runs out, the home file system won't know this and can try to write anyway, and then we get a pretty icky kind of failure with the home file system, possibly very bad. Of course this is a temporary situation; we need to find a long term solution because it's definitely not intended that the user babysit the two file systems like this. There should be built-in logic to make sure It Just Works. But also in your case you should try to find out why there are shared extents. That's more of a btrfs question so I'll resume that conversation on the btrfs list. -- Chris Murphy
Re: [systemd-devel] [Help] Can't log in to homed user account: "No space left on device"
On Sun, Aug 23, 2020 at 6:47 AM Andrei Borzenkov wrote: > > 23.08.2020 15:34, Andrii Zymohliad wrote: > >> Here is the log after authentication attempt: > >> https://gitlab.com/-/snippets/2007113 > >> And just in case here is the full log since boot: > >> https://gitlab.com/-/snippets/2007112 > > > > Sorry, links are broken, re-uploaded: > > > > Authentication part: https://gitlab.com/-/snippets/2007123 > > Full log: https://gitlab.com/-/snippets/2007124 > > > > Yes, as suspected: > > > сер 23 14:12:48 az-wolf-pc systemd-homed[917]: Not enough disk space > to fully allocate home. > > This comes from > > if (fallocate(backing_fd, FALLOC_FL_KEEP_SIZE, 0, st->st_size) < > 0) { > > ... > if (ERRNO_IS_DISK_SPACE(errno)) { > log_debug_errno(errno, "Not enough disk space to > fully allocate home."); > return -ENOSPC; /* make recognizable */ > } > > return log_error_errno(errno, "Failed to allocate > backing file blocks: %m"); > } > > So fallocate syscall failed. Try manually > > fallocate -l 403G -n /home/azymohliad.home > > if it fails too, the question is better asked on btrfs list. User reports this from 'homectl inspect' LUKS Discard: online=no offline=yes Does this mean 'fstrim' is issued before luksClose? And 'fallocate' is issued before luksOpen? If so, it seems it'd be trivial to run into a fatal attempt to activate, just by deactivating a user home of default size, and then consuming free space on the underlying volume, such that there isn't enough free space to fallocate the home file before opening it again. What am I missing? What it seems homed needs to do if fallocate fails for whatever reason is to have some kind of fallback. Otherwise the user is stuck being unable to open their user home. I suspect that the workaround, until it's figured out why the fallocate fails (I suspect shared extents; there's evidence this home file has been snapshot and I don't know what effect that has on fallocating the file), is to use --luks-discard=true? 
That should avoid the need to fallocate when opening. And the user will have to be vigilant about the space usage of user home because it's now possible to overcommit. -- Chris Murphy
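The fallocate(FALLOC_FL_KEEP_SIZE, ...) call quoted earlier in the thread can be demonstrated in isolation. A Linux-only sketch using ctypes; the temp file and the 1 MiB length are arbitrary:

```python
# Reserve blocks for a file without changing its apparent size, the same
# mode the quoted homed code uses when reopening a LUKS home file.
import ctypes
import ctypes.util
import os
import tempfile

FALLOC_FL_KEEP_SIZE = 0x01     # from linux/falloc.h

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                           ctypes.c_long, ctypes.c_long]
libc.fallocate.restype = ctypes.c_int

fd, path = tempfile.mkstemp()
try:
    if libc.fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 1 << 20) != 0:
        status, apparent_size = "unsupported", None   # fs lacks fallocate
    else:
        # st_size stays 0 even though ~1 MiB of blocks is now reserved
        status, apparent_size = "ok", os.stat(path).st_size
finally:
    os.close(fd)
    os.remove(path)

print(status, apparent_size)
```

This is also why the call can fail with ENOSPC on a nearly full underlying filesystem even though the file's apparent size never changes.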
Re: [systemd-devel] kernel messages not making it to journal
On Thu, Jun 4, 2020 at 5:30 AM Michal Koutný wrote: > > Hi. > > On Mon, Jun 01, 2020 at 07:11:15PM -0600, Chris Murphy > wrote: > > But journalctl does not show it at all. Seems like it might be a bug, > > I expect it to be recorded in the journal, not only found in dmesg. > Journald fetches dmesg messages too (see journald.conf:ReadKMsg=). It's > not clear whether you run journalctl as root or non-privileged user that > may not have access to the system-wide kernel messages. > > If you don't see the messages in journal as root and you can reproduce > it, I suggest you file an issue on Github [1]. It's 100% reproducible. https://github.com/systemd/systemd/issues/16173 -- Chris Murphy
[systemd-devel] kernel messages not making it to journal
dmesg shows this: [ 22.947118] systemd-journald[629]: File /var/log/journal/26336922e1044e80ae4bd42e1d6b9099/user-1000.journal corrupted or uncleanly shut down, renaming and replacing. [ 22.953883] systemd-journald[629]: Creating journal file /var/log/journal/26336922e1044e80ae4bd42e1d6b9099/user-1000.journal on a btrfs file system, and copy-on-write is e But journalctl does not show it at all. Seems like it might be a bug, I expect it to be recorded in the journal, not only found in dmesg. systemd-245.4-1.fc32.x86_64 Thanks, -- Chris Murphy
Re: [systemd-devel] How to use systemd-repart partitions?
On Wed, May 20, 2020 at 4:01 AM Lennart Poettering wrote: > > On Mi, 20.05.20 00:12, Tobias Hunger (tobias.hun...@gmail.com) wrote: > > > > > The one thing that is frustrating is to get a machine image generated > > by my build server onto a new piece of hardware. So I wanted to see > > how far I can get with systemd-repart and co. to get this initial > > deployment to new hardware more automated after booting with the > > machine image from an USB stick. > > So I eventually want to cover three usecases with systemd-repart: > > 1. building OS images > 2. installing an OS from an installer medium onto a host medium > 3. adapting an OS images to the medium it has been dd'ed to on first >boot > > I think the 3rd usecase we already cover quite OK. > > To deliver the others, I want to add Encrypt= and Format= as > mentioned. To cover the 2nd usecase I then also want to provide > CopyBlocks= and CopyFiles=. The former would stream data into a > partition that is created on the block level. Primary usecase would be > to copy a partition 1:1 from the installer medium onto the host > medium. The latter would copy in a file tree on the fs level. Future feature for the former case: - Btrfs seed/sprout feature expressly supports this use case for replicating a seed image when destination is also Btrfs. # mount /dev/seed /mnt # btrfs device add /dev/sprout /mnt # mount -o remount,rw /mnt # btrfs device remove /dev/seed /mnt This results in replication happening, from seed to sprout device. Future feature to consider for the latter case, maybe it's more of an optimization: - ability to create btrfs subvolumes - ability to 'cp -a --reflink' Possibly an unaware installer just copies files over blindly. But then the "repart" task is to create a snapshottable layout after the fact without having to recopy everything. > On first boot, "systemd-repart" would run to add in swap or so, maybe > scaled by RAM size or so, and maybe format and encrypt /home. 
Or add in zram-generator configuration if the generator is available, and not worry about creating a persistent device. -- Chris Murphy
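For what it's worth, the Format=/Encrypt= ideas discussed in this thread later landed as repart.d options. A hypothetical drop-in for the "add swap on first boot" case (names and sizes illustrative; consult repart.d(5) for the version you run):

```
# /usr/lib/repart.d/50-swap.conf
[Partition]
Type=swap
Format=swap
SizeMinBytes=1G
SizeMaxBytes=4G
```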
[systemd-devel] location of user-1000.journal
Hi, I'm wondering if user journals are better being located in ~/.var by default? In particular in a systemd-homed context when ~/ is encrypted. -- Chris Murphy
Re: [systemd-devel] homed, LUKS2 passphrase encoding, and recovery key
Thanks for the answer, it's very useful. When I asked the question, I didn't fully appreciate the cryptographic and anti-forensic capabilities in LUKS that almost certainly should not be re-implemented elsewhere. I'd like to better understand what it would take to support UTF-8 passphrases for LUKS (luksFormat, luksOpen). Consistently and reliably, in a portable user home context. Of course the keyboard could change. The locale could too, and thus the default language of the host system could be different. That's the short version. Everything below this line is a super verbose explanation of how I'm arriving at the above. I assume users want their login passphrase to use local characters. Germans should of course be allowed to create a login passphrase using characters with umlauts; Japanese should of course be allowed to create passphrases using kanji. And so on. I further assume that this same login passphrase is what should be used for 'cryptsetup luksFormat/luksOpen' in order to *avoid* more indirection, and being forced to invent new crypto, which entails a lot of work and risk: security and interoperability. Many users are conditioned to accept a restriction to the 95 printable characters of 7-bit ASCII for a LUKS passphrase. That's because the typical workflow demands volume unlocking in an environment with a limited input stack (initramfs and plymouth). But I assume a global user isn't prepared for, and shouldn't have to accept, such a limitation for their login password. And in a systemd-homed context, if the login password is what's handed off to cryptsetup, then the LUKS passphrase is the same string and likewise cannot be limited to ASCII. So the question comes full circle. What are all of the things that can affect the encoding between the user's passphrase as it exists in their mind, and as handed off to cryptsetup? How to store that metadata? That way it can be tested against, and used to provide the user with helpful hints about why authentication isn't succeeding. 
Right now it's one size fits all. There's no difference in the error between a wrong passphrase (this user is not authentic) and an encoding failure due to a keyboard change, or a keymapping change, or whatever else can affect encoding. Small problem? :D -- Chris Murphy
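One concrete way the "same passphrase, different bytes" failure happens is Unicode normalization. A small self-contained illustration (the example word is arbitrary):

```python
# Two strings that both render as "glück" but differ in code points, and
# therefore in the bytes that would be handed to cryptsetup for hashing.
import unicodedata

composed = "gl\u00fcck"        # U+00FC: precomposed u-with-diaeresis (NFC)
decomposed = "glu\u0308ck"     # 'u' + U+0308 combining diaeresis (NFD)

print(composed == decomposed)                                  # False
print(composed.encode("utf-8") == decomposed.encode("utf-8"))  # False
print(unicodedata.normalize("NFC", decomposed) == composed)    # True
```

An input stack that emits NFC at enrollment and NFD at unlock (or vice versa) produces exactly the indistinguishable "wrong passphrase" error described above.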
[systemd-devel] homed, LUKS2 passphrase encoding, and recovery key
I stumbled onto a LUKS2 keymapping story on the dm-crypt list [1] that nearly ended in user data loss. The two suggestions for how to avoid such problems are to use either ASCII or modhex based passphrases. [2] I'm curious about whether this is something homed can help deal with: users who want to use a single login+encryption passphrase in their native language, keyboard mapping, and character set (likely UTF-8). Or otherwise, enforce limits on the passphrase characters so that the user doesn't unwittingly get stuck unable to access their data or even log in. The implementation details don't really concern me. But I am interested in whether there's a role for homed to play; or some other component; and maybe even time frame for a near term policy vs down the road enhancement. Maybe near term just enforce ASCII or modhex (or make it configurable)? That's very biased against non-Latin languages, which I don't like, but I'd rather see that restriction enforced than users getting dumped on an island where they may very well just give up and lose data over it. A longer term strategy is for homed to add another layer of indirection where the LUKS2 passphrase is random and modhex based, and is itself encrypted, protected by a KEK based on the user's passphrase, where all the things that can affect the encoding of that passphrase are included in the identity metadata. That way if that state differs, the user can be informed or even given a hint what aspect of the system has changed compared to when the passphrase was originally set. Also, somewhat unrelated, is if homed can provide a mechanism for setting up a recovery key for LUKS2 images? I'm not fussy on whether this would or should be something a desktop first boot program (or create new user panel) would opt into, or opt out of. 
I think it'd be sane if homed just always created one, reporting the random passphrase by varlink to the desktop's setup/create user program, and if that program wants to drop this information - well whatever. But the DE could offer to save it to a USB stick, display it for manual recording or screenshotting, or display it as a printable+scannable code. Or perhaps a variation on this for yubikey setup is the option to set up two keys. [1] the setup https://www.saout.de/pipermail/dm-crypt/2019-December/006279.html the cause https://www.saout.de/pipermail/dm-crypt/2019-December/006283.html [2] modhex https://www.saout.de/pipermail/dm-crypt/2019-December/006285.html ASCII, although actually personally excludes upper case https://www.saout.de/pipermail/dm-crypt/2019-December/006287.html -- Chris Murphy
Re: [systemd-devel] perform fsck on everyt boot
On Wed, Nov 20, 2019, 11:58 PM Belisko Marek wrote: > On Thu, Nov 21, 2019 at 7:25 AM Chris Murphy > wrote: > > > > On Tue, Nov 12, 2019 at 3:52 AM Belisko Marek > wrote: > > > > > > On Mon, Nov 11, 2019 at 4:47 PM Lennart Poettering > > > wrote: > > > > > > > > On Mo, 11.11.19 13:33, Belisko Marek (marek.beli...@gmail.com) > wrote: > > > > > Hi, > > > > > > > > > > I'm using systemd 234 (build by yocto) and I've setup automount of > > > > > sdcard in fstab. This works perfectly fine. But I have seen from time > > > > > to time when system goes to emergency mode because sdcard > filesystem > > > > > (ext4) has an issue and cannot be mounted. I was thinking about > > > > > forcing fsck for every boot. Reading manual it should be enough to > set > > > > > passno (6th column in fstab) to anything higher than 0. I set it > to 2 > > > > > but inspecting logs it doesn't seem fsck is performed. Am I still > > > > > missing something? Thanks. > > > > > > > > Well, note that ext4's fsck only does an actual file system check > > > > every now and then. Hence: how did you determine fsck wasn't started? > > > > > > > > Do you see the relevant fsck in "systemctl -t service | grep > > > > systemd-fsck@"? > > > I just saw in log: > > > [ OK ] Found device /dev/mmcblk1p1. > > > Mounting /mnt/sdcard... > > > [8.339072] EXT4-fs (mmcblk1p1): VFS: Found ext4 filesystem with > > > invalid superblock checksum. Run e2fsck? > > > [FAILED] Failed to mount /mnt/sdcard. > > > > This isn't normal. Your effort should be on finding out why this > > problem is happening in the first place. This doesn't strike me as the > > (somewhat) ordinary case of unclean unmount, which results in journal > > replay at next mount attempt. But something considerably more serious. > Problem is it's very hard to reproduce, and this is not rootfs, just an > external SDcard for storing some data. 
> If I hit this system goes to emergency mode and device is dead and I > would like to prevent that in the first place. > IMO fsck should help to recover this issue and should continue without > issues. Thanks. > Possibly adding the nofail option in fstab will prevent startup from going into rescue.target. I'm skeptical of unattended use of fsck. That's what journal replay is for, and if replay can't fix the problem, then the underlying problem needs to be fixed rather than papered over with fsck. You might consider testing this SDCard with f3, which will check for corruption and fake flash. Reformat, mount, f3write /mountpoint, f3read /mountpoint. I don't trust consumer SDCards for anything. I've had name brand stuff fail. The systemd journal should show evidence of either umount success or failure for this SDCard on restart or shutdown. Do the corruptions only happen on shutdown, or on both shutdown and restart? SDCards can get really fussy, exhibiting corruptions, or just brick themselves, when power is removed while writes are still happening internally. Cheaper flash may be slower to flush to stable media. You can give it more time by manually unmounting this SDCard before reboot or shutdown. Chris Murphy
Re: [systemd-devel] perform fsck on everyt boot
On Tue, Nov 12, 2019 at 3:52 AM Belisko Marek wrote: > > On Mon, Nov 11, 2019 at 4:47 PM Lennart Poettering > wrote: > > > > On Mo, 11.11.19 13:33, Belisko Marek (marek.beli...@gmail.com) wrote: > > > Hi, > > > > > > I'm using systemd 234 (build by yocto) and I've setup automount of > > > sdcard in fstab. This works perfectly fine. But I have seen from time > > > to time when system goes to emergency mode because sdcard filesystem > > > (ext4) has an issue and cannot be mounted. I was thinking about > > > forcing fsck for every boot. Reading manual it should be enough to set > > > passno (6th column in fstab) to anything higher than 0. I set it to 2 > > > but inspecting logs it doesn't seem fsck is performed. Am I still > > > missing something? Thanks. > > > > Well, note that ext4's fsck only does an actual file system check > > every now and then. Hence: how did you determine fsck wasn't started? > > > > Do you see the relevant fsck in "systemctl -t service | grep > > systemd-fsck@"? > I just saw in log: > [ OK ] Found device /dev/mmcblk1p1. > Mounting /mnt/sdcard... > [8.339072] EXT4-fs (mmcblk1p1): VFS: Found ext4 filesystem with > invalid superblock checksum. Run e2fsck? > [FAILED] Failed to mount /mnt/sdcard. This isn't normal. Your effort should be on finding out why this problem is happening in the first place. This doesn't strike me as the (somewhat) ordinary case of unclean unmount, which results in journal replay at next mount attempt. But something considerably more serious. -- Chris Murphy
Re: [systemd-devel] RFC: luksSuspend support in sleep/sleep.c
On Fri, Nov 1, 2019 at 2:31 PM Matthew Garrett wrote: > > The initrd already contains a UI stack in order to permit disk unlock > at boot time, so this really doesn't seem like a problem? It's a very small and limited UI stack. According to the GNOME developers I've discussed it with, this environment has all kinds of a11y, i18n, and keymapping problems. Solving them means either baking a significant portion of the GNOME stack into the initramfs, or some kind of magic, because the resources don't exist to create a minimized GNOME stack that could achieve this. And so far the effort has been to make the initramfs smaller and more generic. I have no idea how either Apple or Microsoft solves this problem. -- Chris Murphy
Re: [systemd-devel] RFC: luksSuspend support in sleep/sleep.c
On Thu, Oct 31, 2019 at 4:55 PM Lennart Poettering wrote: > Hmm, so far this all just worked for me, I didn't run into any trouble > with suspending just $HOME? What about /var and /home sharing the same volume? I'm pretty sure the default layout for Fedora Silverblue is a separate var volume, mounted at /var, with /var/home bind mounted to /home. -- Chris Murphy
[systemd-devel] systemd-analyze shows long firmware time with sd-boot
Hi, systemd-243-2.gitfab6f01.fc31.x86_64 systemd-analyze reports very high firmware times, seemingly related to uptime since the last cold boot. That is, if I poweroff then boot, the time for firmware seems reasonable; whereas after reboots, the firmware value alone appears to accumulate the elapsed uptime. This is in a qemu-kvm VM, using edk2-ovmf-20190501stable-4.fc31.noarch reboot (warm boot) $ systemd-analyze Startup finished in 8min 2.506s (firmware) + 12ms (loader) + 1.414s (kernel) + 1.383s (initrd) + 8.962s (userspace) = 8min 14.278s graphical.target reached after 8.946s in userspace [chris@localhost-live ~]$ # systemd-analyze Startup finished in 18min 13.786s (firmware) + 11ms (loader) + 1.420s (kernel) + 1.361s (initrd) + 8.855s (userspace) = 18min 25.434s graphical.target reached after 8.840s in userspace [root@localhost-live ~]# # systemd-analyze Startup finished in 20min 22.976s (firmware) + 11ms (loader) + 1.410s (kernel) + 1.360s (initrd) + 8.801s (userspace) = 20min 34.560s graphical.target reached after 8.790s in userspace [root@localhost-live ~]# # systemd-analyze Startup finished in 51min 8.018s (firmware) + 12ms (loader) + 1.415s (kernel) + 1.370s (initrd) + 8.836s (userspace) = 51min 19.653s graphical.target reached after 8.821s in userspace [root@localhost-live ~]# poweroff+boot (cold boot) # systemd-analyze Startup finished in 2.402s (firmware) + 15ms (loader) + 1.498s (kernel) + 1.358s (initrd) + 8.723s (userspace) = 13.998s graphical.target reached after 8.709s in userspace [root@localhost-live ~]# -- Chris Murphy
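Sanity-checking the report format: the printed total is just the sum of the stages, so the inflated totals come entirely from the firmware term. A quick arithmetic check of the first warm-boot line (plain arithmetic, nothing systemd-specific):

```python
# stage durations from the first warm-boot report above, in seconds
stages = {
    "firmware": 8 * 60 + 2.506,
    "loader": 0.012,
    "kernel": 1.414,
    "initrd": 1.383,
    "userspace": 8.962,
}

total = sum(stages.values())
reported_total = 8 * 60 + 14.278  # "= 8min 14.278s"

# the stage sum matches the reported total to within rounding
print(abs(total - reported_total) < 0.01)  # True
```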
Re: [systemd-devel] sd-boot kickstart
On Tue, Oct 1, 2019 at 1:05 AM Damian Ivanov wrote: > > Hello, > > I watched the video and presentation > https://cfp.all-systems-go.io/media/sdboot-asg2019.pdf > I could not agree more! Anaconda/Kickstart install grub as the > bootloader. Is there some hidden option to use sd-boot instead or is > it necessary to install sd-boot manually after the OS is deployed? I only see extlinux as an alternative to grub: https://pykickstart.readthedocs.io/en/latest/ I think it's a question for Anaconda developers how to support sd-boot: https://www.redhat.com/mailman/listinfo/anaconda-devel-list -- Chris Murphy
[systemd-devel] systemd243rc2, sysd-coredump is not triggered on segfaults
Maybe it's something unique to gnome-shell segfaults; that's the only thing I have crashing right now. But I've got a pretty good reproducer that gets it to crash, and I never have any listings with coredumpctl. process segfaults but systemd-coredump does not capture it https://bugzilla.redhat.com/show_bug.cgi?id=1748145 -- Chris Murphy
Re: [systemd-devel] systemd backlight:acpi_video0 fails, no such device
On Mon, Sep 2, 2019 at 1:56 AM Hans de Goede wrote: > > Hi, > > On 02-09-19 07:17, Mantas Mikulėnas wrote: > > On Mon, Sep 2, 2019 at 7:34 AM Chris Murphy > <mailto:li...@colorremedies.com>> wrote: > > > > systemd-243~rc2-2.fc31.x86_64 > > kernel-5.3.0-0.rc6.git1.1.fc32.x86_64 > > > > This might be a regression, at least I don't remember this happening > > before. I can use the expected keys for built-in display brightness, > > and built-in keyboard brightness. But the service unit fails with an > > out of the box installation. > > > > > > [chris@fmac ~]$ sudo systemctl status > > systemd-backlight@backlight:acpi_video0.service > > ● systemd-backlight@backlight:acpi_video0.service - Load/Save Screen > > Backlight Brightness of backlight:acpi_video0 > > Loaded: loaded (/usr/lib/systemd/system/systemd-backlight@.service; > > static; vendor preset: disabled) > > Active: failed (Result: exit-code) since Sun 2019-09-01 19:57:37 > > MDT; 8min ago > > Docs: man:systemd-backlight@.service(8) > >Process: 667 ExecStart=/usr/lib/systemd/systemd-backlight load > > backlight:acpi_video0 (code=exited, status=1/FAILURE) > > Main PID: 667 (code=exited, status=1/FAILURE) > > > > Sep 01 19:57:37 fmac.local systemd[1]: Starting Load/Save Screen > > Backlight Brightness of backlight:acpi_video0... > > Sep 01 19:57:37 fmac.local systemd-backlight[667]: Failed to get > > backlight or LED device 'backlight:acpi_video0': No such device > > Sep 01 19:57:37 fmac.local systemd[1]: > > systemd-backlight@backlight:acpi_video0.service: Main process exited, > > code=exited, status=1/FAILURE > > Sep 01 19:57:37 fmac.local systemd[1]: > > systemd-backlight@backlight:acpi_video0.service: Failed with result > > 'exit-code'. > > Sep 01 19:57:37 fmac.local systemd[1]: Failed to start Load/Save > > Screen Backlight Brightness of backlight:acpi_video0. 
> > [chris@fmac ~]$ > > > > # find /sys -name "*video0*" > > /sys/class/video4linux/video0 > > /sys/devices/pci:00/:00:1a.7/usb1/1-2/1-2:1.0/video4linux/video0 > > # ls -l /sys/class/backlight/ > > total 0 > > lrwxrwxrwx. 1 root root 0 Sep 1 19:57 gmux_backlight -> > > ../../devices/pnp0/00:03/backlight/gmux_backlight > > lrwxrwxrwx. 1 root root 0 Sep 1 19:57 intel_backlight -> > > > > ../../devices/pci:00/:00:02.0/drm/card0/card0-LVDS-1/intel_backlight > > > > > > Could it be that acpi_backlight is loaded at first, but gets replaced by > > intel_backlight before systemd could react? > > Maybe, the gmux_backlight suggests that this is a macbook. It is a 2011 MacBook Pro. -- Chris Murphy
[systemd-devel] systemd backlight:acpi_video0 fails, no such device
systemd-243~rc2-2.fc31.x86_64 kernel-5.3.0-0.rc6.git1.1.fc32.x86_64 This might be a regression, at least I don't remember this happening before. I can use the expected keys for built-in display brightness, and built-in keyboard brightness. But the service unit fails with an out of the box installation. [chris@fmac ~]$ sudo systemctl status systemd-backlight@backlight:acpi_video0.service ● systemd-backlight@backlight:acpi_video0.service - Load/Save Screen Backlight Brightness of backlight:acpi_video0 Loaded: loaded (/usr/lib/systemd/system/systemd-backlight@.service; static; vendor preset: disabled) Active: failed (Result: exit-code) since Sun 2019-09-01 19:57:37 MDT; 8min ago Docs: man:systemd-backlight@.service(8) Process: 667 ExecStart=/usr/lib/systemd/systemd-backlight load backlight:acpi_video0 (code=exited, status=1/FAILURE) Main PID: 667 (code=exited, status=1/FAILURE) Sep 01 19:57:37 fmac.local systemd[1]: Starting Load/Save Screen Backlight Brightness of backlight:acpi_video0... Sep 01 19:57:37 fmac.local systemd-backlight[667]: Failed to get backlight or LED device 'backlight:acpi_video0': No such device Sep 01 19:57:37 fmac.local systemd[1]: systemd-backlight@backlight:acpi_video0.service: Main process exited, code=exited, status=1/FAILURE Sep 01 19:57:37 fmac.local systemd[1]: systemd-backlight@backlight:acpi_video0.service: Failed with result 'exit-code'. Sep 01 19:57:37 fmac.local systemd[1]: Failed to start Load/Save Screen Backlight Brightness of backlight:acpi_video0. [chris@fmac ~]$ # find /sys -name "*video0*" /sys/class/video4linux/video0 /sys/devices/pci:00/:00:1a.7/usb1/1-2/1-2:1.0/video4linux/video0 # ls -l /sys/class/backlight/ total 0 lrwxrwxrwx. 1 root root 0 Sep 1 19:57 gmux_backlight -> ../../devices/pnp0/00:03/backlight/gmux_backlight lrwxrwxrwx. 
1 root root 0 Sep 1 19:57 intel_backlight -> ../../devices/pci:00/:00:02.0/drm/card0/card0-LVDS-1/intel_backlight # find /sys -name "*acpi*" /sys/kernel/debug/acpi /sys/bus/platform/drivers/acpi-fan /sys/bus/platform/drivers/axp288_pmic_acpi /sys/bus/acpi /sys/bus/acpi/drivers/acpi_als /sys/firmware/acpi /sys/module/rtc_cmos/parameters/use_acpi_alarm /sys/module/acpi_als /sys/module/industrialio/holders/acpi_als /sys/module/pci_hotplug/parameters/debug_acpi /sys/module/kfifo_buf/holders/acpi_als /sys/module/acpiphp /sys/module/libata/parameters/noacpi /sys/module/libata/parameters/acpi_gtf_filter /sys/module/acpi /sys/module/acpi/parameters/acpica_version OK, so maybe the expected hook for discovering the brightness, in order to save or load it, simply isn't there? *shrug* -- Chris Murphy
[systemd-devel] shutdown on service unit timeout?
Hi, Is it possible for a systemd service file to ask for a poweroff upon service timeout? If not, could it be done, or is there an alternative? Here's the use case: No Screensaver/Powerdown after Inactivity at LUKS Password Prompt https://bugzilla.redhat.com/show_bug.cgi?id=1742953 The summary is: plymouth waits indefinitely with a prompt for a passphrase, which leads to excessive power consumption, including battery drain if it's a laptop (it'll wait until the battery dies), and screen burn-in. This can happen unattended if e.g. Fedora is the default boot, but the user dual boots Windows, which has a tendency to wake up, do updates at "offline" times, and reboot... into Fedora, where it waits indefinitely for a LUKS passphrase. I'm sure there are other examples. Plausibly anything that hangs during startup would have this behavior; only once we're at gdm (or the equivalent on other DEs) is there a timer that will at least blank the screen, and possibly also optionally trigger suspend to RAM. Alternatively, instead of a per-service opt-in method, systemd itself could opt into a "power off after X minutes unless Y process reports it started successfully" type of behavior. In any case, it's up to the distro to decide the policy, with a way for the user to opt out by setting the applicable timeout value to something like 0, to indicate they really want an indefinite wait. Thanks, -- Chris Murphy
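systemd already has a per-unit mechanism that comes close to this: JobTimeoutSec= and JobTimeoutAction= in systemd.unit(5), where the action can be poweroff. Whether it cleanly covers the plymouth/cryptsetup case I don't know, but a distro-policy sketch might look like the following drop-in (the path and the 30min value are illustrative, not a tested configuration):

```
# hypothetical drop-in: /etc/systemd/system/systemd-cryptsetup@.service.d/timeout-poweroff.conf
[Unit]
# If the unlock job has not completed after 30 minutes, power off.
JobTimeoutSec=30min
JobTimeoutAction=poweroff
# A user who wants the indefinite wait back would set JobTimeoutSec=infinity.
```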
Re: [systemd-devel] startup hang at 'load/save random seed'
[ 10.281769] fmac.local systemd[1]: Starting Update UTMP about System Boot/Shutdown... [ 10.295504] fmac.local audit[806]: SYSTEM_BOOT pid=806 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg=' comm="systemd-update-utmp" exe="/usr/lib/systemd/systemd-update-utmp" hostname=? addr=? terminal=? res=success' [ 10.305289] fmac.local systemd[1]: Started Update UTMP about System Boot/Shutdown. [ 10.305527] fmac.local audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-update-utmp comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success' [ 15.264423] fmac.local systemd[1]: systemd-rfkill.service: Succeeded. [ 15.268231] fmac.local audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-rfkill comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success' [ 286.296649] fmac.local kernel: random: crng init done [ 286.301223] fmac.local kernel: random: 7 urandom warning(s) missed due to ratelimiting [ 286.319857] fmac.local systemd[1]: Started Load/Save Random Seed. [ 286.322850] fmac.local audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-random-seed comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success' [ 286.323576] fmac.local systemd[1]: Reached target System Initialization. I don't know why there's ratelimiting on urandom warnings, I have printk.devkmsg=on This also seems relevant. [chris@fmac ~]$ sudo journalctl -b -o short-monotonic | grep -i seed [8.870985] fmac.local systemd[1]: Starting Load/Save Random Seed... [9.021818] fmac.local systemd-random-seed[619]: Kernel entropy pool is not initialized yet, waiting until it is. [ 286.319857] fmac.local systemd[1]: Started Load/Save Random Seed. 
[ 286.322850] fmac.local audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-random-seed comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success' [chris@fmac ~]$ -- Chris Murphy
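From the short-monotonic timestamps in the excerpt above, the wait for the entropy pool works out to about 4.6 minutes, which is consistent with the 4-5 minute hang described in this thread (trivial arithmetic on the two log lines):

```python
# monotonic timestamps (seconds) copied from the journal excerpt above
waiting_since = 9.021818     # "Kernel entropy pool is not initialized yet, waiting..."
crng_init_done = 286.296649  # "random: crng init done"

stall = crng_init_done - waiting_since
print(f"{stall:.1f} s = {stall / 60:.1f} min blocked on entropy")  # 277.3 s = 4.6 min
```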
[systemd-devel] startup hang at 'load/save random seed'
This is a new problem I'm seeing just today on Fedora Rawhide 5.3.0-0.rc3.git0.1.fc31.x86_64+debug systemd-243~rc1-1.fc31.x86_64 The problem doesn't happen when reverting to systemd-242-6.git9d34e79.fc31.x86_64 The hang lasts about 4-5 minutes, then boot proceeds. Or if I head to the early-debug shell and start typing, it almost immediately clears up and boot proceeds. -- Chris Murphy
Re: [systemd-devel] systemd.journald.forward_to doesn't forward all journal messages
On Thu, Aug 1, 2019 at 12:43 AM Stefan Tatschner wrote: > > On Wed, 2019-07-31 at 16:27 -0600, Chris Murphy wrote: > > The Obi Wan quote seems to apply here. "Who's the more foolish, the > > fool, or the fool who follows him?" > > > > You're invited to stop following me. > > Calm down folks. I am certain he wanted to be sure that's not an xy > problem: http://xyproblem.info/ > > Sometimes it makes sense to ask these questions in order to potentially > save everyone's time. Maybe without the "ass" word, but anyway… Could be. X = systemd-journald captured and recorded messages in system.journal; for any number of reasons this might not be available if early boot/startup crashes. Y = what I'd accept as a fallback to 'journalctl' is the complete "pretty" version of what systemd records, obtained by forwarding the journal to either kmsg or the console. The problem I describe is that the contents of the forwarding don't match up with what's in the journal, and it seems like a bug or missing functionality that so many messages (roughly 1/3) just don't get forwarded. For example, set rd.debug, and then boot. Only a few early dracut debug messages are forwarded to console or kmsg; the overwhelming bulk of rd.debug messages are simply dropped. But they're in the journal. So if I don't have the journal, and they're not forwarded, it's a problem to debug an early startup failure. And I freely admit that it's rare indeed that both the journal is unavailable and there are no useful hints at all forwarded to console or kmsg. But I'm not even sure whether it's expected that most of the journal messages should be forwarded with the forward parameter. -- Chris Murphy
Re: [systemd-devel] systemd.journald.forward_to doesn't forward all journal messages
On Wed, Jul 31, 2019 at 12:30 PM Greg Oliver wrote: > I do not mean to sound like an ass here - especially since you have spent > hours of time ripping this stuff apart on Fedora live media (I actually > thought initially since you were digging so deep, you worked there), but I > mean really - what is the point? That you do not understand the point or lack the imagination for a point doesn't mean there isn't a point. > Are you planning on using a Fedora Live Media image as production? If so, > you clearly have not encountered all the stuff that usually does not work on > live media and should create your own anyhow. The forwarding problem is reproducible on installed systems. > I apologize, but I just do not get the effort spent here by anyone. The Obi Wan quote seems to apply here. "Who's the more foolish, the fool, or the fool who follows him?" You're invited to stop following me. -- Chris Murphy
Re: [systemd-devel] systemd.journald.forward_to doesn't forward all journal messages
On Mon, Jul 29, 2019 at 1:26 AM Lennart Poettering wrote: > > On So, 28.07.19 22:11, Chris Murphy (li...@colorremedies.com) wrote: > > > Using either of the following: > > > > systemd.log_level=debug systemd.journald.forward_to_kmsg log_buf_len=8M > > > > systemd.log_level=debug systemd.log_target=kmsg log_buf_len=8M > > Note that this is not sufficient. You also have to pass > "printk.devkmsg=on" too, otherwise the kernel ratelimits log output > from usperspace ridiculously a lot, and you will see lots of dropped > messages. > > I have documented this now here: > > https://github.com/systemd/systemd/pull/13208 BOOT_IMAGE=/images/pxeboot/vmlinuz root=live:CDLABEL=Fedora-WS-Live-rawh-20190728-n-1 rd.live.image systemd.wants=zram-swap.service systemd.log_level=debug systemd.journald.forward_to_kmsg log_buf_len=8M printk.devkmsg=on Many messages I see in the journal still do not appear in kmsg. For example from /dev/kmsg 6,20619,201107529,-;zram: Cannot change disksize for initialized device 12,23154,208596765,-;org.fedoraproject.Anaconda.Modules.Network[2498]: DEBUG:anaconda.modules.network.network:Applying boot options KernelArguments([('BOOT_IMAGE', '/images/pxeboot/vmlinuz'), ('root', 'live:CDLABEL=Fedora-WS-Live-rawh-20190728-n-1'), ('rd.live.image', None), ('systemd.wants', 'zram-swap.service'), ('systemd.log_level', 'debug'), ('systemd.journald.forward_to_kmsg', None), ('log_buf_len', '8M'), ('printk.devkmsg', 'on')]) 12,25049,210822858,-;org.fedoraproject.Anaconda.Modules.Storage[2498]: DEBUG:anaconda.modules.storage.disk_selection.selection:Protected devices are set to '['/dev/zram0']'. ^C [root@localhost-live liveuser]# journalctl -o short-monotonic | grep zram [ 203.224915] localhost-live systemd[1477]: Added job dev-zram0.device/nop to transaction. 
[ 203.225017] localhost-live systemd[1477]: dev-zram0.device: Installed new job dev-zram0.device/nop as 295 [ 203.225143] localhost-live systemd[1477]: Added job sys-devices-virtual-block-zram0.device/nop to transaction. [ 203.225245] localhost-live systemd[1477]: sys-devices-virtual-block-zram0.device: Installed new job sys-devices-virtual-block-zram0.device/nop as 296 [ 203.225355] localhost-live systemd[1477]: sys-devices-virtual-block-zram0.device: Job 296 sys-devices-virtual-block-zram0.device/nop finished, result=done [ 203.225570] localhost-live systemd[1477]: dev-zram0.device: Job 295 dev-zram0.device/nop finished, result=done [ 208.959944] localhost-live systemd[1477]: Added job dev-zram0.device/nop to transaction. [ 208.961015] localhost-live systemd[1477]: dev-zram0.device: Installed new job dev-zram0.device/nop as 340 [ 208.961324] localhost-live systemd[1477]: Added job sys-devices-virtual-block-zram0.device/nop to transaction. [ 208.961508] localhost-live systemd[1477]: sys-devices-virtual-block-zram0.device: Installed new job sys-devices-virtual-block-zram0.device/nop as 341 [ 208.961789] localhost-live systemd[1477]: sys-devices-virtual-block-zram0.device: Job 341 sys-devices-virtual-block-zram0.device/nop finished, result=done [ 208.962021] localhost-live systemd[1477]: dev-zram0.device: Job 340 dev-zram0.device/nop finished, result=done [ 209.822448] localhost-live systemd[1477]: Added job dev-zram0.device/nop to transaction. [ 209.822625] localhost-live systemd[1477]: dev-zram0.device: Installed new job dev-zram0.device/nop as 377 [ 209.822757] localhost-live systemd[1477]: Added job sys-devices-virtual-block-zram0.device/nop to transaction. 
[ 209.822861] localhost-live systemd[1477]: sys-devices-virtual-block-zram0.device: Installed new job sys-devices-virtual-block-zram0.device/nop as 378 [ 209.822983] localhost-live systemd[1477]: sys-devices-virtual-block-zram0.device: Job 378 sys-devices-virtual-block-zram0.device/nop finished, result=done [ 209.823106] localhost-live systemd[1477]: dev-zram0.device: Job 377 dev-zram0.device/nop finished, result=done [ 213.866820] localhost-live anaconda[2490]: blivet: DeviceTree.get_device_by_path: path: /dev/zram0 ; incomplete: False ; hidden: False ; [ 213.868392] localhost-live anaconda[2490]: blivet: failed to resolve '/dev/zram0' [root@localhost-live liveuser]# Literally zero of those lines appear in kmsg: 6,20619,201107529,-;zram: Cannot change disksize for initialized device 12,23154,208596765,-;org.fedoraproject.Anaconda.Modules.Network[2498]: DEBUG:anaconda.modules.network.network:Applying boot options KernelArguments([('BOOT_IMAGE', '/images/pxeboot/vmlinuz'), ('root', 'live:CDLABEL=Fedora-WS-Live-rawh-20190728-n-1'), ('rd.live.image', None), ('systemd.wants', 'zram-swap.service'), ('systemd.log_level', 'debug'), ('systemd.journald.forward_to_kmsg', None), ('log_buf_len', '8M'), ('printk.devkmsg', 'on')]) 12,25049,210822858,-;org.fedoraproject.Anaconda.Modules.Storage[2498]: DEBUG:anaconda.modules.storage.disk_selection.selection:Protected devices are set to '['/dev/zram0']'.
[systemd-devel] systemd.journald.forward_to doesn't forward all journal messages
Using either of the following: systemd.log_level=debug systemd.journald.forward_to_kmsg log_buf_len=8M systemd.log_level=debug systemd.log_target=kmsg log_buf_len=8M There's quite a lot of messages in the journal, but not in kmsg. As in, so many missing messages that the feature is nearly useless for debugging. Is it expected, or should I file a bug? In fact, when I use systemd.log_target=kmsg there are messages missing from both the journal and kmsg; when I do not use that parameter, the expected messages are in the journal. So it's like something is trying to forward a subset of messages to kmsg, but then they get dropped? I don't know how to debug this... -- Chris Murphy
Re: [systemd-devel] journald deleting logs on LiveOS boots
On Sun, Jul 21, 2019 at 11:48 PM Ulrich Windl wrote: > > >>> Chris Murphy schrieb am 18.07.2019 um 17:55 in > Nachricht > : > > On Thu, Jul 18, 2019 at 4:50 AM Uoti Urpala wrote: > >> > >> On Mon, 2019-07-15 at 14:32 -0600, Chris Murphy wrote: > >> > So far nothing I've tried gets me access to information that would > >> > give a hint why systemd-journald thinks there's no free space and yet > >> > it still decides to create a single 8MB system journal, which then > >> > almost immediately gets deleted, including all the evidence up to that > >> > point. > >> > >> Run journald under strace and check the results of the system calls > >> used to query space? (One way to run it under strace would be to change > >> the unit file to use "strace -D -o /run/output systemd-journald" as the > >> process to start.) > > > > It's a good idea but strace isn't available on Fedora live media. So I > > either have to learn how to create a custom live media locally (it's a > > really complicated process) or convince Fedora to add strace to live > > media... > > Wouldn't it be easier to scp the binary from a compatible system? What binary? The problem with the strace idea is that /run/output is 0 length. There's nothing to scp. For whatever reason strace creates /run/output but isn't writing anything to it. But based on this PR it looks like some aspect of this problem is understood by systemd developers. https://github.com/systemd/systemd/pull/13120 -- Chris Murphy
Re: [systemd-devel] journald deleting logs on LiveOS boots
On Fri, Jul 19, 2019 at 8:45 AM Uoti Urpala wrote: > > On Thu, 2019-07-18 at 21:52 -0600, Chris Murphy wrote: > > # df -h > > ... > > /dev/mapper/live-rw 6.4G 5.7G 648M 91% / > > > > And in the log: > > 47,19636,16754831,-;systemd-journald[905]: Fixed min_use=1.0M > > max_use=648.7M max_size=81.0M min_size=512.0K keep_free=973.1M > > n_max_files=100 > > > > Why is keep_free bigger than available free space? Is that the cause > > of the vacuuming? > > The default value for keep_free is the smaller of 4 GiB or 15% of total > filesystem size. Since the filesystem is small and has less than 15% > free, it's already over the default limit. Those defaults are defined > in src/journal/journal_file.c. When over the limit, journald still uses > at least DEFAULT_MIN_USE (increased to the initial size of journals on > disk if any). But it looks suspicious that this is 1 MiB while > FILE_SIZE_INCREASE is 8 MiB - doesn't this imply that any use at all > immediately goes over 1 MiB? > > You can probably work around the issue by setting a smaller > SystemKeepFree in journald.conf. I don't really need a workaround; this is live media and that file will be read-only. What I need is to understand exactly why the journal vacuuming is being triggered. Either that's simply not being recorded in the journal in the first place, and must be inferred from esoteric knowledge, or the reason is in the journal that has been vacuumed, and is thus lost. -- Chris Murphy
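Uoti's description of the defaults does reproduce the numbers in the log line. A small sketch (the exact filesystem size here is back-computed from the log's keep_free value, since df's "6.4G" is rounded):

```python
GIB = 1 << 30
MIB = 1 << 20

def default_keep_free(fs_size_bytes):
    # journald default per the description above: the smaller of
    # 4 GiB or 15% of the total filesystem size
    return min(4 * GIB, fs_size_bytes * 15 // 100)

fs_size = int(6487.3 * MIB)  # back-computed: keep_free / 0.15
available = 648 * MIB        # "648M" available per df

keep_free = default_keep_free(fs_size)
print(round(keep_free / MIB, 1))  # ~973.1, matching "keep_free=973.1M"
print(available < keep_free)      # True: free space is already below keep_free
```

So on this 91%-full live filesystem, journald starts out over its keep-free limit, which is consistent with the immediate vacuuming.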
Re: [systemd-devel] journald deleting logs on LiveOS boots
This is suspicious: # df -h ... /dev/mapper/live-rw 6.4G 5.7G 648M 91% / And in the log: 47,19636,16754831,-;systemd-journald[905]: Fixed min_use=1.0M max_use=648.7M max_size=81.0M min_size=512.0K keep_free=973.1M n_max_files=100 Why is keep_free bigger than available free space? Is that the cause of the vacuuming? 47,19867,16908013,-;systemd-journald[905]: /var/log/journal/05d0a9c86a0e4bbcb36c5e0082b987ee/system.journal: Allocation limit reached, rotating. 47,19868,16908029,-;systemd-journald[905]: Rotating... And then 47,27860,22417049,-;systemd-journald[905]: Vacuuming... 47,27861,22427712,-;systemd-journald[905]: Deleted archived journal /var/log/journal/05d0a9c86a0e4bbcb36c5e0082b987ee/system@daa6e38474b84afc8404527c7b204c24-0001-00058dff129d34ae.journal (8.0M). 47,27862,22427724,-;systemd-journald[905]: Vacuuming done, freed 8.0M of archived journals from /var/log/journal/05d0a9c86a0e4bbcb36c5e0082b987ee. That vacuuming event is the direct cause of the data loss. But does it happen because keep_free is greater than free space, and if so then why is keep_free greater than free space? Slightly off topic but why are there over 18000 (that's not a typo) of these identical lines? It seems excessive. 47,27869,22428300,-;systemd-journald[905]: Journal effective settings seal=no compress=yes compress_threshold_bytes=512B I've updated the bug report to include an attachment of the journal I've captured by forwarding to kmsg. https://bugzilla.redhat.com/show_bug.cgi?id=1715699#c17 Chris Murphy
Re: [systemd-devel] journald deleting logs on LiveOS boots
On Thu, Jul 18, 2019 at 4:36 PM Greg Oliver wrote: > I am assuming you have ripped apart the initramfs to see exactly how RedHat > is invoking systemd in the live images? I have not. No idea what I'd be looking for. # ps aux | grep systemd Installed system (Fedora 30): root 1 0.0 0.1 171444 14696 ?Ss 12:05 0:09 /usr/lib/systemd/systemd --switched-root --system --deserialize 18 Live system (Rawhide): root 1 0.8 0.4 171516 14708 ?Ss 19:57 0:07 /usr/lib/systemd/systemd --switched-root --system --deserialize 32 -- Chris Murphy
Re: [systemd-devel] journald deleting logs on LiveOS boots
On Thu, Jul 18, 2019 at 4:02 PM Chris Murphy wrote: > > On Thu, Jul 18, 2019 at 10:18 AM Dave Howorth wrote: > > > > On Thu, 18 Jul 2019 09:55:51 -0600 > > Chris Murphy wrote: > > > On Thu, Jul 18, 2019 at 4:50 AM Uoti Urpala > > > wrote: > > > > > > > > On Mon, 2019-07-15 at 14:32 -0600, Chris Murphy wrote: > > > > > So far nothing I've tried gets me access to information that would > > > > > give a hint why systemd-journald thinks there's no free space and > > > > > yet it still decides to create a single 8MB system journal, which > > > > > then almost immediately gets deleted, including all the evidence > > > > > up to that point. > > > > > > > > Run journald under strace and check the results of the system calls > > > > used to query space? (One way to run it under strace would be to > > > > change the unit file to use "strace -D -o /run/output > > > > systemd-journald" as the process to start.) > > > > > > It's a good idea but strace isn't available on Fedora live media. So I > > > either have to learn how to create a custom live media locally (it's a > > > really complicated process) or convince Fedora to add strace to live > > > media... > > > > I'm not a fedora user, but I don't think it's that difficult to run > > strace. > > > > To run it once, start your live image and type: > > > > # yum install strace > > > > You will need to reinstall it if you reboot. > > > > To permanently install it apparently you need to configure your USB > > with persistent storage. I haven't looked up how to do that. > > I thought about that, but this is a substantial alteration from the > original ISO in terms of the storage layout and how everything gets > assembled. But it's worth a shot. If it is a systemd bug, then it > should still reproduce. If it doesn't reproduce, then chances are it's > some kind of assembly related problem. > > Still seems like a systemd-journald bug that neither forward to > console nor to kmsg includes any useful systemd or dracut debugging. 
I used livecd-iso-to-disk to create the persistent boot media, and the problem does reproduce. But I needed a custom initramfs that has the modified systemd-journald.service unit file in it, as well as strace, because this problem happens that early in the startup process. Thing is, that custom initramfs blows up trying to assemble the persistent overlay. Looking at the compose logs for this image, I see a rather complex dracut command is needed: dracut --nomdadmconf --nolvmconf --xz --add livenet dmsquash-live convertfs pollcdrom qemu qemu-net --omit plymouth --no-hostonly --debug --no-early-microcode --force boot/initramfs-5.3.0-0.rc0.git4.1.fc31.x86_64.img 5.3.0-0.rc0.git4.1.fc31.x86_64 Fine. Now I can boot. The next gotcha is that /run/output is a 0-byte file. I don't know why. I've booted with enforcing=0 in case selinux doesn't like strace writing to /run/, but that hasn't made a difference. The change I made in /usr/lib/systemd/system/systemd-journald.service is ExecStart=/usr/bin/strace -D -o /run/output /usr/lib/systemd/systemd-journald I've also tried booting with the original initrd, which doesn't have the systemd-journald modification, relying only on the one on persistent storage. I still get /run/output as a 0-length file. I've tried booting with enforcing=0 in case selinux is denying the write, but that doesn't solve the problem. So I'm still stuck. -- Chris Murphy
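For what it's worth, instead of editing the unit file under /usr/lib inside the initramfs, the same wrapper can be expressed as a drop-in, which survives package updates and which dracut can pick up from /etc (the drop-in file name is arbitrary; whether this helps with the empty /run/output is a separate question):

```
# /etc/systemd/system/systemd-journald.service.d/strace.conf
[Service]
# Clear the unit's original ExecStart=, then re-set it wrapped in strace,
# using the same strace invocation as in the thread above.
ExecStart=
ExecStart=/usr/bin/strace -D -o /run/output /usr/lib/systemd/systemd-journald
```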
Re: [systemd-devel] journald deleting logs on LiveOS boots
On Thu, Jul 18, 2019 at 10:18 AM Dave Howorth wrote:
> > ...snip...
>
> I'm not a fedora user, but I don't think it's that difficult to run
> strace.
>
> To run it once, start your live image and type:
>
> # yum install strace
>
> You will need to reinstall it if you reboot.
>
> To permanently install it apparently you need to configure your USB
> with persistent storage. I haven't looked up how to do that.

I thought about that, but this is a substantial alteration from the original ISO in terms of the storage layout and how everything gets assembled. But it's worth a shot. If it is a systemd bug, then it should still reproduce. If it doesn't reproduce, then chances are it's some kind of assembly-related problem.

Still seems like a systemd-journald bug that neither forward to console nor to kmsg includes any useful systemd or dracut debugging.

-- Chris Murphy
Re: [systemd-devel] journald deleting logs on LiveOS boots
On Thu, Jul 18, 2019 at 4:50 AM Uoti Urpala wrote:
> On Mon, 2019-07-15 at 14:32 -0600, Chris Murphy wrote:
> > ...snip...
>
> Run journald under strace and check the results of the system calls
> used to query space? (One way to run it under strace would be to change
> the unit file to use "strace -D -o /run/output systemd-journald" as the
> process to start.)

It's a good idea, but strace isn't available on Fedora live media. So I either have to learn how to create a custom live medium locally (it's a really complicated process) or convince Fedora to add strace to live media...

-- Chris Murphy
Re: [systemd-devel] journald deleting logs on LiveOS boots
This problem of missing early boot messages is now happening on a default boot of Fedora Live images. I don't have to change any boot parameters to trigger it. This image:

https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20190715.n.1/compose/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-Rawhide-20190715.n.1.iso

Boot it in a VM, launch Terminal and:

[liveuser@localhost-live ~]$ sudo journalctl -o short-monotonic
[sudo] password for liveuser:
-- Logs begin at Tue 2019-07-16 11:58:09 EDT, end at Tue 2019-07-16 15:58:09 EDT. --
[ 24.017711] localhost-live systemd[1472]: Startup finished in 293ms.
[ 24.447054] localhost-live dbus-broker-launch[1496]: Service file '/usr/share/dbus-1/servi...

And neither in what's forwarded to console, nor in what's directed to kmsg, do I see /run/log or /var/log anywhere, so quite a lot of information is not being forwarded and is just lost because of this bug.

Further, I discovered a note in an earlier bug I filed in February 2019 about this problem: "OK I'm not seeing this with systemd-241~rc2-2.fc30.x86_64 in the 20190211 Live media". That leads me to believe this is a systemd regression, but I can't prove it because of this bug. I have no systemd debug messages available to look at...

Fedora Basic Release Criteria: "A system logging infrastructure must be available, enabled by default, and working."

I really think logging is not working, and this should block. But without a really specific cause + solution it's questionable to block.

--- Chris Murphy
Re: [systemd-devel] journald deleting logs on LiveOS boots
On Mon, Jul 15, 2019 at 2:32 PM Chris Murphy wrote:
>
> This is still a problem with systemd-242-5.git7a6d834.fc31.x86_64
>
> If I boot using 'systemd.journald.forward_to_console=1
> console=ttyS0,38400 console=tty1 systemd.log_level=debug rd.debug
> rd.udev.debug'
>
> There is no debug output forwarded to console, only kernel messages
> and normal systemd logging is forwarded.

Going to a working/installed system so I don't have to deal with this bug: I'm still finding all kinds of messages that appear on virt-manager's console, despite the forward-to-console boot param, that do not get forwarded to ttyS0 when I'm connected using 'virsh console $(vmname)'.

So?? Is that a bug? It seems like a bug. Example from "virsh console" output:

[ 25.009292] systemd[1]: Started NTP client/server.
[ 40.797229] input: spice vdagent tablet as /devices/virtual/input/input5

But on the virt-manager console, there's a pile of messages from 26 seconds through 40 seconds. Why are they missing?

-- Chris Murphy
Re: [systemd-devel] journald deleting logs on LiveOS boots
On Mon, Jul 15, 2019 at 2:32 PM Chris Murphy wrote:
>
> If I boot using 'systemd.log_level=debug rd.debug rd.udev.debug
> systemd.log_target=kmsg log_buf_len=64M printk.devkmsg=on'

Another data point: the kmsg dumped into a file is 5MiB, captured after the point in time when journald had done vacuuming on /var/log/journal, which was already 8+ MiB in size. So at least 3MiB of journal messages were not being sent to kmsg.

-- Chris Murphy
Re: [systemd-devel] journald deleting logs on LiveOS boots
This is still a problem with systemd-242-5.git7a6d834.fc31.x86_64

If I boot using 'systemd.journald.forward_to_console=1 console=ttyS0,38400 console=tty1 systemd.log_level=debug rd.debug rd.udev.debug'

There is no debug output forwarded to console; only kernel messages and normal systemd logging are forwarded. And of course this bug means that those debug messages are lost once the vacuuming happens.

If I boot using 'systemd.log_level=debug rd.debug rd.udev.debug systemd.log_target=kmsg log_buf_len=64M printk.devkmsg=on'

For sure a bunch of dracut messages are not being forwarded to kmsg; none of the rd.live.image debug stuff is listed, so I can't even see how things are being assembled for live boots, with time stamps, to see if that stuff might not be ready at the time journald switches from using /run/log to /var/log.

I can't even see in kmsg the journald switch from /run/log to /var/log. That itself seems like a bug: given systemd.log_target=kmsg, I'd like to think that should cause an exact copy of what's going to system.journal to be dumped to kmsg, but apparently that's not the case.

So far nothing I've tried gets me access to information that would give a hint why systemd-journald thinks there's no free space and yet it still decides to create a single 8MB system journal, which then almost immediately gets deleted, including all the evidence up to that point.

For sure sysroot and / are available rw by these points:

<31>[ 10.898648] systemd[1]: sysroot.mount: About to execute: /usr/bin/mount /dev/mapper/live-rw /sysroot
...
<31>[ 12.061370] systemctl[879]: Switching root - root: /sysroot; init: n/a

This is the loss of the journal up to this point:

<47>[ 24.318297] systemd-journald[905]: /var/log/journal/05d0a9c86a0e4bbcb36c5e0082b987ee/system.journal: Allocation limit reached, rotating.
<47>[ 24.318315] systemd-journald[905]: Rotating...
<47>[ 24.332853] systemd-journald[905]: Reserving 147626 entries in hash table.
<47>[ 24.367396] systemd-journald[905]: Vacuuming...
<47>[ 24.389952] systemd-journald[905]: Deleted archived journal /var/log/journal/05d0a9c86a0e4bbcb36c5e0082b987ee/system@2f2d06548b5f4c259693b56558cc89c6-0001-00058dbdb33d1f5e.journal (8.0M).
<47>[ 24.389965] systemd-journald[905]: Vacuuming done, freed 8.0M of archived journals from /var/log/journal/05d0a9c86a0e4bbcb36c5e0082b987ee.
<47>[ 24.390015] systemd-journald[905]: Journal effective settings seal=no compress=yes compress_threshold_bytes=512B
<47>[ 24.390126] systemd-journald[905]: Retrying write.

Retrying what write, and why does it need to retry? What failed?

-- Chris Murphy
Re: [systemd-devel] swap on zram service unit, using Conflicts=umount
OK, so with systemd.log_level=debug I'm seeing at shutdown that there is a swapoff:

[ 406.997210] fmac.local systemd[1]: dev-zram0.swap: About to execute: /sbin/swapoff /dev/zram0
[ 406.997844] fmac.local systemd[1]: dev-zram0.swap: Forked /sbin/swapoff as 1966
[ 406.998308] fmac.local systemd[1]: dev-zram0.swap: Changed active -> deactivating
[ 406.998332] fmac.local systemd[1]: Deactivating swap /dev/zram0...
[ 406.999489] fmac.local systemd[1966]: dev-zram0.swap: Executing: /sbin/swapoff /dev/zram0

This happens after /home and /boot/efi are umounted, but before /boot and / are umounted. This is not unexpected, but I like the idea of no swapoff at reboot if that's possible. Any system under sufficient memory and swap pressure could very well see an at least delayed shutdown if systemd shutdown/reboot actually waits for swapoff to exit successfully.

Is there any interest in a systemd-included swap-on-zram service, to obviate the need for others basically reinventing this wheel? I kinda put it in the category of fstrim.service and the existing systemd cryptswap logic. I expect it wouldn't be enabled by default, but distros could enable or start it as their use cases dictate.

Chris Murphy
Re: [systemd-devel] swap on zram service unit, using Conflicts=umount
On Tue, Jun 25, 2019 at 3:30 AM Zbigniew Jędrzejewski-Szmek wrote: > > On Tue, Jun 25, 2019 at 10:55:27AM +0200, Lennart Poettering wrote: > > On Mo, 24.06.19 13:16, Zbigniew Jędrzejewski-Szmek (zbys...@in.waw.pl) > > wrote: > > > > > > So for tmpfs mounts that don't turn off DefaultDependencies= we > > > > implicit add in an After=swap.target ordering dep. The thinking was > > > > that there's no point in swapping in all data of a tmpfs because we > > > > want to detach the swap device when we are going to flush it all out > > > > right after anyway. This made quite a difference to some folks. > > > > > > But we add Conflicts=umount.target, Before=umount.target, so we do > > > swapoff on all swap devices, which means that swap in the data after all. > > > Maybe that's an error, and we should remove this, at least for > > > normal swap partitions (not files)? > > > > We never know what kind of weird storage swap might be on, I'd > > probably leave that in, as it's really hard to figure out correctly > > when leaving swap on would be safe and when not. > > > > Or to say this differently: if people want to micro-optimize that, > > they by all means should, but in that case they should probably drop > > in their manually crafted .swap unit with DefaultDependencies=no and > > all the ordering in place they need, and nothing else. i.e. I believe > > this kind of optimization is nothing we need to cater for in the > > generic case when swap is configured with /etc/fstab or through GPT > > enumeration. > > Not swapping off would make a nice optimization. Maybe we should > invert this, and "drive" this from the other side: if we get a stop > job for the storage device, then do the swapoff. Then if there are > devices which don't need to stop, we wouldn't swapoff. This would cover > the common case of swap on partition. > > I haven't really thought about the details, but in principle this > should already work, if all the dependencies are declared correctly. 
I like the sound of this. The gotcha with current swap-on-zram units (there are a few floating out there, including this one) is that they conflate two different things: setup and teardown of the zram device, and swapon/swapoff. What's probably better and more maintainable would be a way to set up the zram device with a service unit, and then specify it in /etc/fstab as a swap device so that the usual systemd swapon/off behavior is used. Any opinion here?

(Somewhat off topic: I wish zswap was not still experimental. By enabling zswap with an ordinary swap device, it creates a memory pool (which you can define as a percentage of total RAM) that's compressed for swap before it hits the backing device. Basically, it's like a RAM cache for swap: it'll swap to memory first, and then overflow to a swap partition or file. It also lacks all the weird interfaces of zram.)

https://wiki.archlinux.org/index.php/Improving_performance#Zram_or_zswap

> > zswap is different: we know exactly that the swap data is located in
> > RAM, not on complex storage, hence it's entirely safe to not
> > disassemble it at all, iiuc.
>
> Agreed. It seems that any Conflicts= (including the one I proposed) are
> unnecessary/harmful.

OK, I'll revert the commit that inserts it.

-- Chris Murphy
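The setup/teardown split described above might look something like this; a rough sketch, not a tested recipe, where the unit name, the 2G size, and the reliance on /etc/fstab for activation are all assumptions:

```ini
# zram-setup.service (hypothetical name): only creates and formats the
# device; swapon/swapoff is left to systemd's normal .swap handling.
[Unit]
Description=Create zram device for swap

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/sbin/modprobe zram num_devices=1
ExecStart=/bin/sh -c 'echo 2G > /sys/block/zram0/disksize'
ExecStart=/sbin/mkswap /dev/zram0
# Teardown on stop: return the device to its initial state.
ExecStop=/bin/sh -c 'echo 1 > /sys/block/zram0/reset'
```

With that in place, /etc/fstab would carry an ordinary entry such as '/dev/zram0 none swap defaults 0 0', so the generated dev-zram0.swap unit handles swapon/swapoff with the usual ordering.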
Re: [systemd-devel] swap on zram service unit, using Conflicts=umount
On Mon, Jun 24, 2019 at 6:11 AM Lennart Poettering wrote:
> That said, I don't really grok zram, and not sure why there's any need
> to detach it at all. I mean, if at shutdown we lose compressed RAM
> or lose uncompressed RAM shouldn't really matter. Hence from my
> perspective there's no need for Conflicts= at all, but maybe I am
> missing something?

Huh, yeah. If anything, there could possibly be low-memory systems just barely getting by with swap on zram, where even swapoff at reboot time would cause them to get stuck. It might just be better to clobber it at reboot time?

I'd like to allow a user to 'systemctl stop zram', which does swapoff and removes the zram device. But is there something that could go into the unit file that says "don't wait for swapoff; if everything else is ready for shutdown, go ahead and reboot now"?

-- Chris Murphy
[systemd-devel] swap on zram service unit, using Conflicts=umount
Hi,

I've got a commit to add 'Conflicts=umount.target' to this zram service, based on a bug comment I cited in the commit message. But I'm not certain I understand if it's a good idea or necessary.

https://src.fedoraproject.org/fork/chrismurphy/rpms/zram/c/63900c455e8a53827aed697b9f602709b7897eb2?branch=devel

I figure it's plausible at shutdown time that something is swapped out, and a umount before swapoff could hang (briefly or indefinitely, I don't know), and therefore it's probably better to cause swapoff to happen before umount?

Thanks,

-- Chris Murphy
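For reference, the ordering in question comes down to two lines in the [Unit] section; a sketch of the mechanism rather than a recommendation either way. Conflicts= queues a stop job for the unit when umount.target starts at shutdown, and because stop jobs run in reverse Before=/After= order, Before=umount.target makes the stop (the swapoff) complete before the umounts run:

```ini
[Unit]
# Stop this unit (i.e. swapoff) when umount.target is queued at shutdown...
Conflicts=umount.target
# ...and ensure that stop finishes before umount.target's umounts run.
Before=umount.target
```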
Re: [systemd-devel] journald deleting logs on LiveOS boots
On Wed, Jun 19, 2019 at 6:15 AM Lennart Poettering wrote: > > On Di, 18.06.19 20:34, Chris Murphy (li...@colorremedies.com) wrote: > > > When I boot Fedora 30/Rawhide Workstation (LiveOS install media) in a > > VM with ~2GiB memory, using any combination of systemd, dracut, or > > udev debugging, events are missing from the journal. > > > > systemd-241-7.gita2eaa1c.fc30.x86_64 > > systemd-242-3.git7a6d834.fc31.x86_64 > > > > 'journalctl -b -o short-monotonic' starts at ~22s monotonic time. i.e. > > all the messages before that are not in the journal. This problem > > doesn't happen with netinstalls. Fedora LiveOS boots setup a memory > > based persistent overlay, and /var/log/journal exists and > > what do you mean by "memory based persistent overlay"? if its in > memory it's not persistant, is it? /me is confused... I agree it is an oxymoron. The device-mapper rw ext4 volume where all writes go is memory based, as the install media and the base image are read-only. Since this volume is rw, and it's mounted at /, that makes /var/log/journal rw, and appears to be persistent from the systemd-journald point of view, so it uses it instead of /run/log/journal [9.127691] localhost systemd[1]: Starting Flush Journal to Persistent Storage... > > is this LVM stuff or overlayfs? Neither, only device mapper. [root@localhost-live liveuser]# lvs [root@localhost-live liveuser]# vgs [root@localhost-live liveuser]# pvs [root@localhost-live liveuser]# dmsetup status live-base: 0 13635584 linear live-rw: 0 13635584 snapshot 163328/67108864 648 [root@localhost-live liveuser]# All of this is setup by dracut with the boot parameter 'rd.live.image' > > > systemd-journald tries to flush there, whereas on netinstalls > > /var/log/journal does not exist. > > > > Using systemd.log_target=kmsg I discovered that systemd-journald is > > deleting logs in the LiveOS case, but I don't know why, / has ~750M > > free > > > > [ 24.910792] systemd-journald[922]: Vacuuming... 
> > [ 24.921802] systemd-journald[922]: Deleted archived journal > > /var/log/journal/05d0a9c86a0e4bbcb36c5e0082b987ee/system@818f3f40f19849e08a1b37b9c1e304f1-0001-00058ba31bec725e.journal > > (8.0M). > > [ 24.921808] systemd-journald[922]: Vacuuming done, freed 8.0M of > > archived journals from > > /var/log/journal/05d0a9c86a0e4bbcb36c5e0082b987ee. > > > > I filed a bug here and have nominated it as a blocker bug for the > > F31 cycle. > > Note that journald logs early on how much disk space it assumes to > have free, you might see that in dmesg? (if not, boot with > systemd.journald.forward_to_kmsg=1 on the kernel cmdline) Using systemd.log_level=debug and systemd.journald.forward_to_kmsg=1 I only see: [4.224599] systemd-journald[327]: Fixed min_use=1.0M max_use=145.8M max_size=18.2M min_size=512.0K keep_free=218.7M n_max_files=100 If I boot with defaults (no debug options), I now see the following, which is otherwise lost (not in the journal and not in dmesg, despite systemd.journald.forward_to_kmsg=1) [9.124045] localhost systemd-journald[933]: Runtime journal (/run/log/journal/05d0a9c86a0e4bbcb36c5e0082b987ee) is 8.0M, max 146.6M, 138.6M free. ...snip... [9.179050] localhost systemd-journald[933]: Time spent on flushing to /var is 17.824ms for 936 entries. [9.179050] localhost systemd-journald[933]: System journal (/var/log/journal/05d0a9c86a0e4bbcb36c5e0082b987ee) is 8.0M, max 8.0M, 0B free. Why does it say 0B free? 
[root@localhost-live liveuser]# df -h
Filesystem           Size  Used Avail Use% Mounted on
devtmpfs             1.5G     0  1.5G   0% /dev
tmpfs                1.5G  4.0K  1.5G   1% /dev/shm
tmpfs                1.5G  1.1M  1.5G   1% /run
tmpfs                1.5G     0  1.5G   0% /sys/fs/cgroup
/dev/sr0             1.9G  1.9G     0 100% /run/initramfs/live
/dev/mapper/live-rw  6.4G  5.5G  879M  87% /
tmpfs                1.5G  4.0K  1.5G   1% /tmp
vartmp               1.5G     0  1.5G   0% /var/tmp
tmpfs                294M     0  294M   0% /run/user/0
tmpfs                294M  4.0K  294M   1% /run/user/1000
[root@localhost-live liveuser]# free -m
              total        used        free      shared  buff/cache   available
Mem:           2932         223        1754           1         954        2435
Swap:             0           0           0
[root@localhost-live liveuser]# cat /etc/systemd/journald.conf
# This file is part of systemd.
#
# systemd is free software; you can redistribute it and/or modify it
# under the terms of the GNU Lesser General Public License as published by
# the Free Software Foundation; either version 2.1 of the License, or
# (at your option) any later version.
#
# Entries in this file show the compile time defaults.
# You can change settings by ed
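As an aside, the limits journald computed in the debug output above (max_use, keep_free, max_size) correspond to journald.conf settings; an illustrative fragment, with example values rather than recommendations:

```ini
[Journal]
SystemMaxUse=100M       # overall cap for /var/log/journal ("max_use")
SystemKeepFree=200M     # space journald tries to leave free ("keep_free")
SystemMaxFileSize=16M   # per-journal-file cap ("max_size")
```

When unset, journald derives these from the size of the filesystem it writes to, which is what makes the "0B free" computation on the live overlay interesting.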
[systemd-devel] journald deleting logs on LiveOS boots
When I boot Fedora 30/Rawhide Workstation (LiveOS install media) in a VM with ~2GiB memory, using any combination of systemd, dracut, or udev debugging, events are missing from the journal.

systemd-241-7.gita2eaa1c.fc30.x86_64
systemd-242-3.git7a6d834.fc31.x86_64

'journalctl -b -o short-monotonic' starts at ~22s monotonic time, i.e. all the messages before that are not in the journal. This problem doesn't happen with netinstalls. Fedora LiveOS boots set up a memory-based persistent overlay, and /var/log/journal exists and systemd-journald tries to flush there, whereas on netinstalls /var/log/journal does not exist.

Using systemd.log_target=kmsg I discovered that systemd-journald is deleting logs in the LiveOS case, but I don't know why; / has ~750M free:

[ 24.910792] systemd-journald[922]: Vacuuming...
[ 24.921802] systemd-journald[922]: Deleted archived journal /var/log/journal/05d0a9c86a0e4bbcb36c5e0082b987ee/system@818f3f40f19849e08a1b37b9c1e304f1-0001-00058ba31bec725e.journal (8.0M).
[ 24.921808] systemd-journald[922]: Vacuuming done, freed 8.0M of archived journals from /var/log/journal/05d0a9c86a0e4bbcb36c5e0082b987ee.

I filed a bug and have nominated it as a blocker bug for the F31 cycle:

LiveOS boot, journalctl is missing many early messages
https://bugzilla.redhat.com/show_bug.cgi?id=1715699

Comments 8 and 9 are most relevant at this point (rather noisy troubleshooting before that).

A related question is whether it's even appropriate for /var/log/journal to exist on LiveOS boots, rather than just having journals go to /run/log/journal. Obviously we must have the entire journal for LiveOS boots, from monotonic time 0.0, no matter the debug options chosen; I'd say for at least 5 minutes. Almost immediately dropping the first 30s just because I've got debug options set is a big problem for troubleshooting other problems with early startup of install images.
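If the answer is that journals on live boots should just stay in /run/log/journal, journald's Storage= setting can express that; a sketch, assuming the live image could ship a configuration drop-in (the path is illustrative):

```ini
# e.g. /etc/systemd/journald.conf.d/liveos.conf (hypothetical path)
[Journal]
# volatile: keep journal data in /run/log/journal only; never flush to /var.
Storage=volatile
```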
Thanks,

-- Chris Murphy
Re: [systemd-devel] transient hang when starting cryptography setup for swap
This is still a bug on Fedora 30. I can only reproduce it on one computer; I'm not sure why.

The working case:

[ 23.310555] flap.local systemd[1]: dev-disk-by\x2duuid-dcae3053\x2d1cc2\x2d4890\x2da33b\x2d6d71b3dc97df.device: Changed dead -> plugged
[ 23.310639] flap.local systemd[1]: dev-disk-by\x2did-dm\x2dname\x2dcryptswap.device: Changed dead -> plugged
[ 23.310681] flap.local systemd[1]: dev-mapper-cryptswap.device: Changed dead -> plugged
[ 23.310724] flap.local systemd[1]: dev-mapper-cryptswap.device: Job 165 dev-mapper-cryptswap.device/start finished, result=done
[ 23.310765] flap.local systemd[1]: Found device /dev/mapper/cryptswap.

In the non-working case, that never happens. systemd just doesn't see the swap device appear. But in the early debug shell, 'blkid' sees it.

Chris Murphy
[systemd-devel] 5.2rc2, circular lock warning systemd-journal and btrfs_page_mkwrite
] fmac.local kernel: ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
[7.887408] fmac.local kernel: __mutex_lock+0x92/0x930
[7.887422] fmac.local kernel: ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
[7.887425] fmac.local kernel: ? rcu_read_lock_sched_held+0x6b/0x80
[7.887427] fmac.local kernel: ? module_assert_mutex_or_preempt+0x14/0x40
[7.887441] fmac.local kernel: ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
[7.887443] fmac.local kernel: ? sched_clock_cpu+0xc/0xc0
[7.887458] fmac.local kernel: ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
[7.887471] fmac.local kernel: btrfs_record_root_in_trans+0x44/0x70 [btrfs]
[7.887486] fmac.local kernel: start_transaction+0x95/0x4f0 [btrfs]
[7.887501] fmac.local kernel: btrfs_dirty_inode+0x44/0xd0 [btrfs]
[7.887503] fmac.local kernel: file_update_time+0xeb/0x140
[7.887518] fmac.local kernel: btrfs_page_mkwrite+0xfe/0x570 [btrfs]
[7.887520] fmac.local kernel: ? find_held_lock+0x32/0x90
[7.887522] fmac.local kernel: ? sched_clock+0x5/0x10
[7.887524] fmac.local kernel: do_page_mkwrite+0x2f/0x100
[7.887526] fmac.local kernel: do_wp_page+0x306/0x570
[7.887529] fmac.local kernel: __handle_mm_fault+0xce8/0x1730
[7.887532] fmac.local kernel: handle_mm_fault+0x16e/0x370
[7.887534] fmac.local kernel: do_user_addr_fault+0x1f9/0x480
[7.887536] fmac.local kernel: do_page_fault+0x33/0x210
[7.887538] fmac.local kernel: ? page_fault+0x8/0x30
[7.887540] fmac.local kernel: page_fault+0x1e/0x30
[7.887541] fmac.local kernel: RIP: 0033:0x7f97107ea383
[7.887544] fmac.local kernel: Code: ec 08 89 ee 49 89 d8 31 d2 6a 00 48 8b 4c 24 18 4c 89 f7 4c 8d 4c 24 30 e8 1a d8 ff ff 59 5e 85 c0 78 49 48 8b 44 24 20 31 d2 <48> 89 58 08 48 8b 5c 24 08 c7 40 01 00 00 00 00 66 89 50 05 c6 40
[7.887545] fmac.local kernel: RSP: 002b:7ffc79ed77a0 EFLAGS: 00010246
[7.887546] fmac.local kernel: RAX: 7f970eca84a8 RBX: 005d RCX:
[7.887548] fmac.local kernel: RDX: RSI: 7f97107ea408 RDI: 55d0300e8160
[7.887549] fmac.local kernel: RBP: 0001 R08: 0001 R09: 55d0300e8160
[7.887550] fmac.local kernel: R10: 7ffc79f6a080 R11: 35fc R12: 7ffc79ed78c8
[7.887551] fmac.local kernel: R13: 7ffc79ed78c0 R14: 55d0300efa60 R15: 00c3d9ef

-- Chris Murphy
Re: [systemd-devel] EFI loader partition unknown
On Wed, May 8, 2019 at 3:52 AM Lennart Poettering wrote:
> On Mo, 06.05.19 10:26, Chris Murphy (li...@colorremedies.com) wrote:
>
> > ...snip...
> >
> > Looks like it wants to mount root, but it's already mounted and hence
> > busy. Btrfs lets you do that, ext4 and XFS don't, they need to be bind
> > mounted instead. Just a guess.
>
> OK, so this is misleading. "systemd-dissect" does two things: first it
> tries to make sense of the partition table and what to do with
> it. Then it tries to extract OS metadata from the file systems itself
> (i.e. read /etc/machine-id + /etc/os-release). The latter part fails
> for some reason (probably because the mount options are different than
> what is already mounted, some file systems are more allergic to that
> than others), but that shouldn't really matter, as that is
> not used by systemd-gpt-generator.
>
> Hmm, one question, which boot loader are you using? Note that the ESP

GRUB, which I think acts mainly as a boot manager and to support Secure Boot, with EFI STUB as the actual bootloader.
> mounting logic only works if the boot loader tells us which partition
> it was booted from. This is an extra check to ensure that we only
> mount the correct ESP, the one that was actually used. In other words,
> this only works with a boot loader that implements the relevant part
> of https://systemd.io/BOOT_LOADER_INTERFACE.html i.e. the

I'm willing to bet 99% of the world's computers have one ESP. In fact I'd be surprised if it's even 1% that have 2 or more. I'm not convinced the UEFI spec really sanctions 2 or more ESPs, even if it's not outright proscribed. The language of the spec consistently says there's one.

> LoaderDevicePartUUID efi var. We probably should document that better
> in systemd-gpt-generator(8) though, could you please file a bug about
> that?
>
> In other words: use sd-boot, and all that stuff just works. With grub
> it doesn't, it doesn't let us know any of the bits we want to know.

OK, requiring a specific bootloader really isn't consistent with the language used in the discoverable partitions specification. If in reality what's needed to automatically mount to /efi is not only a partition type GUID but some bootloader-specific metadata inserted into memory at boot time, that's not a generic solution like the other discoverable partition types.

https://www.freedesktop.org/wiki/Specifications/DiscoverablePartitionsSpec/

-- Chris Murphy
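The interface Lennart describes comes down to the loader publishing the ESP's partition UUID in a single EFI variable under systemd's boot-loader-interface vendor GUID. A rough way to check from a booted system whether the boot loader set it; the efivarfs decoding details here are my assumption, not from the thread:

```shell
# LoaderDevicePartUUID under the systemd boot-loader-interface vendor GUID.
var=/sys/firmware/efi/efivars/LoaderDevicePartUUID-4a67b082-0a4c-41cf-b6c7-440b29bb8c4f
if [ -r "$var" ]; then
    # efivarfs prefixes a 4-byte attribute word; the payload is UTF-16LE,
    # so dropping NUL bytes recovers the ASCII UUID string.
    tail -c +5 "$var" | tr -d '\0'; echo
else
    echo "LoaderDevicePartUUID not set (loader does not implement the interface)"
fi
```

On systemd versions that have it, 'bootctl status' reports the same thing as part of its boot loader feature list.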
Re: [systemd-devel] EFI loader partition unknown
On Mon, May 6, 2019 at 10:26 AM Chris Murphy wrote:
> Looks like it wants to mount root, but it's already mounted and hence
> busy. Btrfs lets you do that, ext4 and XFS don't, they need to be bind
> mounted instead. Just a guess.

Nope, that's not correct. I can mount /dev/vda2 on /mnt just fine.

/dev/vda2 on / type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
[snip]
/dev/vda2 on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)

-- Chris Murphy
Re: [systemd-devel] EFI loader partition unknown
Waiting for device (parent + 2 partitions) to appear...
Found writable 'root' partition (UUID 87d5a92987174be9ad216482074d1409) of type xfs without verity on partition #2 (/dev/vda2)
Found writable 'esp' partition (UUID b5aa8c29b4ab4021b2b22326860bda97) of type vfat on partition #1 (/dev/vda1)
[Detaching after fork from child process 8612]
Successfully forked off '(sd-dissect)' as PID 8612.
Mounting xfs on /tmp/dissect-h21Wp5 (MS_RDONLY|MS_NODEV "")...
Failed to mount /dev/vda2 (type xfs) on /tmp/dissect-h21Wp5 (MS_RDONLY|MS_NODEV ""): Device or resource busy
Failed to mount dissected image: Device or resource busy
Failed to read /etc/hostname: No such file or directory
/etc/machine-id file is empty.
(sd-dissect) failed with exit status 1.
Failed to acquire image metadata: Protocol error
[Inferior 1 (process 8608) exited with code 01]
(gdb) quit

Looks like it wants to mount root, but it's already mounted and hence busy. Btrfs lets you do that, ext4 and XFS don't, they need to be bind mounted instead. Just a guess.

-- Chris Murphy
Re: [systemd-devel] EFI loader partition unknown
On Fri, May 3, 2019 at 2:23 AM Lennart Poettering wrote:
> On Fr, 03.05.19 00:37, Chris Murphy (li...@colorremedies.com) wrote:
>
> > ...snip...
>
> What does '/usr/lib/systemd/systemd-dissect /dev/vda' say?

[chris@localhost ~]$ sudo /usr/lib/systemd/systemd-dissect /dev/vda
Found writable 'root' partition (UUID 87d5a92987174be9ad216482074d1409) of type xfs without verity on partition #2 (/dev/vda2)
Found writable 'esp' partition (UUID b5aa8c29b4ab4021b2b22326860bda97) of type vfat on partition #1 (/dev/vda1)
Failed to acquire image metadata: Protocol error
[chris@localhost ~]$

-- Chris Murphy
[systemd-devel] EFI loader partition unknown
systemd-242-3.git7a6d834.fc31.x86_64

With the Fedora /etc/fstab entry for /boot/efi commented out, systemd isn't discovering the EFI system partition and mounting it. The /efi directory exists, and I've tried to boot with both enforcing=0 and selinux=0 due to an SELinux bug [1], but systemd doesn't even attempt to mount it. Should I file an upstream bug report?

[6.275104] frawvm.local systemd-gpt-auto-generator[636]: Waiting for device (parent + 3 partitions) to appear...
[6.281265] frawvm.local systemd-gpt-auto-generator[636]: EFI loader partition unknown.

blkid reports:
/dev/vda1: SEC_TYPE="msdos" UUID="927C-932C" TYPE="vfat" PARTLABEL="EFI System Partition" PARTUUID="0e3a48c0-3f1b-4ca7-99f4-32fd1d831cdc"

gdisk reports:
Partition number (1-3): 1
Partition GUID code: C12A7328-F81F-11D2-BA4B-00A0C93EC93B (EFI System)
Partition unique GUID: 0E3A48C0-3F1B-4CA7-99F4-32FD1D831CDC
First sector: 2048 (at 1024.0 KiB)
Last sector: 411647 (at 201.0 MiB)
Partition size: 409600 sectors (200.0 MiB)
Attribute flags:
Partition name: 'EFI System Partition'

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1293725

--
Chris Murphy
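For context, systemd-gpt-auto-generator identifies the ESP by its GPT partition *type* GUID rather than the filesystem type blkid reports. A trivial sanity check against the gdisk output above (a comparison sketch only; the generator's real matching logic lives in systemd's C code):

```python
# Well-known GPT type GUID for an EFI System Partition.
ESP_TYPE_GUID = "c12a7328-f81f-11d2-ba4b-00a0c93ec93b"

def is_esp(type_guid: str) -> bool:
    """Case-insensitive comparison of a GPT partition type GUID to the ESP GUID."""
    return type_guid.strip().lower() == ESP_TYPE_GUID

# The "Partition GUID code" gdisk reports for partition #1 above:
print(is_esp("C12A7328-F81F-11D2-BA4B-00A0C93EC93B"))  # True
```

Since the type GUID here is correct, the generator's "EFI loader partition unknown" message presumably comes from a different condition than a GUID mismatch.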
Re: [systemd-devel] transient hang when starting cryptography setup for swap
I wonder if this is related to the late random seed loading, or possibly a race with it? I'd expect only cryptsetup, needing a key derived from /dev/urandom, to want random data; yet I always see the hang after cryptsetup and mkswap succeed, with no corresponding swapon. I'm beginning to wonder if it's a kernel bug, but even with Fedora debug kernels I'm not seeing any information that explains these long gaps in the journal.

[ 14.623373] flap.local systemd[1]: var-lib-systemd-random\x2dseed.mount: Collecting.
...
[ 25.319006] flap.local audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-random-seed comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
...
[ 25.336505] flap.local systemd[582]: systemd-random-seed.service: Executing: /usr/lib/systemd/systemd-random-seed load

The first entry already seems late in startup, and then it's not until ~10 seconds later that the load is actually executed.

--
Chris Murphy
Re: [systemd-devel] transient hang when starting cryptography setup for swap
OK, the only thing running is `dev-mapper-eswap.device`; everything else is waiting. But even looking at its status doesn't reveal why it's waiting. I've updated the bug.

--
Chris Murphy
Re: [systemd-devel] transient hang when starting cryptography setup for swap
On Mon, Mar 25, 2019 at 3:18 AM Lennart Poettering wrote:
>
> On Do, 21.03.19 19:36, Chris Murphy (li...@colorremedies.com) wrote:
>
> > Hi,
> >
> > Problem Summary (which I forgot to put in the bug):
> > /etc/crypttab configured to use /dev/urandom to generate a new key
> > each boot for encrypted device to be used as swap. The dmcrypt device
> > is successfully created, and mkswap succeeds, but somewhere just
> > before (?) swapon the job gets stuck and boot hangs indefinitely.
> > There is no stuck swapon process listed by ps.
> >
> > This happens maybe 1 in 10 boots with Fedora 29. And maybe 1 in 2
> > boots with Fedora 30. I guess it could be a race of some kind? I'm not
> > really sure.
> >
> > I filed a bug with attachments and details here:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1691589
> >
> > But it's not a great bug report yet because I don't have enough
> > information why it's hanging. Ordinarily I'd use
> > `systemd.log_level=debug` but the problem never happens so far if I
> > use it. So I'm looking for advice on getting more information why it's
> > stuck.
>
> See:
>
> https://freedesktop.org/wiki/Software/systemd/Debugging/#index1h1
>
> In particular, you might want to do the "systemctl enable
> debug-shell.service" thing, so that you can do "systemctl list-jobs"
> and similar when it hangs to figure out what's going on.

Drat. I didn't see this until just now, so I didn't do list-jobs. However, I finally got a boot with systemd.log_level=debug to hang, so maybe that's useful? I attached it to the bug report:
https://bugzilla.redhat.com/show_bug.cgi?id=1691589

Next time it happens I'll do list-jobs.

--
Chris Murphy
[systemd-devel] transient hang when starting cryptography setup for swap
Hi,

Problem Summary (which I forgot to put in the bug): /etc/crypttab is configured to use /dev/urandom to generate a new key each boot for an encrypted device to be used as swap. The dmcrypt device is successfully created, and mkswap succeeds, but somewhere just before (?) swapon the job gets stuck and boot hangs indefinitely. There is no stuck swapon process listed by ps.

This happens maybe 1 in 10 boots with Fedora 29, and maybe 1 in 2 boots with Fedora 30. I guess it could be a race of some kind? I'm not really sure.

I filed a bug with attachments and details here:
https://bugzilla.redhat.com/show_bug.cgi?id=1691589

But it's not a great bug report yet because I don't have enough information about why it's hanging. Ordinarily I'd use `systemd.log_level=debug`, but so far the problem never happens if I use it. So I'm looking for advice on getting more information about why it's stuck.

--
Chris Murphy
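For reference, the setup described above looks roughly like the following fragment (the device path is a placeholder, not taken from the actual bug report; the mapping name "eswap" matches the `dev-mapper-eswap.device` unit mentioned later in the thread):

```
# /etc/crypttab: derive a fresh key from /dev/urandom every boot;
# the "swap" option makes the device be formatted as swap after setup
eswap  /dev/disk/by-partlabel/swap  /dev/urandom  swap

# /etc/fstab: use the resulting dm-crypt device as swap
/dev/mapper/eswap  none  swap  defaults  0 0
```

With this arrangement the key is never stored anywhere, so the swap contents are unrecoverable after poweroff, which is the point of the per-boot random key.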
Re: [systemd-devel] /var/log/journal full, journald is not removing journal files
On Thu, Jul 12, 2018 at 8:28 AM, Michal Koutný wrote:
> Hi Chris.
>
> On 07/11/2018 09:44 PM, Chris Murphy wrote:
>> Somehow journald would not
>> delete its own files until I had deleted a few manually.
> Indeed, see man page update [1] added recently for more details. I
> assume your space was occupied by active journal files. Do you have any
> detailed break down of /var/log/journal contents?
>
> Michal
>
> [1] https://github.com/systemd/systemd/commit/1a0d353b44e

It seems to be working automatically now that it's been cleaned out manually, with SystemMaxUse=1200M (which journald translates to 1.1G) and ~400MB of free space.

Jul 17 06:42:38 f28h.local systemd-journald[472]: System journal (/var/log/journal/bbe68372db9f4c589a1f67f008e70864) is 1.1G, max 1.1G, 0B free.

My original expectation was that SystemMaxUse=2G, the same size as the file system, would be limited by the default behavior of SystemKeepFree=, which the man page says is 15% of the file system size. In theory that would ensure 300MB free at all times. But that turns out not to be the case: it was definitely not deleting archived files in that configuration.

--
Chris Murphy
Re: [systemd-devel] /var/log/journal full, journald is not removing journal files
So I went back to the default journald.conf and rebooted, and /var/log/journal is still 100% full. So I manually deleted a bunch of files, getting to 60% full. Reboot again. And now journald does some kind of cleanup, and /var/log/journal is 28% full. Somehow journald would not delete its own files until I had deleted a few manually.

The default is too small, so for now I've gone back to SystemMaxUse=1200M, which causes systemd-journald at next reboot:

Jul 11 13:34:30 f28h.local systemd-journald[482]: System journal (/var/log/journal/bbe68372db9f4c589a1f67f008e70864) is 216.2M, max 1.1G, 983.7M free.

Not much to go on.

Chris Murphy
[systemd-devel] /var/log/journal full, journald is not removing journal files
systemd-238-8.git0e0aa59.fc28.x86_64

I'm really confused by what I'm seeing.

Jul 10 09:13:40 f28h.local systemd-journald[493]: System journal (/var/log/journal/bbe68372db9f4c589a1f67f008e70864) is 1.2G, max 1.3G, 90.0M free.

[chris@f28h ~]$ du -sh /var/log/journal/bbe68372db9f4c589a1f67f008e70864/
1.6G    /var/log/journal/bbe68372db9f4c589a1f67f008e70864/
[chris@f28h ~]$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p7  1.9G  1.9G   73M  97% /var/log

journald.conf is Fedora default except SystemMaxUse=2G

1. Somehow systemd-journald is deciding to max at 1.3G instead of the specified 2G, which is fine, but I don't know how it arrives at this.

2. Clearly "max 1.3G" is being busted: the contents of /var/log/journal are greater than the max. Non-journal files total ~83M and have not grown at all in the last week (and multiple reboots).

3. If I change to SystemMaxUse=1300M there is no change. No attempt by journald to clean up /var/log/journal, and no errors; it uses /run/log instead and never switches to persistent logging.

4. My reading of man journald.conf is that SystemKeepFree= defaults to 15% of the file system space, so even with SystemMaxUse=2G journald should have deleted journal files before getting to 100% full.

5. If I boot with systemd.log_level=debug, there are no journald entries that help explain why there's no transition from volatile to persistent storage, i.e. "hey, /var/log/journal is full, and I can't delete files because $reasons", or whatever.

Extra info: this is a 2G f2fs file system mounted at /var/log. It seems to be working well except for this little problem, and I don't see it being the cause; journald isn't even attempting to delete its own journal files to free up space. Anyway, the main thing that has me confused is the "max 1.3G" statement, which has been the same since the file system was 5% full upon creation, and then journals grew all the way to ENOSPC without any of them being deleted.
--
Chris Murphy
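The expectation in point 4 can be written down as a back-of-the-envelope calculation. This is a sketch of how I read the documented limits as combining (an assumption based on man journald.conf, not journald's actual code, which evidently arrives at different numbers):

```python
# Expected interaction of SystemMaxUse= and SystemKeepFree= per my reading of
# the man page: SystemKeepFree= defaults to 15% of the file system size, and
# the effective cap should be whichever limit bites first.
def expected_cap(system_max_use, fs_size, keep_free=None):
    if keep_free is None:
        keep_free = fs_size * 15 // 100  # documented default: 15% of fs size
    return min(system_max_use, fs_size - keep_free)

GiB = 1 << 30
# 2G file system with SystemMaxUse=2G: by this reading the cap should be
# roughly 1.7G, yet journald reports "max 1.3G" and then blows past it anyway.
print(expected_cap(2 * GiB, 2 * GiB))  # 1825361101 bytes, ~1.7G
```

Whatever clamping journald actually applies, neither the reported 1.3G cap nor this expected 1.7G figure matches the observed behavior of filling the file system to ENOSPC.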
Re: [systemd-devel] journal always corrupt
On Thu, Jun 7, 2018 at 10:32 AM, Mantas Mikulėnas wrote:
> On Thu, Jun 7, 2018 at 4:21 AM Chris Murphy wrote:
>>
>> [chris@f28h ~]$ sudo journalctl --verify
>> 15f1c8: Data object references invalid entry at 4855f8
>> File corruption detected at
>> /run/log/journal/bbe68372db9f4c589a1f67f008e70864/system.journal:4854c0
>> (of 8388608 bytes, 56%).
>> FAIL: /run/log/journal/bbe68372db9f4c589a1f67f008e70864/system.journal
>> (Bad message)
>> PASS: /var/log/journal/bbe68372db9f4c589a1f67f008e70864/system.journal
>> PASS: /var/log/journal/bbe68372db9f4c589a1f67f008e70864/user-1000.journal
>> [chris@f28h ~]$ ls -l /run/log/journal/bbe68372db9f4c589a1f67f008e70864/
>> total 8192
>> -rw-r-+ 1 root systemd-journal 8388608 Jun 6 14:28 system.journal
>> [chris@f28h ~]$
>>
>> systemd-238-8.git0e0aa59.fc28.x86_64
>>
>> It doesn't seem to matter whether this is on volatile or persistent
>> media, the very first journal file has corruption, subsequent ones
>> don't. I'm not sure how to troubleshoot this.
>
> More precisely, it's the *active* journal file, the one that journald is
> currently writing to. If it has been just a few seconds since the last
> write, you can probably safely assume that it's not fully flushed to disk
> yet. (This can apply to user-* journals as well, but they're relatively low
> traffic and so less likely to be online at the moment.)

This is a recent behavior.

--
Chris Murphy
[systemd-devel] journal always corrupt
[chris@f28h ~]$ sudo journalctl --verify
15f1c8: Data object references invalid entry at 4855f8
File corruption detected at /run/log/journal/bbe68372db9f4c589a1f67f008e70864/system.journal:4854c0 (of 8388608 bytes, 56%).
FAIL: /run/log/journal/bbe68372db9f4c589a1f67f008e70864/system.journal (Bad message)
PASS: /var/log/journal/bbe68372db9f4c589a1f67f008e70864/system.journal
PASS: /var/log/journal/bbe68372db9f4c589a1f67f008e70864/user-1000.journal
[chris@f28h ~]$ ls -l /run/log/journal/bbe68372db9f4c589a1f67f008e70864/
total 8192
-rw-r-+ 1 root systemd-journal 8388608 Jun 6 14:28 system.journal
[chris@f28h ~]$

systemd-238-8.git0e0aa59.fc28.x86_64

It doesn't seem to matter whether this is on volatile or persistent media: the very first journal file has corruption, subsequent ones don't. I'm not sure how to troubleshoot this.

[chris@f28h ~]$ sudo journalctl --verify
32ccc28: Invalid data object at hash entry 5179 of 233016
File corruption detected at /var/log/journal/bbe68372db9f4c589a1f67f008e70864/system.journal:32cca20 (of 58720256 bytes, 90%).
FAIL: /var/log/journal/bbe68372db9f4c589a1f67f008e70864/system.journal (Bad message)
PASS: /var/log/journal/bbe68372db9f4c589a1f67f008e70864/user-1000.journal

--
Chris Murphy
[systemd-devel] checking for resume= cmdline with hybrid suspend hibernate
I'm seeing this on a Fedora 28 system with systemd-238-8.git0e0aa59.fc28.x86_64

Jun 02 15:11:33 f28h.local systemd[1]: Starting Hybrid Suspend+Hibernate...

[chris@f28h ~]$ sudo cat /proc/cmdline
BOOT_IMAGE=/root28/boot/vmlinuz-4.17.0-0.rc7.git1.1.fc29.x86_64 root=UUID=2662057f-e6c7-47fa-8af9-ad933a22f6ec ro rootflags=subvol=root28 rhgb quiet rd.luks=0 rd.lvm=0 rd.md=0 rd.dm=0 enable_mtrr_cleanup=1 zswap.enabled=1 zswap.max_pool_percent=25 zswap.compressor=lz4 LANG=en_US.UTF-8 no_console_suspend

There is no resume= for the kernel to find the resulting hibernation image (and I'm going to set aside that I'm pretty sure Fedora's kernels don't even support restoring hibernation images when Secure Boot is enabled). At one time I thought systemd had a check for whether the cmdline contained a resume hint for the kernel to find the hibernation image, and wouldn't do any variant of hibernation if that hint is not present. But I guess not?

I'm pretty sure it's the DE that's asking for hybrid suspend+hibernate, but I don't know whose domain it is to check whether hibernate is actually supported. I'm vaguely aware of CanHibernate() but I don't know what all is included in that test.

Thanks,

--
Chris Murphy
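The check I had in mind is simple enough to sketch. This is my guess at what such a gate would look like (pure command-line parsing; it is not what systemd or CanHibernate() actually does, and CanHibernate() presumably weighs more than just the cmdline):

```python
# Sketch: refuse hibernation when the kernel command line carries no hint
# telling the next kernel where to find the hibernation image.
def has_resume_hint(cmdline: str) -> bool:
    """True if any cmdline token looks like a resume= or resume_offset= hint."""
    return any(tok.startswith(("resume=", "resume_offset="))
               for tok in cmdline.split())

# Abbreviated version of the cmdline quoted above: no resume= present.
print(has_resume_hint("root=UUID=2662057f ro rhgb quiet zswap.enabled=1"))  # False
```

On a real system the input would come from reading /proc/cmdline, as in the transcript above.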