[systemd-devel] Finding a block device quickly with libblkid

2024-03-01 Thread Eric Curtin
Hi Guys,

We are looking into optimizing the boot sequence of a device with many
partitions.

On boot in the default systemd implementation, all the block devices
are queried via libblkid and the various symlinks are set up in
/dev/disk/* from the results of those queries. The problem is on a
device with many partitions this can delay the boot by hundreds of
milliseconds, which is not ideal, especially when in many cases all
you really care about is mounting the block device that represents the
rootfs partition. We can sort of guess "/dev/sde38" is the correct
one, but that's not deterministic.

So we started digging and came across blkid_find_dev_with_tag and
blkid_dev_devname, which you can call like this:

blkid_dev_devname(blkid_find_dev_with_tag(cache, "PARTLABEL", "system_a"))

blkid_dev_devname(blkid_find_dev_with_tag(cache, "PARTLABEL", "system_b"))

At first glance this looks useful, as you don't have to loop through
all the devices to use it.

But this function only seems to work if the data is already cached, so
it's not so useful on boot.

Has anyone any ideas on how we can optimize the identification of a
block device via UUID, LABEL, PARTUUID, PARTLABEL, etc.? Because the
current implementations don't scale well when you have many block
devices.
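
One direction that might help, as a sketch only: PARTLABEL and PARTUUID
are stored in the GPT of the parent disk rather than inside each
partition, so a single read of the disk's partition table could answer a
PARTLABEL lookup without probing every partition (UUID and LABEL live in
the filesystem superblock, so they would still need a per-partition
probe). The C sketch below scans an in-memory copy of the first sectors
of a disk. The function name gpt_find_partlabel and its parameters are
hypothetical, it handles ASCII labels only, and it skips the header CRC
and backup-header checks a real implementation would want.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a little-endian integer of n bytes (n <= 8) starting at p. */
static uint64_t le_read(const uint8_t *p, size_t n)
{
    uint64_t v = 0;

    assert(n <= 8);
    while (n--)
        v = (v << 8) | p[n];
    return v;
}

/*
 * Hypothetical helper: look for a GPT partition named `label` in an
 * in-memory copy of the start of a disk. Returns the 1-based partition
 * number (the N in /dev/sdXN), or -1 if not found or not a GPT.
 */
int gpt_find_partlabel(const uint8_t *disk, size_t len,
                       size_t sector_size, const char *label)
{
    size_t label_len = strlen(label);
    const uint8_t *hdr;
    uint64_t entries_lba, off;
    uint32_t count, entry_size, i;

    if (sector_size < 92 || len < 2 * sector_size)
        return -1;
    if (label_len == 0 || label_len > 36)   /* names hold 36 UTF-16 units */
        return -1;

    hdr = disk + sector_size;               /* GPT header is in LBA 1 */
    if (memcmp(hdr, "EFI PART", 8) != 0)
        return -1;

    entries_lba = le_read(hdr + 72, 8);     /* start of the entry array */
    count = (uint32_t)le_read(hdr + 80, 4); /* number of entries */
    entry_size = (uint32_t)le_read(hdr + 84, 4);
    if (entry_size < 128 || entries_lba > len / sector_size)
        return -1;

    off = entries_lba * sector_size;
    for (i = 0; i < count; i++, off += entry_size) {
        const uint8_t *name;
        size_t j;

        if (off + entry_size > len)
            break;
        name = disk + off + 56;             /* UTF-16LE partition name */
        for (j = 0; j < label_len; j++)
            if (name[2 * j] != (uint8_t)label[j] || name[2 * j + 1] != 0)
                break;
        if (j != label_len)
            continue;
        /* The stored name must end where the label ends. */
        if (label_len == 36 ||
            (name[2 * label_len] == 0 && name[2 * label_len + 1] == 0))
            return (int)(i + 1);
    }
    return -1;
}
```

With the first few sectors of the disk read into a buffer, a call such
as gpt_find_partlabel(buf, len, 512, "system_a") would return the
partition number to append to the disk name.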

I suspect we may not be the first to encounter this, so just probing
to see if anyone had ideas on how to solve this in the past.

Is mise le meas/Regards,

Eric Curtin



Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-18 Thread Eric Curtin
Yes, your understanding is correct. I'm off at the moment, we will try and
open a PR sometime to explain it better.

By the way, I'd also happily review your PR if you think you could
explain it better.

At the moment it's a loopback mounted file from /boot, mounted as an erofs
with a transient overlay on top. There is basically a corresponding
initoverlayfs file for each initramfs file.
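
As a rough illustration of that layout (not the actual initoverlayfs
code), the sequence could look like the C sketch below. It assumes the
image file is already attached to a loop device, that the transient
upper layer is a tmpfs, and that all mount points exist; the function
names and paths are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Build the overlayfs option string; returns 0, or -1 if buf is too small. */
int overlay_opts(char *buf, size_t n, const char *lower,
                 const char *upper, const char *work)
{
    int r;

    assert(buf && n > 0);
    r = snprintf(buf, n, "lowerdir=%s,upperdir=%s,workdir=%s",
                 lower, upper, work);
    return (r < 0 || (size_t)r >= n) ? -1 : 0;
}

/*
 * Mount a read-only erofs from loopdev on `lower`, a tmpfs on `scratch`
 * for the transient upper/work directories, and the overlay on `target`.
 * Needs root; returns 0 on success, -1 on the first failure.
 */
int mount_initoverlayfs(const char *loopdev, const char *lower,
                        const char *scratch, const char *target)
{
    char upper[256], work[256], opts[1024];

    if (snprintf(upper, sizeof upper, "%s/upper", scratch) >= (int)sizeof upper ||
        snprintf(work, sizeof work, "%s/work", scratch) >= (int)sizeof work ||
        overlay_opts(opts, sizeof opts, lower, upper, work) < 0)
        return -1;

    if (mount(loopdev, lower, "erofs", MS_RDONLY, NULL) < 0)
        return -1;
    if (mount("tmpfs", scratch, "tmpfs", 0, NULL) < 0)
        return -1;
    if (mkdir(upper, 0755) < 0 || mkdir(work, 0755) < 0)
        return -1;
    return mount("overlay", target, "overlay", 0, opts);
}
```

Only the blocks of the erofs image that are actually touched get read
from the loop device; writes land in the tmpfs and are lost on reboot.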

But it could be configurable in future to load a raw erofs partition if
somebody wanted to do that.

Gonna try and do some of the storage-init things as systemd service scripts
soon.


On Mon, 18 Dec 2023, 22:00 Askar Safin,  wrote:

> Hi. Unfortunately, this is not clear enough from
> https://github.com/containers/initoverlayfs how exactly the
> second-stage early filesystem is mounted. So, please, add that
> information to README. Let me describe how I understand this.
>
> First, init program from (small) first-stage early filesystem mounts
> boot/ESP partition, where second-stage early filesystem image (i. e.
> erofs) is located. Then that init program mounts that erofs image.
> Without copying the whole erofs image into memory. In other words, if
> some part of erofs image is not accessed, then not only it is not
> uncompressed, it even is not loaded from disk to memory at all. Is my
> understanding correct?
>
> --
> Askar Safin
>
>


Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-12 Thread Eric Curtin
On Tue, 12 Dec 2023 at 20:35, Nils Kattenbeck  wrote:
>
> Hi, while I have been following this thread passively for now I also
> wanted to chime in.
>
> > (The main reason why sd-stub doesn't actually support erofs-initrds,
> > is that sd-stub also generates initrd cpios on the fly, to pass
> > credentials and system extension images to the kernel, and you can't
> > really mix erofs and cpio initrds into one)
>
> What prevents one from mixing the two (especially given that the
> hypothetical erofs initrd support does not yet exist)?
> Or are you talking about mixing this with your memmap+root=/dev/pmem
> suggestion?
>
> > Then try to optimize the initrd a bit by making it an erofs/memmap
> > thing and so on. And make sure the initrd only contains stuff you
> > always need, so that reading it all into memory is necessary anyway,
> > and hence any approach that tries to run even the initrd off a disk
> > image won't be necessary because you need to read everything anyway.
>
> Having to ensure that the initrd is as small as possible is definitely
> no easy task.
> Furthermore unless one has total control over the devices, or even if
> there are only a few hardware revisions, parts of the initrd might not
> be used.
> Even if everything is the same there are code paths which might not
> be taken during usual operation. An example would be services similar
> to the new systemd-bsod which are only triggered in emergencies.
> Having these in the cpio means that they will always be read and
> decompressed.
> Using sysexts also has the drawback that each and every one of them
> has to be decompressed. I might be mistaken but I expect that this
> will be the case even if the extension-release in the sysext results
> in it being discarded which is obviously another big drawback.
>
> Regardless, even if every single file within the cpio archive (and
> potential sysexts) is used, erofs still has a distinct advantage over
> cpio!
> With cpio everything has to be decompressed and read up front. With
> erofs this is not the case.
> Only the fs header has to be read at first as files are decompressed on demand.
> This means that critical stuff can be started earlier as it does not
> have to wait for decompression of stuff only needed later on.
> For example an initrd-only (i.e. not pivoting root), graphical system
> could start all background services long before the UI starts and
> accesses large asset files.
>
> I agree that this splitting up into another micro-initrd just for some
> storage stuff etc (which I still have not grokked completely) does not
> seem to offer any advantages to what we have today. *However*, I
> certainly think that standardizing and supporting some kind of erofs
> based initrd would gain some advantages.

Are we sure? A bunch of stuff in modern initrds today has nothing to
do with mounting storage. I've proved there's a benefit to that with the
data on the initoverlayfs page: you save ~300ms on systemd start time
on a Raspberry Pi 4 with an SD card, and if you use an NVMe drive over
USB on a Raspberry Pi 4 it's even more, ~500ms. I wouldn't say that's
insignificant. You still get all the functionality of the fully
fledged initramfs when systemd starts, but you save between 300ms and
500ms.

>
> On the other hand this feels like going back to an old ramdisk again.
> This goes beyond my knowledge but based on the kernel docs most
> drawbacks of ramdisks would not apply to an approach with erofs. Also
> maybe the more flexible loopback devices could be used(?) which might
> alleviate some problems.

For the record, this is what we are doing for initoverlayfs at the
moment, mounting "/boot" partition and then loopback. There are
significant advantages as there are few bytes read until you start
using initoverlayfs.

/boot/initramfs-6.5.12-200.fc38.x86_64.img
/boot/initoverlayfs-6.5.12-200.fc38.x86_64.img
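
The pairing of the two files could be expressed as below. This is only
a sketch that assumes the image always lives in /boot and is named
after the kernel release, as in the two example paths; the helper name
is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the initoverlayfs image path for a kernel release string.
 * Returns 0 on success, -1 if the buffer is too small.
 */
int initoverlayfs_path(char *buf, size_t n, const char *release)
{
    int r;

    assert(buf && release && n > 0);
    r = snprintf(buf, n, "/boot/initoverlayfs-%s.img", release);
    return (r < 0 || (size_t)r >= n) ? -1 : 0;
}
```

The release string could come from the release field that uname(2)
fills in, so the running kernel picks its own image.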

>
> -- This block device was of fixed size, so the filesystem mounted on
> it was of fixed size.
>    -> Should not be of concern as it is readonly anyhow.
> -- Using a ram disk also required unnecessarily copying memory from
> the fake block device into the page cache (and copying changes back
> out), as well as creating and destroying dentries.
>    -> (?) This one I am actually not too sure about, as it exceeds my
> knowledge of tmpfs, vfs (and its cache layers), erofs caching, and
> loopback devices.
> -- Plus it needed a filesystem driver (such as ext2) to format and
> interpret this data.
>    -> erofs is already included in most initrds (and is not too big if
> it is not)
>
> Regards, Nils
>



Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-11 Thread Eric Curtin
On Mon, 11 Dec 2023 at 20:59, Luca Boccassi  wrote:
>
> On Mon, 11 Dec 2023 at 20:43, Demi Marie Obenour
>  wrote:
> >
> > > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA512
> >
> > On Mon, Dec 11, 2023 at 08:15:27PM +, Luca Boccassi wrote:
> > > On Mon, 11 Dec 2023 at 17:30, Demi Marie Obenour
> > >  wrote:
> > > >
> > > > On Mon, Dec 11, 2023 at 10:57:58AM +0100, Lennart Poettering wrote:
> > > > > On Fr, 08.12.23 17:59, Eric Curtin (ecur...@redhat.com) wrote:
> > > > >
> > > > > > Here is the boot sequence with initoverlayfs integrated, the
> > > > > > > mini-initramfs contains just enough to get storage drivers loaded and
> > > > > > storage devices initialized. storage-init is a process that is not
> > > > > > designed to replace init, it does just enough to initialize storage
> > > > > > (performs a targeted udev trigger on storage), switches to
> > > > > > initoverlayfs as root and then executes init.
> > > > > >
> > > > > > ```
> > > > > > > fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
> > > > > > >
> > > > > > > fw -> bootloader -> kernel -> storage-init   -> init ->
> > > > > > ```
> > > > >
> > > > > I am not sure I follow what these chains are supposed to mean? Why are
> > > > > there two lines?
> > > > >
> > > > > So, I generally would agree that the current initrd scheme is not
> > > > > ideal, and we have been discussing better approaches. But I am not
> > > > > sure your approach really is useful on generic systems for two
> > > > > reasons:
> > > > >
> > > > > > 1. no security model? you need to authenticate your initrd in
> > > > > >    2023. There's no excuse for not doing that anymore these days. Not
> > > > > >    in automotive, and not anywhere else really.
> > > > > >
> > > > > > 2. no way to deal with complex storage? i.e. people use FDE, want to
> > > > > >    unlock their root disks with TPM2 and similar things. People use
> > > > > >    RAID, LVM, and all that mess.
> > > > >
> > > > > Actually the above are kinda the same problem in a way: you need
> > > > > complex storage, but if you need that you kinda need udev, and
> > > > > services, and then also systemd and all that other stuff, and that's
> > > > > why the system works like the system works right now.
> > > > >
> > > > > Whenever you devise a system like yours by cutting corners, and
> > > > > declaring that you don't want TPM, you don't want signed initrds, you
> > > > > don't want to support weird storage, you just solve your problem in a
> > > > > very specific way, ignoring the big picture. Which is OK, *if* you can
> > > > > actually really work without all that and are willing to maintain the
> > > > > solution for your specific problem only.
> > > > >
> > > > > As I understand you are trying to solve multiple problems at once
> > > > > here, and I think one should start with figuring out clearly what
> > > > > those are before trying to address them, maybe without compromising on
> > > > > security. So my guess is you want to address the following:
> > > > >
> > > > > > 1. You don't want the whole big initrd to be read off disk on every
> > > > > >    boot, but only the parts of it that are actually needed.
> > > > > >
> > > > > > 2. You don't want the whole big initrd to be fully decompressed on every
> > > > > >    boot, but only the parts of it that are actually needed.
> > > > > >
> > > > > > 3. You want to share data between root fs and initrd
> > > > > >
> > > > > > 4. You want to save some boot time by not bringing up an init system
> > > > > >    in the initrd once, then tearing it down again, and starting it
> > > > > >    again from the root fs.
> > > > >
> > > > > For the items listed above I think you can find different solutions
> > > > > which do not necessarily compromise security as much.
> > > > >
>

Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-11 Thread Eric Curtin
On Mon, 11 Dec 2023 at 16:36, Demi Marie Obenour
 wrote:
>
> On Mon, Dec 11, 2023 at 10:57:58AM +0100, Lennart Poettering wrote:
> > On Fr, 08.12.23 17:59, Eric Curtin (ecur...@redhat.com) wrote:
> >
> > > Here is the boot sequence with initoverlayfs integrated, the
> > > mini-initramfs contains just enough to get storage drivers loaded and
> > > storage devices initialized. storage-init is a process that is not
> > > designed to replace init, it does just enough to initialize storage
> > > (performs a targeted udev trigger on storage), switches to
> > > initoverlayfs as root and then executes init.
> > >
> > > ```
> > > fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
> > >
> > > fw -> bootloader -> kernel -> storage-init   -> init ->
> > > ```
> >
> > I am not sure I follow what these chains are supposed to mean? Why are
> > there two lines?
> >
> > So, I generally would agree that the current initrd scheme is not
> > ideal, and we have been discussing better approaches. But I am not
> > sure your approach really is useful on generic systems for two
> > reasons:
> >
> > 1. no security model? you need to authenticate your initrd in
> >    2023. There's no excuse for not doing that anymore these days. Not
> >    in automotive, and not anywhere else really.
> >
> > 2. no way to deal with complex storage? i.e. people use FDE, want to
> >    unlock their root disks with TPM2 and similar things. People use
> >    RAID, LVM, and all that mess.
> >
> > Actually the above are kinda the same problem in a way: you need
> > complex storage, but if you need that you kinda need udev, and
> > services, and then also systemd and all that other stuff, and that's
> > why the system works like the system works right now.
> >
> > Whenever you devise a system like yours by cutting corners, and
> > declaring that you don't want TPM, you don't want signed initrds, you
> > don't want to support weird storage, you just solve your problem in a
> > very specific way, ignoring the big picture. Which is OK, *if* you can
> > actually really work without all that and are willing to maintain the
> > solution for your specific problem only.
> >
> > As I understand you are trying to solve multiple problems at once
> > here, and I think one should start with figuring out clearly what
> > those are before trying to address them, maybe without compromising on
> > security. So my guess is you want to address the following:
> >
> > 1. You don't want the whole big initrd to be read off disk on every
> >    boot, but only the parts of it that are actually needed.
> >
> > 2. You don't want the whole big initrd to be fully decompressed on every
> >    boot, but only the parts of it that are actually needed.
> >
> > 3. You want to share data between root fs and initrd
> >
> > 4. You want to save some boot time by not bringing up an init system
> >    in the initrd once, then tearing it down again, and starting it
> >    again from the root fs.
> >
> > For the items listed above I think you can find different solutions
> > which do not necessarily compromise security as much.
> >
> > So, in the list above you could address the latter three like this:
> >
> > 2. Use an erofs rather than a packed cpio as initrd. Make the boot
> >    loader load the erofs into contiguous memory, then use memmap=X!Y on
> >    the kernel cmdline to synthesize a block device from that, which
> >    you then mount directly (without any initrd) via
> >    root=/dev/pmem0. This means your boot loader will still load the
> >    whole image into memory, but only decompress the bits actually
> >    needed. (It also has some other nice benefits I like, such as an
> >    immutable rootfs, which tmpfs-based initrds don't have.)
> >
> > 3. Simply never transition to the root fs, don't mark the initrds in
> >    systemd's eyes as an initrd (specifically: don't add an
> >    /etc/initrd-release file to it). Instead, just merge resources of
> >    the root fs into your initrd fs via overlayfs. systemd has
> >    infrastructure for this: "systemd-sysext". It takes immutable,
> >    authenticated erofs images (with verity, we call them "DDIs",
> >    i.e. "discoverable disk images") that it overlays into /usr/. [You
> >    could also very nicely combine this approach with systemd's
> >    portable services, and nspawn containers, which operate on the same

Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-11 Thread Eric Curtin
On Mon, 11 Dec 2023 at 12:48, Eric Curtin  wrote:
>
> On Mon, 11 Dec 2023 at 11:51, Lennart Poettering  wrote:
> >
> > On Mo, 11.12.23 11:28, Eric Curtin (ecur...@redhat.com) wrote:
> >
> > > > > For the items listed above I think you can find different solutions
> > > > > which do not necessarily compromise security as much.
> > > > >
> > > > > So, in the list above you could address the latter three like this:
> > > > >
> > > > > 2. Use an erofs rather than a packed cpio as initrd. Make the boot
> > > > >    loader load the erofs into contiguous memory, then use memmap=X!Y on
> > > > >    the kernel cmdline to synthesize a block device from that, which
> > > > >    you then mount directly (without any initrd) via
> > > > >    root=/dev/pmem0. This means your boot loader will still load the
> > > > >    whole image into memory, but only decompress the bits actually
> > > > >    needed. (It also has some other nice benefits I like, such as an
> > > > >    immutable rootfs, which tmpfs-based initrds don't have.)
> > >
> > > What I am unsure about here, is the "make the bootloader load the
> > > erofs into contiguous memory" part. I wonder could we try and use the
> > > existing initramfs data as is.
> >
> > Today's initrds are packed cpio archives of an OS file system
> > hierarchy. What I proposed means you'd have to put the OS file system
> > hierarchy into an erofs image instead. Which is a trivial operation,
> > just unpack and repack.
> >
> > Note that there are two concepts of "initrd" out there.
> >
> > a) from the kernel perspective an initrd/initramfs (which both are
> >    badly named, because it's a tmpfs these days) is that packed cpio
> >    archive that is unpacked into a tmpfs, and then jumped into.
> >
> > b) from systemd's perspective an initrd is an OS image that carries an
> >    /etc/initrd-release file. If that file exists then systemd will not
> >    boot up the system regularly, but instead just prepare everything
> >    that it can transition into some other root fs.
> >
> > Most often in real life, initrds currently qualify under both
> > definitions, but there's no reason to always do this. You can also
> > have images the kernel would consider an initrd, but systemd does not,
> > which is something we use in the "USI" concept, i.e. "unified system
> > images", which are basically UKIs (large UKIs) with a complete rootfs
> > that is the main system of the OS. And you can also do it the other
> > way round, which is potentially what I am suggesting to you here: use
> > an erofs image that would not be considered an initrd by the kernel,
> > but that systemd would consider one, and transition out of.
> >
> > > I dunno if
> > > bootloaders make much assumptions about the format of that data, worst
> > > case scenario we could encapsulate erofs in the initramfs, cpio looking
> > > data.
> >
> > boot loaders generally don't bother with the cpio, it's just "data"
> > for them. Compression algorithms have changed in the past, and it only
> > mattered that the kernel could decompress it, the boot loader doesn't care.
> >
> > > Teach the kernel not to decompress and process the whole
> > > thing and mount it like an erofs alternatively. Does this sound crazy
> > > or reasonable?
> >
> > You are re-inventing the traditional "initrd" logic of the kernel
> > which was a ramdisk (i.e. a block device /dev/ram0), that was filled
> > with some fs of your choice loaded by the boot loader.
>
> Sort of yes, but preferably using that __initramfs_start /
> initrd_start buffer as is without copying any bytes anywhere else and
> without teaching the bootloaders to do things.
>
> The "memmap=" approach you suggested sounds like what we are thinking,
> but do you think we could do this without teaching bootloaders to do
> new things?

Like could we do that with a "initrd3.0=on" karg and it just uses the
__initramfs_start and __initramfs_size to memmap? (that probably
wouldn't be the arg name, it's just for description purposes here,
maybe it's even a build time flag, etc.)
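
To make the idea concrete from the userspace side, here is a sketch of
checking a command line for such a switch. The "initrd3.0=on" name is
the placeholder from above and is not a real kernel parameter; the
helper name is hypothetical, and quoting rules of the real kernel
command line parser are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true if cmdline contains the whitespace-delimited token
 * "key=value" exactly (no prefix or suffix matches).
 */
bool cmdline_has(const char *cmdline, const char *key, const char *value)
{
    size_t klen, vlen;
    const char *p = cmdline;

    assert(cmdline && key && value);
    klen = strlen(key);
    vlen = strlen(value);

    while (*p) {
        size_t tok;

        p += strspn(p, " \t\n");        /* skip separators */
        tok = strcspn(p, " \t\n");      /* length of the next token */
        if (tok == klen + 1 + vlen &&
            strncmp(p, key, klen) == 0 && p[klen] == '=' &&
            strncmp(p + klen + 1, value, vlen) == 0)
            return true;
        p += tok;
    }
    return false;
}
```

The string to check would typically be the contents of /proc/cmdline.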

>
> Although the nice thing about a storage-init like approach is there's
> basically zero copies up front. What storage-init is trying to be, is
> a tool to just call systemd storage things, without also inheriting
> all the systemd stack.
>
> >
> > Lennart
> >
> > --
> > Lennart Poettering, Berlin
> >



Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-11 Thread Eric Curtin
On Mon, 11 Dec 2023 at 11:51, Lennart Poettering  wrote:
>
> On Mo, 11.12.23 11:28, Eric Curtin (ecur...@redhat.com) wrote:
>
> > > > For the items listed above I think you can find different solutions
> > > > which do not necessarily compromise security as much.
> > > >
> > > > So, in the list above you could address the latter three like this:
> > > >
> > > > 2. Use an erofs rather than a packed cpio as initrd. Make the boot
> > > >    loader load the erofs into contiguous memory, then use memmap=X!Y on
> > > >    the kernel cmdline to synthesize a block device from that, which
> > > >    you then mount directly (without any initrd) via
> > > >    root=/dev/pmem0. This means your boot loader will still load the
> > > >    whole image into memory, but only decompress the bits actually
> > > >    needed. (It also has some other nice benefits I like, such as an
> > > >    immutable rootfs, which tmpfs-based initrds don't have.)
> >
> > What I am unsure about here, is the "make the bootloader load the
> > erofs into contiguous memory" part. I wonder could we try and use the
> > existing initramfs data as is.
>
> Today's initrds are packed cpio archives of an OS file system
> hierarchy. What I proposed means you'd have to put the OS file system
> hierarchy into an erofs image instead. Which is a trivial operation,
> just unpack and repack.
>
> Note that there are two concepts of "initrd" out there.
>
> a) from the kernel perspective an initrd/initramfs (which both are
>    badly named, because it's a tmpfs these days) is that packed cpio
>    archive that is unpacked into a tmpfs, and then jumped into.
>
> b) from systemd's perspective an initrd is an OS image that carries an
>    /etc/initrd-release file. If that file exists then systemd will not
>    boot up the system regularly, but instead just prepare everything
>    that it can transition into some other root fs.
>
> Most often in real life, initrds currently qualify under both
> definitions, but there's no reason to always do this. You can also
> have images the kernel would consider an initrd, but systemd does not,
> which is something we use in the "USI" concept, i.e. "unified system
> images", which are basically UKIs (large UKIs) with a complete rootfs
> that is the main system of the OS. And you can also do it the other
> way round, which is potentially what I am suggesting to you here: use
> an erofs image that would not be considered an initrd by the kernel,
> but that systemd would consider one, and transition out of.
>
> > I dunno if
> > bootloaders make much assumptions about the format of that data, worst
> > case scenario we could encapsulate erofs in the initramfs, cpio looking
> > data.
>
> boot loaders generally don't bother with the cpio, it's just "data"
> for them. Compression algorithms have changed in the past, and it only
> mattered that the kernel could decompress it, the boot loader doesn't care.
>
> > Teach the kernel not to decompress and process the whole
> > thing and mount it like an erofs alternatively. Does this sound crazy
> > or reasonable?
>
> You are re-inventing the traditional "initrd" logic of the kernel
> which was a ramdisk (i.e. a block device /dev/ram0), that was filled
> with some fs of your choice loaded by the boot loader.

Sort of yes, but preferably using that __initramfs_start /
initrd_start buffer as is without copying any bytes anywhere else and
without teaching the bootloaders to do things.

The "memmap=" approach you suggested sounds like what we are thinking,
but do you think we could do this without teaching bootloaders to do
new things?

Although the nice thing about a storage-init like approach is there's
basically zero copies up front. What storage-init is trying to be, is
a tool to just call systemd storage things, without also inheriting
all the systemd stack.
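
For reference, the kernel's command-line documentation gives the form
of the parameter as memmap=nn[KMG]!ss[KMG], i.e. size first, then start
address. A small sketch of composing the fragment discussed above; the
helper name and the example numbers are made up, and where a boot
loader would actually place the image is an open question here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format "memmap=<size>M!<start>M root=/dev/pmem0" for an image of
 * size_mib MiB placed at start_mib MiB of physical memory.
 * Returns 0 on success, -1 if the buffer is too small.
 */
int format_pmem_cmdline(char *buf, size_t n,
                        unsigned long long size_mib,
                        unsigned long long start_mib)
{
    int r;

    assert(buf && n > 0);
    r = snprintf(buf, n, "memmap=%lluM!%lluM root=/dev/pmem0",
                 size_mib, start_mib);
    return (r < 0 || (size_t)r >= n) ? -1 : 0;
}
```

For example, a 64 MiB image loaded at the 4 GiB mark would give
"memmap=64M!4096M root=/dev/pmem0".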

>
> Lennart
>
> --
> Lennart Poettering, Berlin
>



Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-11 Thread Eric Curtin
I am also thinking, what is the difference between "make the
bootloader load the erofs into contiguous memory" part and doing
something like storage-init.

They are similar approaches, introduce something in the middle to
handle the erofs.

Is mise le meas/Regards,

Eric Curtin

On Mon, 11 Dec 2023 at 11:28, Eric Curtin  wrote:
>
> On Mon, 11 Dec 2023 at 11:20, Eric Curtin  wrote:
> >
> > On Mon, 11 Dec 2023 at 10:06, Lennart Poettering  wrote:
> > >
> > > On Fr, 08.12.23 17:59, Eric Curtin (ecur...@redhat.com) wrote:
> > >
> > > > Here is the boot sequence with initoverlayfs integrated, the
> > > > mini-initramfs contains just enough to get storage drivers loaded and
> > > > storage devices initialized. storage-init is a process that is not
> > > > designed to replace init, it does just enough to initialize storage
> > > > (performs a targeted udev trigger on storage), switches to
> > > > initoverlayfs as root and then executes init.
> > > >
> > > > ```
> > > > fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
> > > >
> > > > fw -> bootloader -> kernel -> storage-init   -> init ->
> > > > ```
> > >
> > > I am not sure I follow what these chains are supposed to mean? Why are
> > > there two lines?
> >
> > The top line is the filesystem transition, the bottom is more like a
> > process perspective. Will make this clearer in future.
> >
> > >
> > > So, I generally would agree that the current initrd scheme is not
> > > ideal, and we have been discussing better approaches. But I am not
> > > sure your approach really is useful on generic systems for two
> > > reasons:
> > >
> > > 1. no security model? you need to authenticate your initrd in
> > >    2023. There's no excuse for not doing that anymore these days. Not
> > >    in automotive, and not anywhere else really.
> >
> > Yes you are right, there is no excuse, the plan was to mount using
> > dm-verity most likely with the details from the initramfs, but
> > admittedly we had not looked into that in great detail.
> >
> > >
> > > 2. no way to deal with complex storage? i.e. people use FDE, want to
> > >    unlock their root disks with TPM2 and similar things. People use
> > >    RAID, LVM, and all that mess.
> >
> > We had 3 thoughts on this:
> >
> > 1. Just worry about the common use-cases and let everyone else
> > fall back to the approaches we use today.
> > 2. Try and split up systemd to make it even smaller. We do use
> > systemd-udev in the small initramfs storage-init process so far.
> > 3. Reimplement some things? But as little as possible, on a case by
> > case basis, we certainly don't want to fall into the trap of rewriting
> > systemd that's for sure, systemd does these things very well.
> >
> > Tbh, if we try and implement this in kernelspace a lot of these
> > questions go away. You just teach the kernel to deal with the
> > filesystem image early (say erofs or whatever other filesystem) and
> > have that data where initramfs data currently is. You still pay for
> > the initial read, but you still save a bunch of kernel time.
> >
> > >
> > > Actually the above are kinda the same problem in a way: you need
> > > complex storage, but if you need that you kinda need udev, and
> > > services, and then also systemd and all that other stuff, and that's
> > > why the system works like the system works right now.
> >
> > True, but there is also a bunch of stuff in current initrd's today
> > that aren't required to mount basic storage, but are designed around
> > the whole idea of having an early throwaway filesystem.
> >
> > >
> > > Whenever you devise a system like yours by cutting corners, and
> > > declaring that you don't want TPM, you don't want signed initrds, you
> > > don't want to support weird storage, you just solve your problem in a
> > > very specific way, ignoring the big picture. Which is OK, *if* you can
> > > actually really work without all that and are willing to maintain the
> > > solution for your specific problem only.
> > >
> > > As I understand you are trying to solve multiple problems at once
> > > here, and I think one should start with figuring out clearly what
> > > those are before trying to address them, maybe without compromising on
> > > security. So my guess is you 

Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-11 Thread Eric Curtin
On Mon, 11 Dec 2023 at 11:20, Eric Curtin  wrote:
>
> On Mon, 11 Dec 2023 at 10:06, Lennart Poettering  wrote:
> >
> > On Fr, 08.12.23 17:59, Eric Curtin (ecur...@redhat.com) wrote:
> >
> > > Here is the boot sequence with initoverlayfs integrated, the
> > > mini-initramfs contains just enough to get storage drivers loaded and
> > > storage devices initialized. storage-init is a process that is not
> > > designed to replace init, it does just enough to initialize storage
> > > (performs a targeted udev trigger on storage), switches to
> > > initoverlayfs as root and then executes init.
> > >
> > > ```
> > > fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
> > >
> > > fw -> bootloader -> kernel -> storage-init   -> init ->
> > > ```
> >
> > I am not sure I follow what these chains are supposed to mean? Why are
> > there two lines?
>
> The top line is the filesystem transition, the bottom is more like a
> process perspective. Will make this clearer in future.
>
> >
> > So, I generally would agree that the current initrd scheme is not
> > ideal, and we have been discussing better approaches. But I am not
> > sure your approach really is useful on generic systems for two
> > reasons:
> >
> > 1. no security model? you need to authenticate your initrd in
> >    2023. There's no excuse for not doing that anymore these days. Not
> >    in automotive, and not anywhere else really.
>
> Yes you are right, there is no excuse, the plan was to mount using
> dm-verity most likely with the details from the initramfs, but
> admittedly we had not looked into that in great detail.
>
> >
> > 2. no way to deal with complex storage? i.e. people use FDE, want to
> >    unlock their root disks with TPM2 and similar things. People use
> >    RAID, LVM, and all that mess.
>
> We had 3 thoughts on this:
>
> 1. Just worry about the common use-cases and let everyone else
> fall back to the approaches we use today.
> 2. Try and split up systemd to make it even smaller. We do use
> systemd-udev in the small initramfs storage-init process so far.
> 3. Reimplement some things? But as little as possible, on a case by
> case basis, we certainly don't want to fall into the trap of rewriting
> systemd that's for sure, systemd does these things very well.
>
> Tbh, if we try and implement this in kernelspace a lot of these
> questions go away. You just teach the kernel to deal with the
> filesystem image early (say erofs or whatever other filesystem) and
> have that data where initramfs data currently is. You still pay for
> the initial read, but you still save a bunch of kernel time.
>
> >
> > Actually the above are kinda the same problem in a way: you need
> > complex storage, but if you need that you kinda need udev, and
> > services, and then also systemd and all that other stuff, and that's
> > why the system works like the system works right now.
>
> True, but there is also a bunch of stuff in current initrd's today
> that aren't required to mount basic storage, but are designed around
> the whole idea of having an early throwaway filesystem.
>
> >
> > Whenever you devise a system like yours by cutting corners, and
> > declaring that you don't want TPM, you don't want signed initrds, you
> > don't want to support weird storage, you just solve your problem in a
> > very specific way, ignoring the big picture. Which is OK, *if* you can
> > actually really work without all that and are willing to maintain the
> > solution for your specific problem only.
> >
> > As I understand you are trying to solve multiple problems at once
> > here, and I think one should start with figuring out clearly what
> > those are before trying to address them, maybe without compromising on
> > security. So my guess is you want to address the following:
> >
> > 1. You don't want the whole big initrd to be read off disk on every
> >boot, but only the parts of it that are actually needed.
> >
> > 2. You don't want the whole big initrd to be fully decompressed on every
> >boot, but only the parts of it that are actually needed.
> >
> > 3. You want to share data between root fs and initrd
> >
> > 4. You want to save some boot time by not bringing up an init system
> >in the initrd once, then tearing it down again, and starting it
> >again from the root fs.
>
> It's mainly the top 3 that were the goals. And that people have the
> freedom to consider using heavier weight generic libra

Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-11 Thread Eric Curtin
On Mon, 11 Dec 2023 at 10:06, Lennart Poettering  wrote:
>
> On Fr, 08.12.23 17:59, Eric Curtin (ecur...@redhat.com) wrote:
>
> > Here is the boot sequence with initoverlayfs integrated, the
> > mini-initramfs contains just enough to get storage drivers loaded and
> > storage devices initialized. storage-init is a process that is not
> > designed to replace init, it does just enough to initialize storage
> > (performs a targeted udev trigger on storage), switches to
> > initoverlayfs as root and then executes init.
> >
> > ```
> > fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
> >
> > fw -> bootloader -> kernel -> storage-init   -> init ->
> > ```
>
> I am not sure I follow what these chains are supposed to mean? Why are
> there two lines?

The top line is the filesystem transition, the bottom is more like a
process perspective. Will make this clearer in future.

>
> So, I generally would agree that the current initrd scheme is not
> ideal, and we have been discussing better approaches. But I am not
> sure your approach really is useful on generic systems for two
> reasons:
>
> 1. no security model? you need to authenticate your initrd in
>    2023. There's no excuse to not doing that anymore these days. Not
>    in automotive, and not anywhere else really.

Yes, you are right, there is no excuse. The plan was to mount using
dm-verity, most likely with the details from the initramfs, but
admittedly we had not looked into that in great detail.

>
> 2. no way to deal with complex storage? i.e. people use FDE, want to
>unlock their root disks with TPM2 and similar things. People use
>RAID, LVM, and all that mess.

We had 3 thoughts on this:

1. Just worry about the common use-cases and let everyone else
fall back to the approaches we use today.
2. Try and split up systemd to make it even smaller. We do use
systemd-udev in the small initramfs storage-init process so far.
3. Reimplement some things? But as little as possible, on a case by
case basis, we certainly don't want to fall into the trap of rewriting
systemd that's for sure, systemd does these things very well.

Tbh, if we try and implement this in kernelspace a lot of these
questions go away. You just teach the kernel to deal with the
filesystem image early (say erofs or whatever other filesystem) and
have that data where initramfs data currently is. You still pay for
the initial read, but you still save a bunch of kernel time.

>
> Actually the above are kinda the same problem in a way: you need
> complex storage, but if you need that you kinda need udev, and
> services, and then also systemd and all that other stuff, and that's
> why the system works like the system works right now.

True, but there is also a bunch of stuff in current initrds today
that isn't required to mount basic storage, but is designed around
the whole idea of having an early throwaway filesystem.

>
> Whenever you devise a system like yours by cutting corners, and
> declaring that you don't want TPM, you don't want signed initrds, you
> don't want to support weird storage, you just solve your problem in a
> very specific way, ignoring the big picture. Which is OK, *if* you can
> actually really work without all that and are willing to maintain the
> solution for your specific problem only.
>
> As I understand you are trying to solve multiple problems at once
> here, and I think one should start with figuring out clearly what
> those are before trying to address them, maybe without compromising on
> security. So my guess is you want to address the following:
>
> 1. You don't want the whole big initrd to be read off disk on every
>boot, but only the parts of it that are actually needed.
>
> 2. You don't want the whole big initrd to be fully decompressed on every
>boot, but only the parts of it that are actually needed.
>
> 3. You want to share data between root fs and initrd
>
> 4. You want to save some boot time by not bringing up an init system
>in the initrd once, then tearing it down again, and starting it
>again from the root fs.

It's mainly the top 3 that were the goals. And that people have the
freedom to consider using heavier weight generic libraries, tools,
etc. if they want. You want to use Rust (or languages X, Y, Z) to
write something for early boot, go ahead! You'll only pay the cost for
the larger binary if you actually use it. The week I started tinkering
with this, there was a mini-debate on whether we should include glib or
not in the initrd. And we are regularly under pressure to reduce boot
time at the moment.

Number 4 was a convenient way to do an early version of this, stick a
process in between systemd and the kernel. But it turns out, it works
very 

Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-09 Thread Eric Curtin
On Sat, 9 Dec 2023 at 18:12, Luca Boccassi  wrote:
>
> On Sat, 9 Dec 2023 at 17:58, Eric Curtin  wrote:
> >
> > On Sat, 9 Dec 2023 at 17:46, Luca Boccassi  wrote:
> > >
> > > On Sat, 9 Dec 2023 at 17:25, Eric Curtin  wrote:
> > > >
> > > > On Sat, 9 Dec 2023 at 17:19, Luca Boccassi  wrote:
> > > > >
> > > > > On Sat, 9 Dec 2023 at 15:08, Eric Curtin  wrote:
> > > > > >
> > > > > > > On Sat, 9 Dec 2023 at 14:56, Andrei Borzenkov  wrote:
> > > > > > > >
> > > > > > > > On 09.12.2023 17:42, Eric Curtin wrote:
> > > > > > > > > On Sat, 9 Dec 2023 at 12:46, Luca Boccassi  wrote:
> > > > > > > > >>
> > > > > > > > >> On Fri, 8 Dec 2023 at 19:00, Eric Curtin  wrote:
> > > > > > > > >>>
> > > > > > > > >>> We have been working on a new initial filesystem called initoverlayfs.
> > > > > > > > >>> It is a new filesystem that provides a more scalable approach to
> > > > > > > > >>> initial filesystems as opposed to just using initrds. We are writing
> > > > > > > > >>> this RFC to the systemd and dracut mailing lists (feel free to forward
> > > > > > > > >>> to UAPI group also) because although this solution works without
> > > > > > > > >>> changing the code in these projects, it operates in the same area as
> > > > > > > > >>> systemd, udev, dracut, etc. and uses these tools.
> > > > > > > > >>
> > > > > > > > >> It seems to me everything you described already exists? If you want to
> > > > > > > > >> avoid having an initrd -> rootfs transition, you can already do that -
> > > > > > > > >
> > > > > > > > > You need a initrd -> rootfs transition for generic linux operating
> > > > > > > > > systems right?
> > > > > > > >
> > > > > > > > No, you do not. Nothing stops you from running off initramfs (today you
> > > > > > > > do not really have init*RAM Disk* - the content of initrd is unpacked
> > > > > > > > into initramfs.
> > > > > > >
> > > > > > > Apologies if I am misinterpreting this response, I use terms initrd
> > > > > > > and initramfs interchangeably (not technically correct, but it's common
> > > > > > > to do this). The point is to avoid unpacking as much as possible, because
> > > > > > > in many initrds the majority of the software need not be unpacked, but is
> > > > > > > designed to work with throwaway initial filesystems.
> > > > > >
> > > > > > sd-stub already supports having a small initrd shipped in the UKI,
> > > > > > that is extended via sysexts, and systemd already supports running
> > > > > > from it, without any transition to a final rootfs. What else do you
> > > > > > need? What problem is this attempting to solve?
> > > > >
> > > > > I must give sd-stub a try. The bootloader I most commonly work with (and
> > > > > is one of the target platforms this is intended for) isn't UEFI, we need
> > > > > something more portable.
> > >
> > > Do we, though? All modern hardware platforms (and VMs) that matter are
> > > UEFI. Why would any of this be needed for legacy hardware platforms?
> > > The existing mechanisms can work just fine on those until they reach
> > > EOL, they won't stop working.
> >
> > Respectfully, this is not true. Especially on ARM platforms. I would
> > like it to be true, but it's not true today.
>
> Where any of this would actually matter, they mostly do, and where
> they don't one can put together uboot with uefi mode.

When you are trying to improve boot performance, introducing another
layer of bootloader with uboot doesn't help. You also have to port
every hardware platform you encounter to uboot. And if you can solve
the problem somewhere in the Linux stack rather than in the bootloader,
why would we choose to fix it in the bootloader?

>
> > I should have expanded, we are not trying to avoid transitioning to a
> > final rootfs, the goal is to transition to a final rootfs. But not to decompress
> > and copy all the bytes to a tmpfs up front, rather use something like erofs,
> > overlayfs, etc. sysexts uses erofs+overlayfs, but it's designed with
> > a different goal in mind.
>
> In what way is the goal different?

This project basically builds an initrd, but puts it in an
erofs+overlayfs instead (technically it builds a really small
initrd to initialize some basic storage drivers etc. and builds a
second initrd in an erofs format). All existing software that we've
tested "just works" with this approach, including all the systemd
stuff. You can also do transparent decompression with lz4hc. It also
means you don't have to be as afraid of bloating your initial
filesystem, because minimizing initrds is tedious work.
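
As a rough illustration of the "second initrd in an erofs format" step,
the sketch below composes an mkfs.erofs command line that uses lz4hc
compression. The function name erofs_build_cmd and the example paths are
made up for illustration, and this is not necessarily how the
initoverlayfs tooling invokes mkfs.erofs. It only prints the command, so
nothing is written unless you run the printed line yourself.

```shell
#!/bin/sh
# Illustrative only: erofs_build_cmd and the paths passed to it are
# assumptions, not taken from the initoverlayfs sources.
# Usage: erofs_build_cmd <unpacked-initrd-dir> <output-image>
erofs_build_cmd() {
    src_dir="$1"
    out_img="$2"
    # mkfs.erofs takes the destination image first, then the source
    # directory; -zlz4hc selects lz4hc compression for file data.
    printf 'mkfs.erofs -zlz4hc %s %s\n' "$out_img" "$src_dir"
}
```

For example, `erofs_build_cmd /tmp/initrd-tree /tmp/initoverlayfs.img`
prints `mkfs.erofs -zlz4hc /tmp/initoverlayfs.img /tmp/initrd-tree`.
Paths containing spaces would need extra quoting before the printed line
is pasted into a shell.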

>



Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-09 Thread Eric Curtin
On Sat, 9 Dec 2023 at 17:46, Luca Boccassi  wrote:
>
> On Sat, 9 Dec 2023 at 17:25, Eric Curtin  wrote:
> >
> > On Sat, 9 Dec 2023 at 17:19, Luca Boccassi  wrote:
> > >
> > > On Sat, 9 Dec 2023 at 15:08, Eric Curtin  wrote:
> > > >
> > > > > On Sat, 9 Dec 2023 at 14:56, Andrei Borzenkov  wrote:
> > > > > >
> > > > > > On 09.12.2023 17:42, Eric Curtin wrote:
> > > > > > > On Sat, 9 Dec 2023 at 12:46, Luca Boccassi  wrote:
> > > > > > >>
> > > > > > >> On Fri, 8 Dec 2023 at 19:00, Eric Curtin  wrote:
> > > > > > >>>
> > > > > > >>> We have been working on a new initial filesystem called initoverlayfs.
> > > > > > >>> It is a new filesystem that provides a more scalable approach to
> > > > > > >>> initial filesystems as opposed to just using initrds. We are writing
> > > > > > >>> this RFC to the systemd and dracut mailing lists (feel free to forward
> > > > > > >>> to UAPI group also) because although this solution works without
> > > > > > >>> changing the code in these projects, it operates in the same area as
> > > > > > >>> systemd, udev, dracut, etc. and uses these tools.
> > > > > > >>
> > > > > > >> It seems to me everything you described already exists? If you want to
> > > > > > >> avoid having an initrd -> rootfs transition, you can already do that -
> > > > > > >
> > > > > > > You need a initrd -> rootfs transition for generic linux operating
> > > > > > > systems right?
> > > > > >
> > > > > > No, you do not. Nothing stops you from running off initramfs (today you
> > > > > > do not really have init*RAM Disk* - the content of initrd is unpacked
> > > > > > into initramfs.
> > > > >
> > > > > Apologies if I am misinterpreting this response, I use terms initrd
> > > > > and initramfs interchangeably (not technically correct, but it's common
> > > > > to do this). The point is to avoid unpacking as much as possible, because
> > > > > in many initrds the majority of the software need not be unpacked, but is
> > > > > designed to work with throwaway initial filesystems.
> > > >
> > > > sd-stub already supports having a small initrd shipped in the UKI,
> > > > that is extended via sysexts, and systemd already supports running
> > > > from it, without any transition to a final rootfs. What else do you
> > > > need? What problem is this attempting to solve?
> > >
> > > I must give sd-stub a try. The bootloader I most commonly work with (and
> > > is one of the target platforms this is intended for) isn't UEFI, we need
> > > something more portable.
>
> Do we, though? All modern hardware platforms (and VMs) that matter are
> UEFI. Why would any of this be needed for legacy hardware platforms?
> The existing mechanisms can work just fine on those until they reach
> EOL, they won't stop working.

Respectfully, this is not true. Especially on ARM platforms. I would
like it to be true, but it's not true today.

I should have expanded, we are not trying to avoid transitioning to a
final rootfs, the goal is to transition to a final rootfs. But not to decompress
and copy all the bytes to a tmpfs up front, rather use something like erofs,
overlayfs, etc. sysexts uses erofs+overlayfs, but it's designed with
a different goal in mind.

>



Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-09 Thread Eric Curtin
On Sat, 9 Dec 2023 at 17:19, Luca Boccassi  wrote:
>
> On Sat, 9 Dec 2023 at 15:08, Eric Curtin  wrote:
> >
> > On Sat, 9 Dec 2023 at 14:56, Andrei Borzenkov  wrote:
> > >
> > > On 09.12.2023 17:42, Eric Curtin wrote:
> > > > On Sat, 9 Dec 2023 at 12:46, Luca Boccassi  wrote:
> > > >>
> > > >> On Fri, 8 Dec 2023 at 19:00, Eric Curtin  wrote:
> > > >>>
> > > >>> We have been working on a new initial filesystem called initoverlayfs.
> > > >>> It is a new filesystem that provides a more scalable approach to
> > > >>> initial filesystems as opposed to just using initrds. We are writing
> > > >>> this RFC to the systemd and dracut mailing lists (feel free to forward
> > > >>> to UAPI group also) because although this solution works without
> > > >>> changing the code in these projects, it operates in the same area as
> > > >>> systemd, udev, dracut, etc. and uses these tools.
> > > >>
> > > >> It seems to me everything you described already exists? If you want to
> > > >> avoid having an initrd -> rootfs transition, you can already do that -
> > > >
> > > > You need a initrd -> rootfs transition for generic linux operating
> > > > systems right?
> > >
> > > No, you do not. Nothing stops you from running off initramfs (today you
> > > do not really have init*RAM Disk* - the content of initrd is unpacked
> > > into initramfs.
> >
> > Apologies if I am misinterpreting this response, I use terms initrd
> > and initramfs interchangeably (not technically correct, but it's common
> > to do this). The point is to avoid unpacking as much as possible, because
> > in many initrds the majority of the software need not be unpacked, but is
> > designed to work with throwaway initial filesystems.
>
> sd-stub already supports having a small initrd shipped in the UKI,
> that is extended via sysexts, and systemd already supports running
> from it, without any transition to a final rootfs. What else do you
> need? What problem is this attempting to solve?

I must give sd-stub a try. The bootloader I most commonly work with (which
is also one of the target platforms this is intended for) isn't UEFI, so we
need something more portable.

Is mise le meas/Regards,

Eric Curtin

>



Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-09 Thread Eric Curtin
On Sat, 9 Dec 2023 at 15:23, Daan De Meyer  wrote:
>
> > We have been working on a new initial filesystem called initoverlayfs.
> > It is a new filesystem that provides a more scalable approach to
> > initial filesystems as opposed to just using initrds. We are writing
> > this RFC to the systemd and dracut mailing lists (feel free to forward
> > to UAPI group also) because although this solution works without
> > changing the code in these projects, it operates in the same area as
> > systemd, udev, dracut, etc. and uses these tools.
>
> I like the concept of using erofs instead of a compressed cpio and we have
> been discussing doing something similar within systemd. I very much dislike
> the implementation though. I believe this should be implemented natively within
> the Linux kernel instead of hacking around the missing kernel support
> in userspace.
>

I'm not against eventually implementing this in kernelspace, it's
something I've thought about. Implementing in userspace made more
sense to start as a lot of this tooling is much easier to work with in
userspace. It was much faster to write this in userspace to prove the
benefits, test, etc.

It is easier to maintain and develop software in userspace though. So
we would need to have serious thought on why we are pushing this into
kernelspace, what are the benefits, etc.

> If the kernel would add support for supplying an erofs initramfs instead of
> a cpio initramfs, put a writable tmpfs on top of it and would unpack any
> extra cpios provided by the bootloader on top of the tmpfs, then there
> wouldn't be any need for initoverlayfs.

Do we have to unpack extra cpios, or could that be optional? Mounting
erofs with a transient overlay is really fast. Of course if people want
to do that it's fine :)


>
> Before adopting anything like this I believe there should be a serious
> effort to get this implemented within Linux itself. Only if that turns out
> to be impossible should we fall back to exploring userspace only solutions.
>
> Cheers,
>
> Daan
>
>
> On Sat, 9 Dec 2023 at 16:08, Eric Curtin  wrote:
> >
> > On Sat, 9 Dec 2023 at 14:56, Andrei Borzenkov  wrote:
> > >
> > > On 09.12.2023 17:42, Eric Curtin wrote:
> > > > On Sat, 9 Dec 2023 at 12:46, Luca Boccassi  wrote:
> > > >>
> > > >> On Fri, 8 Dec 2023 at 19:00, Eric Curtin  wrote:
> > > >>>
> > > >>> We have been working on a new initial filesystem called initoverlayfs.
> > > >>> It is a new filesystem that provides a more scalable approach to
> > > >>> initial filesystems as opposed to just using initrds. We are writing
> > > >>> this RFC to the systemd and dracut mailing lists (feel free to forward
> > > >>> to UAPI group also) because although this solution works without
> > > >>> changing the code in these projects, it operates in the same area as
> > > >>> systemd, udev, dracut, etc. and uses these tools.
> > > >>
> > > >> It seems to me everything you described already exists? If you want to
> > > >> avoid having an initrd -> rootfs transition, you can already do that -
> > > >
> > > > You need a initrd -> rootfs transition for generic linux operating
> > > > systems right?
> > >
> > > No, you do not. Nothing stops you from running off initramfs (today you
> > > do not really have init*RAM Disk* - the content of initrd is unpacked
> > > into initramfs.
> >
> > Apologies if I am misinterpreting this response, I use terms initrd
> > and initramfs interchangeably (not technically correct, but it's common
> > to do this). The point is to avoid unpacking as much as possible, because
> > in many initrds the majority of the software need not be unpacked, but is
> > designed to work with throwaway initial filesystems.
> >
> > >
> > > > Or else you start building all sorts of things directly
> > > > into the kernel which isn't really scalable.
> > > >
> > >
> > > See above.
> > >
> >
>



Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-09 Thread Eric Curtin
On Sat, 9 Dec 2023 at 14:56, Andrei Borzenkov  wrote:
>
> On 09.12.2023 17:42, Eric Curtin wrote:
> > On Sat, 9 Dec 2023 at 12:46, Luca Boccassi  wrote:
> >>
> >> On Fri, 8 Dec 2023 at 19:00, Eric Curtin  wrote:
> >>>
> >>> We have been working on a new initial filesystem called initoverlayfs.
> >>> It is a new filesystem that provides a more scalable approach to
> >>> initial filesystems as opposed to just using initrds. We are writing
> >>> this RFC to the systemd and dracut mailing lists (feel free to forward
> >>> to UAPI group also) because although this solution works without
> >>> changing the code in these projects, it operates in the same area as
> >>> systemd, udev, dracut, etc. and uses these tools.
> >>
> >> It seems to me everything you described already exists? If you want to
> >> avoid having an initrd -> rootfs transition, you can already do that -
> >
> > You need a initrd -> rootfs transition for generic linux operating
> > systems right?
>
> No, you do not. Nothing stops you from running off initramfs (today you
> do not really have init*RAM Disk* - the content of initrd is unpacked
> into initramfs.

Apologies if I am misinterpreting this response, I use terms initrd
and initramfs interchangeably (not technically correct, but it's common
to do this). The point is to avoid unpacking as much as possible, because
in many initrds the majority of the software need not be unpacked, but is
designed to work with throwaway initial filesystems.

>
> > Or else you start building all sorts of things directly
> > into the kernel which isn't really scalable.
> >
>
> See above.
>



Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-09 Thread Eric Curtin
On Sat, 9 Dec 2023 at 12:46, Luca Boccassi  wrote:
>
> On Fri, 8 Dec 2023 at 19:00, Eric Curtin  wrote:
> >
> > We have been working on a new initial filesystem called initoverlayfs.
> > It is a new filesystem that provides a more scalable approach to
> > initial filesystems as opposed to just using initrds. We are writing
> > this RFC to the systemd and dracut mailing lists (feel free to forward
> > to UAPI group also) because although this solution works without
> > changing the code in these projects, it operates in the same area as
> > systemd, udev, dracut, etc. and uses these tools.
>
> It seems to me everything you described already exists? If you want to
> avoid having an initrd -> rootfs transition, you can already do that -

You need an initrd -> rootfs transition for generic Linux operating
systems, right? Or else you start building all sorts of things directly
into the kernel, which isn't really scalable.

> the initrd code paths run because there's /etc/initrd-release, omit
> that and the transition/phase is avoided. If you want to have an
> overlay with r/o images, you can already do that with sysexts. You'll
> need to reimplement and maintain separately TPM support, LUKS support,
> fido2, etc etc

This is intended to be something you can use with or without sysexts,
not a competing alternative. There will be some reimplementations, but
our hope is to minimize that and leave as much as possible to systemd,
the initoverlayfs stage, etc., where you don't pay the upfront cost for
decompressing and copying all the bytes.

We are open to executing minified systemd libraries/binaries in the
minified initramfs, we do that in the current version of storage-init
by calling systemd udev binaries.

>



[RFC] initoverlayfs - a scalable initial filesystem

2023-12-08 Thread Eric Curtin
We have been working on a new initial filesystem called initoverlayfs.
It is a new filesystem that provides a more scalable approach to
initial filesystems as opposed to just using initrds. We are writing
this RFC to the systemd and dracut mailing lists (feel free to forward
to UAPI group also) because although this solution works without
changing the code in these projects, it operates in the same area as
systemd, udev, dracut, etc. and uses these tools.

Brief context:
--------------

initoverlayfs by default uses transient overlays rather than tmpfs to
create throwaway filesystems early in the boot sequence.

Why?

An initramfs has to be decompressed and copied to a tmpfs up front
before it can be used. This results in a situation where you end up
paying for every byte in an initrd in boot performance, even the ones
you don't use in a given boot.

This leads to a fear of using languages that result in larger binary
sizes in early boot, of reusing libraries, etc. In some cases,
reimplemented minified versions of software components present in the
rootfs are used.

By contrast, initoverlayfs uses erofs (with compression) and
overlayfs to provide the initial filesystem, so you only pay for the
bytes you actually use.

There is also increased pressure from certain industries, like
automotive, to start essential services early in the boot sequence.

Requirements:
-------------

An init system
An initramfs building tool
A device manager
overlayfs

Nothing that you wouldn't find in most Linux distributions today.

Design:
-------

Here is the boot sequence with initoverlayfs integrated. The
mini-initramfs contains just enough to get storage drivers loaded and
storage devices initialized. storage-init is a process that is not
designed to replace init; it does just enough to initialize storage
(performs a targeted udev trigger on storage), switches to
initoverlayfs as root and then executes init.

```
fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs

fw -> bootloader -> kernel -> storage-init   -> init ->
```
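
To make the process line above more concrete, here is a minimal sketch
of the steps storage-init is described as performing: a targeted udev
trigger, mounting the erofs image with a transient overlay on top, and
handing over to init. This is not the real storage-init. The paths, the
variable names, the tmpfs-backed upper layer and the use of switch_root
are all assumptions made for illustration. By default it only prints the
commands (DRY_RUN=1).

```shell
#!/bin/sh
# Hypothetical sketch of the steps described above; not the real storage-init.
# Every path and variable name here is an assumption made for illustration.
IMAGE="${IMAGE:-/boot/initoverlayfs.img}"     # assumed erofs image, /boot already mounted
LOWER="${LOWER:-/run/initoverlayfs/lower}"    # read-only erofs mount point
RW="${RW:-/run/initoverlayfs/rw}"             # tmpfs holding the transient layer
NEWROOT="${NEWROOT:-/run/initoverlayfs/root}" # overlay mount that becomes /
DRY_RUN="${DRY_RUN:-1}"                       # 1 = only print the commands

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

storage_init() {
    # Targeted udev trigger: block devices only, then wait for udev to finish.
    run udevadm trigger --type=devices --action=add --subsystem-match=block
    run udevadm settle
    # Loop-mount the erofs image read-only as the lower layer.
    run mkdir -p "$LOWER" "$RW" "$NEWROOT"
    run mount -t erofs -o ro,loop "$IMAGE" "$LOWER"
    # Transient writable layer: upper and work directories live on a tmpfs.
    run mount -t tmpfs tmpfs "$RW"
    run mkdir -p "$RW/upper" "$RW/work"
    run mount -t overlay -o "lowerdir=$LOWER,upperdir=$RW/upper,workdir=$RW/work" overlay "$NEWROOT"
    # Switch to the overlay as root and execute init (replaces this process).
    run exec switch_root "$NEWROOT" /sbin/init
}
```

Calling `storage_init` as-is prints the eight commands in order. Setting
`DRY_RUN=0` would execute them, which only makes sense as PID 1 inside
an initramfs.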

Benefits:
---------

Scalability: You can put less emphasis on keeping this initial
filesystem small as you will only pay for the bytes you read. This is
probably a bigger benefit than the raw performance described in the
next point.

Performance: As this minifies the initramfs to contain only the most
basic storage initialization tasks, Linux userspace starts earlier
than it would using an initramfs alone. All the other software that
requires an early throwaway filesystem is left to be executed in the
initoverlayfs. In the case of a Raspberry Pi 4 with an SD card, it
leads to systemd starting ~300ms faster, and in the case of a Raspberry
Pi 4 with an NVMe SSD over USB it leads to systemd starting ~500ms
faster. On some devices, starting Linux userspace early can expose a
slowly initializing storage driver and lead to a slower boot, because
with just an initramfs this slow driver is masked by the time spent on
decompression and copying. But a computer is only as fast as its
slowest component, so if you care about super fast boots, you need to
optimize your storage drivers.

Flexibility: It is now easier to consider using fatter languages like
Rust, etc. Using libraries like graphics libraries, camera libraries,
libevent, glib, C++, etc. in early boot can also be considered, as you
don't have to decompress and copy this data up front. This also leads
to initrd software that is easier to maintain, with more consolidation
between rootfs implementations and initial filesystem implementations
of components.

Changes required in other projects:
-----------------------------------

There are no major changes required in other projects. Tools like
systemd-analyze might need to be updated to recognize this boot
sequence more accurately, because they have no awareness of
initoverlayfs.

Future plans:
-------------

We intend to propose this to Fedora, CentOS Stream, ostree and
non-ostree variants as we continue this project.

Feel free to try:
-----------------

It should work on most standard 3 partition non-ostree Fedora and
CentOS 9 installs (note: the CentOS 9 kernel does not support erofs
compression, so Fedora is a better playground today). It's still in
an alpha/beta state I guess, although I successfully dogfood this on
my laptop and we have tried this on a couple of different pieces of
hardware and VMs. Maybe run this on a non-critical piece of hardware
or a VM for the next few weeks if you want to try :)
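
Since erofs compression support differs between kernels, a quick check
before trying this can save some time. The helper below is only a
sketch: the function name is made up, and it assumes the distribution
ships its kernel config at /boot/config-$(uname -r), which is common but
not universal. It looks for CONFIG_EROFS_FS_ZIP, the kernel option that
enables erofs data compression.

```shell
#!/bin/sh
# Illustrative helper (the function name is made up): succeed if the given
# kernel config file enables erofs compression via CONFIG_EROFS_FS_ZIP.
# Usage: erofs_zip_enabled [config-file]
erofs_zip_enabled() {
    # Default path is a common distro location; it may differ on your system.
    config="${1:-/boot/config-$(uname -r)}"
    grep -q '^CONFIG_EROFS_FS_ZIP=y' "$config" 2>/dev/null
}
```

For example: `erofs_zip_enabled && echo "erofs compression available"`.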

git repo:

https://github.com/containers/initoverlayfs

Also check out the README.md, there are some graphs and other information there:

https://github.com/containers/initoverlayfs/blob/main/README.md

rpm available in copr:

dnf copr enable @centos-automotive-sig/next
dnf install initoverlayfs
initoverlayfs-install

Is mise le meas/Regards,

Eric Curtin