Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-18 Thread Eric Curtin
Yes, your understanding is correct. I'm off at the moment, but we will
try to open a PR sometime to explain it better.

By the way, I'd also happily review your PR if you think you could
explain it better.

At the moment it's a loopback-mounted file from /boot, mounted as erofs
with a transient overlay on top. There is basically a corresponding
initoverlayfs file for each initramfs file.

But it could be configurable in future to load a raw erofs partition if
somebody wanted to do that.
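
As a rough illustration of that layout (not the actual storage-init code),
the mounts could be planned as below. The mount point directories and the
function name are hypothetical; only the /boot/initoverlayfs-<release>.img
naming comes from this thread. The sketch just builds the mount(8) command
lines, it does not run them, and it assumes the mount points already exist.

```python
from pathlib import Path


def initoverlayfs_mount_plan(kernel_release: str,
                             boot: Path = Path("/boot"),
                             lower: Path = Path("/run/initoverlayfs/lower"),
                             scratch: Path = Path("/run/initoverlayfs/scratch"),
                             target: Path = Path("/run/initoverlayfs/root")) -> list[list[str]]:
    """Return command lines for: loop-mounted erofs + transient overlay.

    All directory names are hypothetical placeholders for illustration.
    """
    image = boot / f"initoverlayfs-{kernel_release}.img"
    upper = scratch / "upper"
    work = scratch / "work"
    return [
        # Loopback-mount the image read-only as erofs. Blocks are only read
        # from disk when something actually accesses them.
        ["mount", "-t", "erofs", "-o", "loop,ro", str(image), str(lower)],
        # A tmpfs holds the transient writable layer (lost at reboot).
        ["mount", "-t", "tmpfs", "tmpfs", str(scratch)],
        # overlayfs needs upperdir and workdir on the same filesystem.
        ["mkdir", "-p", str(upper), str(work)],
        ["mount", "-t", "overlay", "overlay", "-o",
         f"lowerdir={lower},upperdir={upper},workdir={work}", str(target)],
    ]


if __name__ == "__main__":
    for argv in initoverlayfs_mount_plan("6.5.12-200.fc38.x86_64"):
        print(" ".join(argv))
```

Loading a raw erofs partition instead would only change the first command:
the source would be a block device and the "loop" option would go away.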

Gonna try and do some of the storage-init things as systemd service scripts
soon.


On Mon, 18 Dec 2023, 22:00 Askar Safin wrote:

> Hi. Unfortunately, it is not clear enough from
> https://github.com/containers/initoverlayfs how exactly the
> second-stage early filesystem is mounted. So, please, add that
> information to the README. Let me describe how I understand this.
>
> First, the init program from the (small) first-stage early filesystem
> mounts the boot/ESP partition, where the second-stage early filesystem
> image (i.e. erofs) is located. Then that init program mounts that erofs
> image, without copying the whole erofs image into memory. In other
> words, if some part of the erofs image is not accessed, then not only is
> it not decompressed, it is not even loaded from disk into memory at
> all. Is my understanding correct?
>
> --
> Askar Safin
>
>


Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-16 Thread Lennart Poettering
On Thu, 14.12.23 02:17, Nils Kattenbeck (nilskem...@gmail.com) wrote:

> On Wed, Dec 13, 2023 at 10:03 AM Lennart Poettering wrote:
> >
> > On Tue, 12.12.23 23:01, Nils Kattenbeck (nilskem...@gmail.com) wrote:
> >
> > > > sysexts are erofs or squashfs file systems with verity backing. Only
> > > > the sectors you access are decompressed.
> > >
> > > Okay I forgot that they were erofs based and mentioned cpio archives
> > > so I assumed they would be one.
> > > Do they need to be fully read from disk to generate the cpio archive?
> >
> > erofs is a file system, cpio is a serialized archive. Two different
> > things. The discussion here is whether to pass the initrd to the
> > kernel as one or the other. But no one is suggesting to convert one to
> > the other at boot time.
>
> I was referring to the following line from sd-stub's man page: "The
> following resources are passed as initrd cpio archives to the booted
> kernel: [...] /.extra/sysext/*.raw [...]". I assume the initrd
> containing the sysexts has to be created at some point?

These cpios are created on-the-fly and placed into memory and passed
to the invoked kernel. And yes, for that the data they contain needs
to be read off disk first.
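
To illustrate why the data has to be read first, here is a small sketch that
builds a "newc" cpio archive in memory. This is not sd-stub's code (sd-stub
is an EFI program written in C); the function names, the default directory
and the file modes are assumptions chosen for the example. The point is that
every payload byte ends up inside the archive, so it must already be in
memory when the archive is assembled.

```python
import io


def _newc_entry(name: str, data: bytes, mode: int, ino: int) -> bytes:
    """Serialize one entry in the "newc" cpio format."""
    name_bytes = name.encode() + b"\0"      # namesize counts the trailing NUL
    fields = (
        ino, mode, 0, 0,                    # ino, mode, uid, gid
        1, 0, len(data),                    # nlink, mtime, filesize
        0, 0, 0, 0,                         # dev major/minor, rdev major/minor
        len(name_bytes), 0,                 # namesize, check (unused in newc)
    )
    out = b"070701" + b"".join(b"%08X" % f for f in fields) + name_bytes
    out += b"\0" * (-len(out) % 4)          # pad header + name to 4 bytes
    out += data
    out += b"\0" * (-len(out) % 4)          # pad file data to 4 bytes
    return out


def build_cpio(files: dict[str, bytes], directory: str = ".extra/sysext") -> bytes:
    """Pack already-loaded file contents under `directory` into one archive."""
    buf = io.BytesIO()
    ino = 1
    parts = directory.split("/")
    # Parent directories need their own entries before the files inside them.
    for depth in range(1, len(parts) + 1):
        buf.write(_newc_entry("/".join(parts[:depth]), b"", 0o040555, ino))
        ino += 1
    for name, data in files.items():
        buf.write(_newc_entry(f"{directory}/{name}", data, 0o100444, ino))
        ino += 1
    buf.write(_newc_entry("TRAILER!!!", b"", 0, 0))   # end-of-archive marker
    return buf.getvalue()


if __name__ == "__main__":
    archive = build_cpio({"demo.raw": b"pretend this is a sysext image"})
    print(len(archive), "bytes")
```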

Lennart

--
Lennart Poettering, Berlin


Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-13 Thread Nils Kattenbeck
On Wed, Dec 13, 2023 at 10:03 AM Lennart Poettering wrote:
>
> On Tue, 12.12.23 23:01, Nils Kattenbeck (nilskem...@gmail.com) wrote:
>
> > > sysexts are erofs or squashfs file systems with verity backing. Only
> > > the sectors you access are decompressed.
> >
> > Okay I forgot that they were erofs based and mentioned cpio archives
> > so I assumed they would be one.
> > Do they need to be fully read from disk to generate the cpio archive?
>
> erofs is a file system, cpio is a serialized archive. Two different
> things. The discussion here is whether to pass the initrd to the
> kernel as one or the other. But no one is suggesting to convert one to
> the other at boot time.

I was referring to the following line from sd-stub's man page: "The
following resources are passed as initrd cpio archives to the booted
kernel: [...] /.extra/sysext/*.raw [...]". I assume the initrd
containing the sysexts has to be created at some point?


Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-13 Thread Lennart Poettering
On Tue, 12.12.23 23:01, Nils Kattenbeck (nilskem...@gmail.com) wrote:

> > sysexts are erofs or squashfs file systems with verity backing. Only
> > the sectors you access are decompressed.
>
> Okay I forgot that they were erofs based and mentioned cpio archives
> so I assumed they would be one.
> Do they need to be fully read from disk to generate the cpio archive?

erofs is a file system, cpio is a serialized archive. Two different
things. The discussion here is whether to pass the initrd to the
kernel as one or the other. But no one is suggesting to convert one to
the other at boot time.

Lennart

--
Lennart Poettering, Berlin


Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-12 Thread Nils Kattenbeck
On Tue, Dec 12, 2023 at 10:02 PM Lennart Poettering wrote:
>
> If you have 7 cpio initrds then the kernel will allocate a tmpfs and
> unpack them all into it, one after the other, on top of each other,
> and then jump into the result.
>
> If you have an erofs and 7 cpio initrds, what are you going to do? You
> cannot extract into an erofs, it's immutable. You'd need something
> like overlayfs, but that would require (at least for now) an
> additional step in userspace, which is something to avoid.
>
> Alternatively (and preferred by me), the kernel would support a mode
> where it would unpack any cpios it gets into a tmpfs, and then pass an
> fsopen() fd to that to the executable it then invokes from the erofs.
> The executable could then mount that somewhere if it wants. But this
> would require a kernel patch.

Such a kernel patch would likely be the more advanced method.
I also saw that they have now written to the LKML to potentially discuss
something like this.
The method with an overlayfs would likely be easier for init systems
to use but also less customizable.

> > Even if everything is the same there are code paths which might not
> > be taken during usual operation. An example would be services similar
> > to the new systemd-bsod which are only triggered in emergencies.
> > Having these in the cpio means that they will always be read and
> > decompressed.
>
> systemd-bsod is tiny though, less than 8K compressed here. Not sure it
> is a good example.

Yes, that is right, though it is the first and most universal thing
which came to mind.
A better example would be something like a fleet management SDK (in
Java or a similar language with a runtime) which phones home to a
management server, indicating a boot failure and publishing crash logs.

> > Using sysexts also has the drawback that each and every one of them
> > has to be decompressed. I might be mistaken but I expect that this
> > will be the case even if the extension-release in the sysext results
> > in it being discarded which is obviously another big drawback.
>
> sysexts are erofs or squashfs file systems with verity backing. Only
> the sectors you access are decompressed.

Okay I forgot that they were erofs based and mentioned cpio archives
so I assumed they would be one.
Do they need to be fully read from disk to generate the cpio archive?

> Lennart
>
> --
> Lennart Poettering, Berlin


Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-12 Thread Lennart Poettering
On Tue, 12.12.23 21:34, Nils Kattenbeck (nilskem...@gmail.com) wrote:

> Hi, while I have been following this thread passively for now I also
> wanted to chime in.
>
> > (The main reason why sd-stub doesn't actually support erofs-initrds,
> > is that sd-stub also generates initrd cpios on the fly, to pass
> > credentials and system extension images to the kernel, and you can't
> > really mix erofs and cpio initrds into one)
>
> What prevents one from mixing the two (especially given that the
> hypothetical erofs initrd support does not yet exist)?
> Or are you talking about mixing this with your memmap+root=/dev/pmem
> suggestion?

If you have 7 cpio initrds then the kernel will allocate a tmpfs and
unpack them all into it, one after the other, on top of each other,
and then jump into the result.

If you have an erofs and 7 cpio initrds, what are you going to do? You
cannot extract into an erofs, it's immutable. You'd need something
like overlayfs, but that would require (at least for now) an
additional step in userspace, which is something to avoid.
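
The two cases so far can be modelled in a few lines. This is a toy model
only, with plain dicts standing in for filesystems and hypothetical function
names: stacking cpios is "later archive wins" in one writable tree, while an
immutable erofs needs a separate writable layer in front of it, which is
what overlayfs provides.

```python
from collections import ChainMap
from types import MappingProxyType


def unpack_cpios(archives: list[dict[str, bytes]]) -> dict[str, bytes]:
    """Model a single tmpfs into which the archives are unpacked in order."""
    rootfs: dict[str, bytes] = {}
    for archive in archives:
        rootfs.update(archive)          # a later archive overwrites earlier files
    return rootfs


def overlay(erofs_image: dict[str, bytes], cpios: list[dict[str, bytes]]) -> ChainMap:
    """Model erofs plus cpios: the image stays read-only, extras sit on top."""
    lower = MappingProxyType(erofs_image)   # immutable view, like erofs
    upper = unpack_cpios(cpios)             # writable, tmpfs-like layer
    return ChainMap(upper, lower)           # lookups hit the upper layer first


if __name__ == "__main__":
    root = overlay({"/sbin/init": b"init", "/etc/conf": b"base"},
                   [{"/etc/conf": b"from-cpio"}])
    print(root["/etc/conf"], root["/sbin/init"])
```

Writes to the ChainMap land in the upper dict only, which mirrors how the
erofs itself is never modified.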

Alternatively (and preferred by me), the kernel would support a mode
where it would unpack any cpios it gets into a tmpfs, and then pass an
fsopen() fd to that to the executable it then invokes from the erofs.
The executable could then mount that somewhere if it wants. But this
would require a kernel patch.

> Even if everything is the same there are code paths which might not
> be taken during usual operation. An example would be services similar
> to the new systemd-bsod which are only triggered in emergencies.
> Having these in the cpio means that they will always be read and
> decompressed.

systemd-bsod is tiny though, less than 8K compressed here. Not sure it
is a good example.

> Using sysexts also has the drawback that each and every one of them
> has to be decompressed. I might be mistaken but I expect that this
> will be the case even if the extension-release in the sysext results
> in it being discarded which is obviously another big drawback.

sysexts are erofs or squashfs file systems with verity backing. Only
the sectors you access are decompressed.

Lennart

--
Lennart Poettering, Berlin


Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-12 Thread Eric Curtin
On Tue, 12 Dec 2023 at 20:35, Nils Kattenbeck wrote:
>
> Hi, while I have been following this thread passively for now I also
> wanted to chime in.
>
> > (The main reason why sd-stub doesn't actually support erofs-initrds,
> > is that sd-stub also generates initrd cpios on the fly, to pass
> > credentials and system extension images to the kernel, and you can't
> > really mix erofs and cpio initrds into one)
>
> What prevents one from mixing the two (especially given that the
> hypothetical erofs initrd support does not yet exist)?
> Or are you talking about mixing this with your memmap+root=/dev/pmem
> suggestion?
>
> > Then try to optimize the initrd a bit by making it an erofs/memmap
> > thing and so on. And make sure the initrd only contains stuff you
> > always need, so that reading it all into memory is necessary anyway,
> > and hence any approach that tries to run even the initrd off a disk
> > image won't be necessary because you need to read everything anyway.
>
> Having to ensure that the initrd is as small as possible is definitely
> no easy task.
> Furthermore unless one has total control over the devices, or even if
> there are only a few hardware revisions, parts of the initrd might not
> be used.
> Even if everything is the same there are code paths which might not
> be taken during usual operation. An example would be services similar
> to the new systemd-bsod which are only triggered in emergencies.
> Having these in the cpio means that they will always be read and
> decompressed.
> Using sysexts also has the drawback that each and every one of them
> has to be decompressed. I might be mistaken but I expect that this
> will be the case even if the extension-release in the sysext results
> in it being discarded which is obviously another big drawback.
>
> Regardless, even if every single file within the cpio archive (and
> potential sysexts) is used, erofs still has a distinct advantage over
> cpio!
> With cpio everything has to be decompressed and read up front. With
> erofs this is not the case.
> Only the fs header has to be read at first as files are decompressed on
> demand.
> This means that critical stuff can be started earlier as it does not
> have to wait for decompression of stuff only needed later on.
> For example, an initrd-only (i.e. not pivoting root) graphical system
> could start all background services long before the UI starts and
> accesses large asset files.
>
> I agree that this splitting up into another micro-initrd just for some
> storage stuff etc (which I still have not grokked completely) does not
> seem to offer any advantages to what we have today. *However*, I
> certainly think that standardizing and supporting some kind of erofs
> based initrd would gain some advantages.

Are we sure? A bunch of stuff in modern initrds today has nothing to
do with mounting storage. I've shown there's a benefit to that with the
data on the initoverlayfs page: you save ~300ms on systemd start time
on a Raspberry Pi 4 with an SD card, and with an NVMe drive over USB
on a Raspberry Pi 4 it's even more, ~500ms. I wouldn't say that's
insignificant. You still get all the functionality of the fully
fledged initramfs when systemd starts, but you save between 300ms and
500ms.

>
> On the other hand this feels like going back to an old ramdisk again.
> This goes beyond my knowledge but based on the kernel docs most
> drawbacks of ramdisks would not apply to an approach with erofs. Also
> maybe the more flexible loopback devices could be used(?) which might
> alleviate some problems.

For the record, this is what we are doing for initoverlayfs at the
moment: mounting the "/boot" partition and then loopback-mounting the
image. There are significant advantages, as only a few bytes are read
before you start using initoverlayfs.

/boot/initramfs-6.5.12-200.fc38.x86_64.img
/boot/initoverlayfs-6.5.12-200.fc38.x86_64.img

>
> -- This block device was of fixed size, so the filesystem mounted on
> it was of fixed size.
>    -> Should not be of concern as it is readonly anyhow.
> -- Using a ram disk also required unnecessarily copying memory from
> the fake block device into the page cache (and copying changes back
> out), as well as creating and destroying dentries.
>    -> (?) This one I am actually not too sure about and it exceeds my
> knowledge of tmpfs, vfs (and its cache layers), erofs caching, and
> loopback devices.
> -- Plus it needed a filesystem driver (such as ext2) to format and
> interpret this data.
>    -> erofs is already included in most initrds (and is not too big if
> it is not)
>
> Regards, Nils
>



Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-12 Thread Nils Kattenbeck
Hi, while I have been following this thread passively for now I also
wanted to chime in.

> (The main reason why sd-stub doesn't actually support erofs-initrds,
> is that sd-stub also generates initrd cpios on the fly, to pass
> credentials and system extension images to the kernel, and you can't
> really mix erofs and cpio initrds into one)

What prevents one from mixing the two (especially given that the
hypothetical erofs initrd support does not yet exist)?
Or are you talking about mixing this with your memmap+root=/dev/pmem suggestion?

> Then try to optimize the initrd a bit by making it an erofs/memmap
> thing and so on. And make sure the initrd only contains stuff you
> always need, so that reading it all into memory is necessary anyway,
> and hence any approach that tries to run even the initrd off a disk
> image won't be necessary because you need to read everything anyway.

Having to ensure that the initrd is as small as possible is definitely
no easy task.
Furthermore unless one has total control over the devices, or even if
there are only a few hardware revisions, parts of the initrd might not
be used.
Even if everything is the same there are code paths which might not
be taken during usual operation. An example would be services similar
to the new systemd-bsod which are only triggered in emergencies.
Having these in the cpio means that they will always be read and
decompressed.
Using sysexts also has the drawback that each and every one of them
has to be decompressed. I might be mistaken but I expect that this
will be the case even if the extension-release in the sysext results
in it being discarded which is obviously another big drawback.

Regardless, even if every single file within the cpio archive (and
potential sysexts) is used, erofs still has a distinct advantage over
cpio!
With cpio everything has to be decompressed and read up front. With
erofs this is not the case.
Only the fs header has to be read at first as files are decompressed on demand.
This means that critical stuff can be started earlier as it does not
have to wait for decompression of stuff only needed later on.
For example, an initrd-only (i.e. not pivoting root) graphical system
could start all background services long before the UI starts and
accesses large asset files.
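
A toy model of that difference, for illustration only: the class names are
made up, zlib stands in for whatever compressor is used, and real erofs
compresses in clusters rather than per file. The cpio-like archive inflates
everything on first access, the erofs-like image inflates only what is read.

```python
import zlib


class UpfrontArchive:
    """cpio-like: one compressed stream, fully inflated before first use."""

    def __init__(self, files: dict[str, bytes]):
        self._index = {}
        offset = 0
        for name, data in files.items():
            self._index[name] = (offset, len(data))
            offset += len(data)
        self._blob = zlib.compress(b"".join(files.values()))
        self._unpacked = None
        self.bytes_inflated = 0

    def read(self, name: str) -> bytes:
        if self._unpacked is None:           # first access pays for everything
            self._unpacked = zlib.decompress(self._blob)
            self.bytes_inflated += len(self._unpacked)
        offset, size = self._index[name]
        return self._unpacked[offset:offset + size]


class OnDemandImage:
    """erofs-like: each file is compressed separately, inflated when read."""

    def __init__(self, files: dict[str, bytes]):
        self._compressed = {n: zlib.compress(d) for n, d in files.items()}
        self.bytes_inflated = 0

    def read(self, name: str) -> bytes:
        data = zlib.decompress(self._compressed[name])
        self.bytes_inflated += len(data)
        return data


if __name__ == "__main__":
    files = {"init": b"i" * 1000, "assets": b"a" * 100_000}
    cpio_like, erofs_like = UpfrontArchive(files), OnDemandImage(files)
    cpio_like.read("init")
    erofs_like.read("init")
    print(cpio_like.bytes_inflated, erofs_like.bytes_inflated)
```

Starting "init" costs the full archive in the first model and only the size
of "init" in the second; the large asset file is untouched until needed.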

I agree that this splitting up into another micro-initrd just for some
storage stuff etc (which I still have not grokked completely) does not
seem to offer any advantages to what we have today. *However*, I
certainly think that standardizing and supporting some kind of erofs
based initrd would gain some advantages.

On the other hand this feels like going back to an old ramdisk again.
This goes beyond my knowledge but based on the kernel docs most
drawbacks of ramdisks would not apply to an approach with erofs. Also
maybe the more flexible loopback devices could be used(?) which might
alleviate some problems.

-- This block device was of fixed size, so the filesystem mounted on
it was of fixed size.
   -> Should not be of concern as it is readonly anyhow.
-- Using a ram disk also required unnecessarily copying memory from
the fake block device into the page cache (and copying changes back
out), as well as creating and destroying dentries.
   -> (?) This one I am actually not too sure about and it exceeds my
knowledge of tmpfs, vfs (and its cache layers), erofs caching, and
loopback devices.
-- Plus it needed a filesystem driver (such as ext2) to format and
interpret this data.
   -> erofs is already included in most initrds (and is not too big if
it is not)

Regards, Nils


Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-12 Thread Demi Marie Obenour
On Tue, Dec 12, 2023 at 06:40:32PM +0100, Lennart Poettering wrote:
> On Mon, 11.12.23 12:48, Eric Curtin (ecur...@redhat.com) wrote:
> 
> > Although the nice thing about a storage-init like approach is there's
> > basically zero copies up front. What storage-init is trying to be, is
> > a tool to just call systemd storage things, without also inheriting
> > all the systemd stack.
> 
> Just to make this clear: using things like systemd-cryptsetup outside
> of the systemd stack is not going to work once you leave trivial
> setups. i.e. the TPM hookup involves multiple services these days, and
> it's not going to get any simpler. i.e. systemd-tpm2-setup,
> systemd-pcrextend, systemd-pcrlock and so on. I am sorry, but doing
> reasonable disk encryption with TPM involved means you either buy into
> the whole systemd offer (i.e. with the service manager) or you have to
> rewrite your own systemd.
> 
> But maybe I am misunderstanding what you are saying here.

I think a key factor here is that the initial suggestion was for
automotive use cases.  One can have a vastly simpler system if one is
willing to deliver hardware-specific images, rather than trying to have
a single image that supports many different hardware models.  Automotive
and other embedded systems understandably do not want to pay for
complexity that they do not need, and which is present to support
features (such as supporting arbitrary hardware) they will never use.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab



Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-12 Thread Stephen Smoogen
On Tue, 12 Dec 2023 at 12:38, Lennart Poettering wrote:

> On Mon, 11.12.23 12:48, Eric Curtin (ecur...@redhat.com) wrote:
>
> > Sort of yes, but preferably using that __initramfs_start /
> > initrd_start buffer as is without copying any bytes anywhere else and
> > without teaching the bootloaders to do things.
> >
> > The "memmap=" approach you suggested sounds like what we are thinking,
> > but do you think we could do this without teaching bootloaders to do
> > new things?
>
> Well, in a standard UEFI world it would suffice to teach the memmap=
> logic to the stub that is glued in front of the kernel. For example,
> make sd-stub find the erofs initrd in the UKI, then trivially
> synthesize a memmap= switch and append it to the kernel command line.
>
> but of course, you don't believe in UEFI or good boot loaders, so you
> kinda dug your own grave here...
>
>
To clarify here: it is not that we don't believe in UEFI or good boot
loaders; it is more that the various hardware being tasked with these
scenarios does not come with it. This is more a matter of trying to make
the best of the ingredients we have, and realizing that what we end up
with will not be as palatable as we wished. We all know that having UEFI
or coreboot would make this so much easier and better, but the board
designers would have had to realize that nearly a decade ago, since that
is when the initial board designs seem to have been chosen. Even if they
realized it at this moment, we would still be dealing with this for a
while.

At this point it isn't that we are trying to dig this grave any deeper;
we are trying to come up with ways to dig out of it :). Some of the
proposed solutions may not do that, but it is what is being tried.



> (The main reason why sd-stub doesn't actually support erofs-initrds,
> is that sd-stub also generates initrd cpios on the fly, to pass
> credentials and system extension images to the kernel, and you can't
> really mix erofs and cpio initrds into one)
>
> Lennart
>
> --
> Lennart Poettering, Berlin
>
>

-- 
Stephen Smoogen, Red Hat Automotive
Let us be kind to one another, for most of us are fighting a hard battle.
-- Ian MacClaren


Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-12 Thread Lennart Poettering
On Mon, 11.12.23 17:03, Eric Curtin (ecur...@redhat.com) wrote:

> A generic approach is hard, I think it's worth discussing which type of boots
> you should actually care about milliseconds of performance for. It would be
> nice if we had an init system that contained the binary data to do the minimum
> for standard Fedora, Debian installs and everything else was an extension
> whether that's sysexts, dlopen, a new binary to execute etc.
>
> If the network is ingrained in your boot stack like this, I'm
> guessing you probably don't care about boot performance.

Uh, I am not sure that's really true. People boot up VMs on demand,
based on network traffic. They sure care about latency and boot
times. I mean people care about firecracker and these things precisely
because it brings the off-to-IP time to a minimum.

> Automotive has an expectation for really fast boots, like 2 seconds, in
> standard desktops installs there's some expectation as you interface
> directly with a human, but for other installs how much expectation is there?

AFAIR in particular in cars there's quite some functionality you
probably want to move very early in boot. Which yells to me that you
want a service manager super early. Which again suggests to me that
the first initrd that runs should probably already cover that.

If I were you I'd probably focus on a design like this: ship a basic
systemd in an initrd. Complete enough to find the harddisk, and to run
the other services that are absolutely necessary this early. Then,
once you found the disk, look for sysext images on it, and apply them
all on top of the initrd's root fs you are already running with. Never
transition anywhere else.

Then try to optimize the initrd a bit by making it an erofs/memmap
thing and so on. And make sure the initrd only contains stuff you
always need, so that reading it all into memory is necessary anyway,
and hence any approach that tries to run even the initrd off a disk
image won't be necessary because you need to read everything anyway.

Lennart

--
Lennart Poettering, Berlin


Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-12 Thread Lennart Poettering
On Mon, 11.12.23 11:28, Demi Marie Obenour (d...@invisiblethingslab.com) wrote:

> I don't think this is "a pretty specific solution to one set of devices"
> _at all_.  To the contrary, it is _exactly_ what I want to see desktop
> systems moving to in the future.
>
> It solves the problem of large firmware images.  It solves the problem
> of device-specific configuration, because one can use a file on the EFI
> system partition that is read by userspace and either treated as
> untrusted or TPM-signed.  It means that one can have a complete set of
> recovery tools in the event of a problem, rather than being limited to
> whatever one can squeeze into an initramfs.  One can even include a full
> GUI stack (with accessibility support!), rather than just Plymouth.  For
> Qubes OS, one can include enough of the Xen and Qubes toolstack to even
> launch virtual machines, allowing the use of USB devices and networking
> for recovery purposes.  It even means that one can use a FIDO2 token to
> unlock the hard drive without a USB stack on the host.  And because the
> initramfs _only_ needs to load the boot extension volume, it can be
> very, _very_ small, which works great with using Linux as a coreboot
> payload.

systemd's "system extension" concept ("sysexts") already allows you to
do all that. The stuff I was fantasizing about would only change one
thing: instead of sd-stub from uefi mode already putting the sysexts
you installed into memory for the initrd to consume, it would be some
proto-initrd that would do so. This does not really change what you
can do with this, but mostly is just an optimization, reducing iops
and memory use a bit, and thus boot time latency.

> The only problem I can see that this does not solve is network boot, but
> that is very much a niche use case when compared to the millions of
> Fedora or Debian desktop installs, or even the tens of thousands of
> Qubes OS installs.  Furthermore, I would _much_ rather network boot be
> handled by userspace and kexec, rather than the closed source UEFI network
> stack.

Well, somebody's niche is somebody else's common case. In VM/cloud/server
scenarios network booting is not that "niche" as it might be on the desktop.

> It does require some care when upgrading, as the dm-verity image and the
> UKI cannot both be updated atomically, but one can solve that by first
> writing the new dm-verity image to a separate location.  The UKI will
> try both the old and new locations for the dm-verity image and
> rename the new image over the old one on success.  The wrong image will
> simply fail to mount as its root hash will be wrong.

systemd-sysext already covers this just fine: you can encode in their
"extension-release" file to which base images they match up, and
systemd-sysext will then find the right one to apply, and ignore the
others. Thus just make sure you drop in the sysexts first, and the UKI
last and things should be perfectly robust.
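
A simplified sketch of that matching, as I read the documented rules (this
is not systemd's implementation; the function names are made up, and real
systemd checks more fields, for example the architecture): the ID= must
match, or the extension says ID=_any, and then SYSEXT_LEVEL= is compared if
the extension sets it, otherwise VERSION_ID=.

```python
def parse_release(text: str) -> dict[str, str]:
    """Parse os-release style KEY=value lines into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"')
    return fields


def sysext_matches(host: dict[str, str], ext: dict[str, str]) -> bool:
    """Decide whether an extension-release file fits the running base image."""
    if ext.get("ID") == "_any":
        return True                              # extension opts out of matching
    if ext.get("ID") != host.get("ID"):
        return False
    if "SYSEXT_LEVEL" in ext:                    # preferred when present
        return ext["SYSEXT_LEVEL"] == host.get("SYSEXT_LEVEL")
    return ext.get("VERSION_ID") == host.get("VERSION_ID")


if __name__ == "__main__":
    host = parse_release('ID=fedora\nVERSION_ID="38"\n')
    old = parse_release("ID=fedora\nVERSION_ID=38\n")
    new = parse_release("ID=fedora\nVERSION_ID=39\n")
    print(sysext_matches(host, old), sysext_matches(host, new))
```

With both the old and the new sysext present, only the one matching the
currently booted base is applied, which is why dropping in the sysexts
before the UKI is safe.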

Lennart

--
Lennart Poettering, Berlin


Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-12 Thread Lennart Poettering
On Mon, 11.12.23 12:48, Eric Curtin (ecur...@redhat.com) wrote:

> Although the nice thing about a storage-init like approach is there's
> basically zero copies up front. What storage-init is trying to be, is
> a tool to just call systemd storage things, without also inheriting
> all the systemd stack.

Just to make this clear: using things like systemd-cryptsetup outside
of the systemd stack is not going to work once you leave trivial
setups. i.e. the TPM hookup involves multiple services these days, and
it's not going to get any simpler. i.e. systemd-tpm2-setup,
systemd-pcrextend, systemd-pcrlock and so on. I am sorry, but doing
reasonable disk encryption with TPM involved means you either buy into
the whole systemd offer (i.e. with the service manager) or you have to
rewrite your own systemd.

But maybe I am misunderstanding what you are saying here.

Lennart

--
Lennart Poettering, Berlin


Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-12 Thread Lennart Poettering
On Mon, 11.12.23 12:48, Eric Curtin (ecur...@redhat.com) wrote:

> Sort of yes, but preferably using that __initramfs_start /
> initrd_start buffer as is without copying any bytes anywhere else and
> without teaching the bootloaders to do things.
>
> The "memmap=" approach you suggested sounds like what we are thinking,
> but do you think we could do this without teaching bootloaders to do
> new things?

Well, in a standard UEFI world it would suffice to teach the memmap=
logic to the stub that is glued in front of the kernel. For example,
make sd-stub find the erofs initrd in the UKI, then trivially
synthesize a memmap= switch and append it to the kernel command line.
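
Roughly, the synthesized switch could look like the sketch below. sd-stub is
written in C and does not do this today; the function name, the /dev/pmem0
device name and the extra root options are assumptions for illustration. The
memmap=<size>!<start> form is the kernel's existing syntax for marking a
memory range as protected (persistent) memory on x86.

```python
def append_pmem_args(cmdline: str, start: int, size: int) -> str:
    """Append memmap= and root= so the region shows up as a pmem block device.

    `start` and `size` describe where the erofs image sits in physical
    memory; how a stub would choose them is outside this sketch.
    """
    if start < 0 or size <= 0:
        raise ValueError("start must be >= 0 and size must be > 0")
    # Assumption: the first such region is exposed as /dev/pmem0.
    extra = f"memmap={size:#x}!{start:#x} root=/dev/pmem0 rootfstype=erofs ro"
    return f"{cmdline} {extra}".strip()


if __name__ == "__main__":
    print(append_pmem_args("quiet", start=0x1_0000_0000, size=0x400_0000))
```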

but of course, you don't believe in UEFI or good boot loaders, so you
kinda dug your own grave here...

(The main reason why sd-stub doesn't actually support erofs-initrds,
is that sd-stub also generates initrd cpios on the fly, to pass
credentials and system extension images to the kernel, and you can't
really mix erofs and cpio initrds into one)

Lennart

--
Lennart Poettering, Berlin


Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-12 Thread Paul Menzel
[Sorry for the spam to the people in Cc. Now the real address.]

Dear Luca,


On 11.12.23 at 22:45, Luca Boccassi wrote:
> On Mon, 11 Dec 2023 at 21:20, Demi Marie Obenour wrote:
>>
>> On Mon, Dec 11, 2023 at 08:58:58PM +, Luca Boccassi wrote:
>>> On Mon, 11 Dec 2023 at 20:43, Demi Marie Obenour wrote:
>>>>
>>>> On Mon, Dec 11, 2023 at 08:15:27PM +, Luca Boccassi wrote:
>>>>> On Mon, 11 Dec 2023 at 17:30, Demi Marie Obenour wrote:
>>>>>>
>>>>>> On Mon, Dec 11, 2023 at 10:57:58AM +0100, Lennart Poettering wrote:
>>>>>>> On Fri, 08.12.23 17:59, Eric Curtin (ecur...@redhat.com) wrote:

[…]

>>>>> And for ancient, legacy platforms that do not support modern APIs, the
>>>>> old ways will still be there, and can be used. Nobody is going to take
>>>>> away grub and dracut from the internet, if you got some special corner
>>>>> case where you want to use it it will still be there, but the fact
>>>>> that such corner cases exist cannot stop the rest of the ecosystem
>>>>> that is targeted to modern hardware from evolving into something
>>>>> better, more maintainable and more straightforward.
>>>>
>>>> The problem is not that UEFI is not usable in automotive systems.  The
>>>> problem is that U-Boot (or any other UEFI implementation) is an extra
>>>> stage in the boot process, slows things down, and has more attack
>>>> surface.
>>>
>>> Whatever firmware you use will have an attack surface, the interface
>>> it provides - whether legacy bios or uefi-based - is irrelevant for
>>> that. Skipping or reimplementing all the verity, tpm, etc logic also
>>> increases the attack surface, as does adding initrd-only code that is
>>> never tested and exercised outside of that limited context. If you are
>>> running with legacy bios on ancient hardware you also will likely lack
>>> tpm, secure boot, and so on, so it's all moot, any security argument
>>> goes out of the window. If anybody cares about platform security, then
>>> a tpm-capable and secureboot-capable firmware with a modern, usable
>>> interface like uefi, running the same code in initrd and full system,
>>> using dm-verity everywhere, is pretty much the best one can do.
>>
>> Neither Chrome OS devices nor Macs with Apple silicon use UEFI, and both
>> have better platform security than any UEFI-based device on the market I
>> am aware of.
> 
> We are talking about Linux distributions here. If one wants to use
> proprietary systems, sure, there are better things out there, but
> that's off topic.

In what way is ChromeOS more proprietary than the other GNU/Linux
distributions that allow installing the Chrome browser?


Kind regards,

Paul


Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-11 Thread Luca Boccassi
On Mon, 11 Dec 2023 at 21:20, Demi Marie Obenour wrote:
>
> On Mon, Dec 11, 2023 at 08:58:58PM +, Luca Boccassi wrote:
> > On Mon, 11 Dec 2023 at 20:43, Demi Marie Obenour wrote:
> > >
> > > -----BEGIN PGP SIGNED MESSAGE-----
> > > Hash: SHA512
> > >
> > > On Mon, Dec 11, 2023 at 08:15:27PM +, Luca Boccassi wrote:
> > > > On Mon, 11 Dec 2023 at 17:30, Demi Marie Obenour
> > > >  wrote:
> > > > >
> > > > > On Mon, Dec 11, 2023 at 10:57:58AM +0100, Lennart Poettering wrote:
> > > > > > On Fr, 08.12.23 17:59, Eric Curtin (ecur...@redhat.com) wrote:
> > > > > >
> > > > > > > Here is the boot sequence with initoverlayfs integrated, the
> > > > > > > > mini-initramfs contains just enough to get storage drivers loaded and
> > > > > > > > storage devices initialized. storage-init is a process that is not
> > > > > > > > designed to replace init, it does just enough to initialize storage
> > > > > > > > (performs a targeted udev trigger on storage), switches to
> > > > > > > > initoverlayfs as root and then executes init.
> > > > > > > >
> > > > > > > > ```
> > > > > > > > fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
> > > > > > > >
> > > > > > > > fw -> bootloader -> kernel -> storage-init   -> init ->
> > > > > > > > ```
> > > > > > >
> > > > > > > I am not sure I follow what these chains are supposed to mean? Why are
> > > > > > > there two lines?
> > > > > >
> > > > > > So, I generally would agree that the current initrd scheme is not
> > > > > > ideal, and we have been discussing better approaches. But I am not
> > > > > > sure your approach really is useful on generic systems for two
> > > > > > reasons:
> > > > > >
> > > > > > 1. no security model? you need to authenticate your initrd in
> > > > > >2023. There's no execuse to not doing that anymore these days. 
> > > > > > Not
> > > > > >in automotive, and not anywhere else really.
> > > > > >
> > > > > > 2. no way to deal with complex storage? i.e. people use FDE, want to
> > > > > >unlock their root disks with TPM2 and similar things. People use
> > > > > >RAID, LVM, and all that mess.
> > > > > >
> > > > > > Actually the above are kinda the same problem in a way: you need
> > > > > > complex storage, but if you need that you kinda need udev, and
> > > > > > services, and then also systemd and all that other stuff, and that's
> > > > > > why the system works like the system works right now.
> > > > > >
> > > > > > Whenever you devise a system like yours by cutting corners, and
> > > > > > > declaring that you don't want TPM, you don't want signed initrds, you
> > > > > > > don't want to support weird storage, you just solve your problem in a
> > > > > > > very specific way, ignoring the big picture. Which is OK, *if* you can
> > > > > > > actually really work without all that and are willing to maintain the
> > > > > > > solution for your specific problem only.
> > > > > > >
> > > > > > > As I understand you are trying to solve multiple problems at once
> > > > > > > here, and I think one should start with figuring out clearly what
> > > > > > > those are before trying to address them, maybe without compromising on
> > > > > > > security. So my guess is you want to address the following:
> > > > > >
> > > > > > 1. You don't want the whole big initrd to be read off disk on every
> > > > > > >    boot, but only the parts of it that are actually needed.
> > > > > > >
> > > > > > > 2. You don't want the whole big initrd to be fully decompressed on every
> > > > > > >    boot, but only the parts of it that are actually needed.
> > > > > > >
> > > > > > > 3. You want to share data between root fs and initrd
> > > > > > >
> > > > > > > 4. You want to save some boot time by not bringing up an init system
> > > > > > >    in the initrd once, then tearing it down again, and starting it
> > > > > > >    again from the root fs.
> > > > > >
> > > > > > For the items listed above I think you can find different solutions
> > > > > > which do not necessarily compromise security as much.
> > > > > >
> > > > > > So, in the list above you could address the latter three like this:
> > > > > >
> > > > > > 2. Use an erofs rather than a packed cpio as initrd. Make the boot
> > > > > > >    loader load the erofs into contiguous memory, then use memmap=X!Y on
> > > > > > >    the kernel cmdline to synthesize a block device from that, which
> > > > > > >    you then mount directly (without any initrd) via
> > > > > > >    root=/dev/pmem0. This means your boot loader will still load the
> > > > > > >    whole image into memory, but only decompress the bits actually
> > > > > > >    needed. (It also has some other nice benefits I like, such as an
> > > > > > >    immutable rootfs, which tmpfs-based initrds don't have.)
> > > > > > >
> > > > > > > 3. Simply never transition to the root fs, don't mark the initrds in
> > > > 

Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-11 Thread Eric Curtin
On Mon, 11 Dec 2023 at 20:59, Luca Boccassi  wrote:
>
> On Mon, 11 Dec 2023 at 20:43, Demi Marie Obenour
>  wrote:
> >
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA512
> >
> > On Mon, Dec 11, 2023 at 08:15:27PM +, Luca Boccassi wrote:
> > > On Mon, 11 Dec 2023 at 17:30, Demi Marie Obenour
> > >  wrote:
> > > >
> > > > On Mon, Dec 11, 2023 at 10:57:58AM +0100, Lennart Poettering wrote:
> > > > > On Fr, 08.12.23 17:59, Eric Curtin (ecur...@redhat.com) wrote:
> > > > >
> > > > > > Here is the boot sequence with initoverlayfs integrated, the
> > > > > > > mini-initramfs contains just enough to get storage drivers loaded and
> > > > > > > storage devices initialized. storage-init is a process that is not
> > > > > > > designed to replace init, it does just enough to initialize storage
> > > > > > > (performs a targeted udev trigger on storage), switches to
> > > > > > > initoverlayfs as root and then executes init.
> > > > > > >
> > > > > > > ```
> > > > > > > fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
> > > > > > >
> > > > > > > fw -> bootloader -> kernel -> storage-init   -> init ->
> > > > > > > ```
> > > > >
> > > > > I am not sure I follow what these chains are supposed to mean? Why are
> > > > > there two lines?
> > > > >
> > > > > So, I generally would agree that the current initrd scheme is not
> > > > > ideal, and we have been discussing better approaches. But I am not
> > > > > sure your approach really is useful on generic systems for two
> > > > > reasons:
> > > > >
> > > > > 1. no security model? you need to authenticate your initrd in
> > > > >2023. There's no execuse to not doing that anymore these days. Not
> > > > >in automotive, and not anywhere else really.
> > > > >
> > > > > 2. no way to deal with complex storage? i.e. people use FDE, want to
> > > > >unlock their root disks with TPM2 and similar things. People use
> > > > >RAID, LVM, and all that mess.
> > > > >
> > > > > Actually the above are kinda the same problem in a way: you need
> > > > > complex storage, but if you need that you kinda need udev, and
> > > > > services, and then also systemd and all that other stuff, and that's
> > > > > why the system works like the system works right now.
> > > > >
> > > > > Whenever you devise a system like yours by cutting corners, and
> > > > > declaring that you don't want TPM, you don't want signed initrds, you
> > > > > don't want to support weird storage, you just solve your problem in a
> > > > > very specific way, ignoring the big picture. Which is OK, *if* you can
> > > > > actually really work without all that and are willing to maintain the
> > > > > solution for your specific problem only.
> > > > >
> > > > > As I understand you are trying to solve multiple problems at once
> > > > > here, and I think one should start with figuring out clearly what
> > > > > those are before trying to address them, maybe without compromising on
> > > > > security. So my guess is you want to address the following:
> > > > >
> > > > > 1. You don't want the whole big initrd to be read off disk on every
> > > > > >    boot, but only the parts of it that are actually needed.
> > > > > >
> > > > > > 2. You don't want the whole big initrd to be fully decompressed on every
> > > > > >    boot, but only the parts of it that are actually needed.
> > > > > >
> > > > > > 3. You want to share data between root fs and initrd
> > > > > >
> > > > > > 4. You want to save some boot time by not bringing up an init system
> > > > > >    in the initrd once, then tearing it down again, and starting it
> > > > > >    again from the root fs.
> > > > >
> > > > > For the items listed above I think you can find different solutions
> > > > > which do not necessarily compromise security as much.
> > > > >
> > > > > So, in the list above you could address the latter three like this:
> > > > >
> > > > > 2. Use an erofs rather than a packed cpio as initrd. Make the boot
> > > > > >    loader load the erofs into contiguous memory, then use memmap=X!Y on
> > > > > >    the kernel cmdline to synthesize a block device from that, which
> > > > > >    you then mount directly (without any initrd) via
> > > > > >    root=/dev/pmem0. This means your boot loader will still load the
> > > > > >    whole image into memory, but only decompress the bits actually
> > > > > >    needed. (It also has some other nice benefits I like, such as an
> > > > > >    immutable rootfs, which tmpfs-based initrds don't have.)
> > > > > >
> > > > > > 3. Simply never transition to the root fs, don't mark the initrds in
> > > > > >    systemd's eyes as an initrd (specifically: don't add an
> > > > > >    /etc/initrd-release file to it). Instead, just merge resources of
> > > > > >    the root fs into your initrd fs via overlayfs. systemd has
> > > > > >    infrastructure for this: "systemd-sysext". It takes immutable,
> > > > > >    authenticated erofs images (with verity, we call them "DDIs",
> > > > > >    i.e.

Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-11 Thread Demi Marie Obenour
On Mon, Dec 11, 2023 at 08:58:58PM +, Luca Boccassi wrote:
> On Mon, 11 Dec 2023 at 20:43, Demi Marie Obenour
>  wrote:
> >
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA512
> >
> > On Mon, Dec 11, 2023 at 08:15:27PM +, Luca Boccassi wrote:
> > > On Mon, 11 Dec 2023 at 17:30, Demi Marie Obenour
> > >  wrote:
> > > >
> > > > On Mon, Dec 11, 2023 at 10:57:58AM +0100, Lennart Poettering wrote:
> > > > > On Fr, 08.12.23 17:59, Eric Curtin (ecur...@redhat.com) wrote:
> > > > >
> > > > > > Here is the boot sequence with initoverlayfs integrated, the
> > > > > > > mini-initramfs contains just enough to get storage drivers loaded and
> > > > > > > storage devices initialized. storage-init is a process that is not
> > > > > > > designed to replace init, it does just enough to initialize storage
> > > > > > > (performs a targeted udev trigger on storage), switches to
> > > > > > > initoverlayfs as root and then executes init.
> > > > > > >
> > > > > > > ```
> > > > > > > fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
> > > > > > >
> > > > > > > fw -> bootloader -> kernel -> storage-init   -> init ->
> > > > > > > ```
> > > > >
> > > > > I am not sure I follow what these chains are supposed to mean? Why are
> > > > > there two lines?
> > > > >
> > > > > So, I generally would agree that the current initrd scheme is not
> > > > > ideal, and we have been discussing better approaches. But I am not
> > > > > sure your approach really is useful on generic systems for two
> > > > > reasons:
> > > > >
> > > > > 1. no security model? you need to authenticate your initrd in
> > > > >2023. There's no execuse to not doing that anymore these days. Not
> > > > >in automotive, and not anywhere else really.
> > > > >
> > > > > 2. no way to deal with complex storage? i.e. people use FDE, want to
> > > > >unlock their root disks with TPM2 and similar things. People use
> > > > >RAID, LVM, and all that mess.
> > > > >
> > > > > Actually the above are kinda the same problem in a way: you need
> > > > > complex storage, but if you need that you kinda need udev, and
> > > > > services, and then also systemd and all that other stuff, and that's
> > > > > why the system works like the system works right now.
> > > > >
> > > > > Whenever you devise a system like yours by cutting corners, and
> > > > > declaring that you don't want TPM, you don't want signed initrds, you
> > > > > don't want to support weird storage, you just solve your problem in a
> > > > > very specific way, ignoring the big picture. Which is OK, *if* you can
> > > > > actually really work without all that and are willing to maintain the
> > > > > solution for your specific problem only.
> > > > >
> > > > > As I understand you are trying to solve multiple problems at once
> > > > > here, and I think one should start with figuring out clearly what
> > > > > those are before trying to address them, maybe without compromising on
> > > > > security. So my guess is you want to address the following:
> > > > >
> > > > > 1. You don't want the whole big initrd to be read off disk on every
> > > > > >    boot, but only the parts of it that are actually needed.
> > > > > >
> > > > > > 2. You don't want the whole big initrd to be fully decompressed on every
> > > > > >    boot, but only the parts of it that are actually needed.
> > > > > >
> > > > > > 3. You want to share data between root fs and initrd
> > > > > >
> > > > > > 4. You want to save some boot time by not bringing up an init system
> > > > > >    in the initrd once, then tearing it down again, and starting it
> > > > > >    again from the root fs.
> > > > >
> > > > > For the items listed above I think you can find different solutions
> > > > > which do not necessarily compromise security as much.
> > > > >
> > > > > So, in the list above you could address the latter three like this:
> > > > >
> > > > > 2. Use an erofs rather than a packed cpio as initrd. Make the boot
> > > > > >    loader load the erofs into contiguous memory, then use memmap=X!Y on
> > > > > >    the kernel cmdline to synthesize a block device from that, which
> > > > > >    you then mount directly (without any initrd) via
> > > > > >    root=/dev/pmem0. This means your boot loader will still load the
> > > > > >    whole image into memory, but only decompress the bits actually
> > > > > >    needed. (It also has some other nice benefits I like, such as an
> > > > > >    immutable rootfs, which tmpfs-based initrds don't have.)
> > > > > >
> > > > > > 3. Simply never transition to the root fs, don't mark the initrds in
> > > > > >    systemd's eyes as an initrd (specifically: don't add an
> > > > > >    /etc/initrd-release file to it). Instead, just merge resources of
> > > > > >    the root fs into your initrd fs via overlayfs. systemd has
> > > > > >    infrastructure for this: "systemd-sysext". It takes immutable,
> > > > > >    authenticated erofs images (with verity, we call them "DDIs",
> > > > >  

Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-11 Thread Luca Boccassi
On Mon, 11 Dec 2023 at 20:43, Demi Marie Obenour
 wrote:
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA512
>
> On Mon, Dec 11, 2023 at 08:15:27PM +, Luca Boccassi wrote:
> > On Mon, 11 Dec 2023 at 17:30, Demi Marie Obenour
> >  wrote:
> > >
> > > On Mon, Dec 11, 2023 at 10:57:58AM +0100, Lennart Poettering wrote:
> > > > On Fr, 08.12.23 17:59, Eric Curtin (ecur...@redhat.com) wrote:
> > > >
> > > > > Here is the boot sequence with initoverlayfs integrated, the
> > > > > mini-initramfs contains just enough to get storage drivers loaded and
> > > > > storage devices initialized. storage-init is a process that is not
> > > > > designed to replace init, it does just enough to initialize storage
> > > > > (performs a targeted udev trigger on storage), switches to
> > > > > initoverlayfs as root and then executes init.
> > > > >
> > > > > ```
> > > > > > fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
> > > > > >
> > > > > > fw -> bootloader -> kernel -> storage-init   -> init ->
> > > > > ```
> > > >
> > > > I am not sure I follow what these chains are supposed to mean? Why are
> > > > there two lines?
> > > >
> > > > So, I generally would agree that the current initrd scheme is not
> > > > ideal, and we have been discussing better approaches. But I am not
> > > > sure your approach really is useful on generic systems for two
> > > > reasons:
> > > >
> > > > 1. no security model? you need to authenticate your initrd in
> > > >2023. There's no execuse to not doing that anymore these days. Not
> > > >in automotive, and not anywhere else really.
> > > >
> > > > 2. no way to deal with complex storage? i.e. people use FDE, want to
> > > >unlock their root disks with TPM2 and similar things. People use
> > > >RAID, LVM, and all that mess.
> > > >
> > > > Actually the above are kinda the same problem in a way: you need
> > > > complex storage, but if you need that you kinda need udev, and
> > > > services, and then also systemd and all that other stuff, and that's
> > > > why the system works like the system works right now.
> > > >
> > > > Whenever you devise a system like yours by cutting corners, and
> > > > declaring that you don't want TPM, you don't want signed initrds, you
> > > > don't want to support weird storage, you just solve your problem in a
> > > > very specific way, ignoring the big picture. Which is OK, *if* you can
> > > > actually really work without all that and are willing to maintain the
> > > > solution for your specific problem only.
> > > >
> > > > As I understand you are trying to solve multiple problems at once
> > > > here, and I think one should start with figuring out clearly what
> > > > those are before trying to address them, maybe without compromising on
> > > > security. So my guess is you want to address the following:
> > > >
> > > > 1. You don't want the whole big initrd to be read off disk on every
> > > > >    boot, but only the parts of it that are actually needed.
> > > > >
> > > > > 2. You don't want the whole big initrd to be fully decompressed on every
> > > > >    boot, but only the parts of it that are actually needed.
> > > > >
> > > > > 3. You want to share data between root fs and initrd
> > > > >
> > > > > 4. You want to save some boot time by not bringing up an init system
> > > > >    in the initrd once, then tearing it down again, and starting it
> > > > >    again from the root fs.
> > > >
> > > > For the items listed above I think you can find different solutions
> > > > which do not necessarily compromise security as much.
> > > >
> > > > So, in the list above you could address the latter three like this:
> > > >
> > > > 2. Use an erofs rather than a packed cpio as initrd. Make the boot
> > > > >    loader load the erofs into contiguous memory, then use memmap=X!Y on
> > > > >    the kernel cmdline to synthesize a block device from that, which
> > > > >    you then mount directly (without any initrd) via
> > > > >    root=/dev/pmem0. This means your boot loader will still load the
> > > > >    whole image into memory, but only decompress the bits actually
> > > > >    needed. (It also has some other nice benefits I like, such as an
> > > > >    immutable rootfs, which tmpfs-based initrds don't have.)
> > > > >
> > > > > 3. Simply never transition to the root fs, don't mark the initrds in
> > > > >    systemd's eyes as an initrd (specifically: don't add an
> > > > >    /etc/initrd-release file to it). Instead, just merge resources of
> > > > >    the root fs into your initrd fs via overlayfs. systemd has
> > > > >    infrastructure for this: "systemd-sysext". It takes immutable,
> > > > >    authenticated erofs images (with verity, we call them "DDIs",
> > > > >    i.e. "discoverable disk images") that it overlays into /usr/. [You
> > > > >    could also very nicely combine this approach with systemd's
> > > > >    portable services, and nspawn containers, which operate on the same
> > > > >    authenticated images]. At MSFT we have a

Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-11 Thread Demi Marie Obenour
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

On Mon, Dec 11, 2023 at 08:15:27PM +, Luca Boccassi wrote:
> On Mon, 11 Dec 2023 at 17:30, Demi Marie Obenour
>  wrote:
> >
> > On Mon, Dec 11, 2023 at 10:57:58AM +0100, Lennart Poettering wrote:
> > > On Fr, 08.12.23 17:59, Eric Curtin (ecur...@redhat.com) wrote:
> > >
> > > > Here is the boot sequence with initoverlayfs integrated, the
> > > > mini-initramfs contains just enough to get storage drivers loaded and
> > > > storage devices initialized. storage-init is a process that is not
> > > > designed to replace init, it does just enough to initialize storage
> > > > (performs a targeted udev trigger on storage), switches to
> > > > initoverlayfs as root and then executes init.
> > > >
> > > > ```
> > > > fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
> > > >
> > > > fw -> bootloader -> kernel -> storage-init   -> init ->
> > > > ```
> > >
> > > I am not sure I follow what these chains are supposed to mean? Why are
> > > there two lines?
> > >
> > > So, I generally would agree that the current initrd scheme is not
> > > ideal, and we have been discussing better approaches. But I am not
> > > sure your approach really is useful on generic systems for two
> > > reasons:
> > >
> > > 1. no security model? you need to authenticate your initrd in
> > >2023. There's no execuse to not doing that anymore these days. Not
> > >in automotive, and not anywhere else really.
> > >
> > > 2. no way to deal with complex storage? i.e. people use FDE, want to
> > >unlock their root disks with TPM2 and similar things. People use
> > >RAID, LVM, and all that mess.
> > >
> > > Actually the above are kinda the same problem in a way: you need
> > > complex storage, but if you need that you kinda need udev, and
> > > services, and then also systemd and all that other stuff, and that's
> > > why the system works like the system works right now.
> > >
> > > Whenever you devise a system like yours by cutting corners, and
> > > declaring that you don't want TPM, you don't want signed initrds, you
> > > don't want to support weird storage, you just solve your problem in a
> > > very specific way, ignoring the big picture. Which is OK, *if* you can
> > > actually really work without all that and are willing to maintain the
> > > solution for your specific problem only.
> > >
> > > As I understand you are trying to solve multiple problems at once
> > > here, and I think one should start with figuring out clearly what
> > > those are before trying to address them, maybe without compromising on
> > > security. So my guess is you want to address the following:
> > >
> > > 1. You don't want the whole big initrd to be read off disk on every
> > > >    boot, but only the parts of it that are actually needed.
> > > >
> > > > 2. You don't want the whole big initrd to be fully decompressed on every
> > > >    boot, but only the parts of it that are actually needed.
> > > >
> > > > 3. You want to share data between root fs and initrd
> > > >
> > > > 4. You want to save some boot time by not bringing up an init system
> > > >    in the initrd once, then tearing it down again, and starting it
> > > >    again from the root fs.
> > >
> > > For the items listed above I think you can find different solutions
> > > which do not necessarily compromise security as much.
> > >
> > > So, in the list above you could address the latter three like this:
> > >
> > > 2. Use an erofs rather than a packed cpio as initrd. Make the boot
> > > >    loader load the erofs into contiguous memory, then use memmap=X!Y on
> > > >    the kernel cmdline to synthesize a block device from that, which
> > > >    you then mount directly (without any initrd) via
> > > >    root=/dev/pmem0. This means your boot loader will still load the
> > > >    whole image into memory, but only decompress the bits actually
> > > >    needed. (It also has some other nice benefits I like, such as an
> > > >    immutable rootfs, which tmpfs-based initrds don't have.)
> > > >
> > > > 3. Simply never transition to the root fs, don't mark the initrds in
> > > >    systemd's eyes as an initrd (specifically: don't add an
> > > >    /etc/initrd-release file to it). Instead, just merge resources of
> > > >    the root fs into your initrd fs via overlayfs. systemd has
> > > >    infrastructure for this: "systemd-sysext". It takes immutable,
> > > >    authenticated erofs images (with verity, we call them "DDIs",
> > > >    i.e. "discoverable disk images") that it overlays into /usr/. [You
> > > >    could also very nicely combine this approach with systemd's
> > > >    portable services, and nspawn containers, which operate on the same
> > > >    authenticated images]. At MSFT we have a major product that works
> > > >    exactly like this: the OS runs off a rootfs that is loaded as an
> > > >    initrd, and everything that runs on top of this are just these
> > > >    verity disk images, using overlayfs and portable services.
> > >
> > > 4. The proposal 
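
The storage-init sequence quoted at the top of this message (targeted udev trigger, switch to initoverlayfs as root, execute init) can be sketched as an ordered command plan. This is only an illustrative Python sketch, not the project's implementation: the function name and every path are hypothetical, the erofs-plus-transient-overlay layout follows the description given elsewhere in the thread, and steps such as mounting the partition that holds the image or creating the directories are left out. Nothing is executed; the function only returns the commands.

```python
from typing import List


def storage_init_plan(image: str, init: str = "/sbin/init") -> List[List[str]]:
    """Return, in order, the commands a storage-init-like helper might run.

    All paths below are hypothetical placeholders chosen for illustration.
    """
    lower = "/run/initoverlayfs/lower"   # read-only erofs mount point
    upper = "/run/initoverlayfs/upper"   # transient writable layer (e.g. on tmpfs)
    work = "/run/initoverlayfs/work"     # overlayfs work directory
    merged = "/run/initoverlayfs/root"   # merged view that becomes the new root
    overlay_opts = f"lowerdir={lower},upperdir={upper},workdir={work}"
    return [
        # Targeted trigger: only block devices, not the whole device tree.
        ["udevadm", "trigger", "--type=devices", "--subsystem-match=block"],
        ["udevadm", "settle"],
        # Loop-mount the erofs image read-only.
        ["mount", "-t", "erofs", "-o", "ro,loop", image, lower],
        # Stack a writable overlay on top of the read-only image.
        ["mount", "-t", "overlay", "overlay", "-o", overlay_opts, merged],
        # Switch to the merged tree and hand control to the real init.
        ["switch_root", merged, init],
    ]


if __name__ == "__main__":
    for argv in storage_init_plan("/boot/initoverlayfs.img"):
        print(" ".join(argv))
```

Returning the plan as data keeps the ordering easy to inspect and test without touching a real system.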

Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-11 Thread Luca Boccassi
On Mon, 11 Dec 2023 at 17:30, Demi Marie Obenour
 wrote:
>
> On Mon, Dec 11, 2023 at 10:57:58AM +0100, Lennart Poettering wrote:
> > On Fr, 08.12.23 17:59, Eric Curtin (ecur...@redhat.com) wrote:
> >
> > > Here is the boot sequence with initoverlayfs integrated, the
> > > mini-initramfs contains just enough to get storage drivers loaded and
> > > storage devices initialized. storage-init is a process that is not
> > > designed to replace init, it does just enough to initialize storage
> > > (performs a targeted udev trigger on storage), switches to
> > > initoverlayfs as root and then executes init.
> > >
> > > ```
> > > fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
> > >
> > > fw -> bootloader -> kernel -> storage-init   -> init ->
> > > ```
> >
> > I am not sure I follow what these chains are supposed to mean? Why are
> > there two lines?
> >
> > So, I generally would agree that the current initrd scheme is not
> > ideal, and we have been discussing better approaches. But I am not
> > sure your approach really is useful on generic systems for two
> > reasons:
> >
> > 1. no security model? you need to authenticate your initrd in
> >2023. There's no execuse to not doing that anymore these days. Not
> >in automotive, and not anywhere else really.
> >
> > 2. no way to deal with complex storage? i.e. people use FDE, want to
> >unlock their root disks with TPM2 and similar things. People use
> >RAID, LVM, and all that mess.
> >
> > Actually the above are kinda the same problem in a way: you need
> > complex storage, but if you need that you kinda need udev, and
> > services, and then also systemd and all that other stuff, and that's
> > why the system works like the system works right now.
> >
> > Whenever you devise a system like yours by cutting corners, and
> > declaring that you don't want TPM, you don't want signed initrds, you
> > don't want to support weird storage, you just solve your problem in a
> > very specific way, ignoring the big picture. Which is OK, *if* you can
> > actually really work without all that and are willing to maintain the
> > solution for your specific problem only.
> >
> > As I understand you are trying to solve multiple problems at once
> > here, and I think one should start with figuring out clearly what
> > those are before trying to address them, maybe without compromising on
> > security. So my guess is you want to address the following:
> >
> > 1. You don't want the whole big initrd to be read off disk on every
> > >    boot, but only the parts of it that are actually needed.
> > >
> > > 2. You don't want the whole big initrd to be fully decompressed on every
> > >    boot, but only the parts of it that are actually needed.
> > >
> > > 3. You want to share data between root fs and initrd
> > >
> > > 4. You want to save some boot time by not bringing up an init system
> > >    in the initrd once, then tearing it down again, and starting it
> > >    again from the root fs.
> >
> > For the items listed above I think you can find different solutions
> > which do not necessarily compromise security as much.
> >
> > So, in the list above you could address the latter three like this:
> >
> > 2. Use an erofs rather than a packed cpio as initrd. Make the boot
> > >    loader load the erofs into contiguous memory, then use memmap=X!Y on
> > >    the kernel cmdline to synthesize a block device from that, which
> > >    you then mount directly (without any initrd) via
> > >    root=/dev/pmem0. This means your boot loader will still load the
> > >    whole image into memory, but only decompress the bits actually
> > >    needed. (It also has some other nice benefits I like, such as an
> > >    immutable rootfs, which tmpfs-based initrds don't have.)
> > >
> > > 3. Simply never transition to the root fs, don't mark the initrds in
> > >    systemd's eyes as an initrd (specifically: don't add an
> > >    /etc/initrd-release file to it). Instead, just merge resources of
> > >    the root fs into your initrd fs via overlayfs. systemd has
> > >    infrastructure for this: "systemd-sysext". It takes immutable,
> > >    authenticated erofs images (with verity, we call them "DDIs",
> > >    i.e. "discoverable disk images") that it overlays into /usr/. [You
> > >    could also very nicely combine this approach with systemd's
> > >    portable services, and nspawn containers, which operate on the same
> > >    authenticated images]. At MSFT we have a major product that works
> > >    exactly like this: the OS runs off a rootfs that is loaded as an
> > >    initrd, and everything that runs on top of this are just these
> > >    verity disk images, using overlayfs and portable services.
> >
> > 4. The proposal in 3 also addresses goal 4.
> >
> > Which leaves item 1, which is a bit harder to address. We have been
> > > discussing this off and on internally too. A generic solution to this
> > is hard. My current thinking for this could be something like this,
> > covering the UEFI world: support sticking a 
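
As a rough illustration of item 2 in the quoted list (an erofs image preloaded into contiguous memory and exposed through memmap=X!Y), the Python sketch below builds the matching kernel command-line fragment. It assumes the kernel's documented `memmap=nn[KMG]!ss[KMG]` form, in which the first value is the size of the region and the second its start address. The helper name, the added `rootfstype=erofs ro` options and the example size and address are hypothetical; only `memmap=X!Y` and `root=/dev/pmem0` come from the quoted text.

```python
def pmem_cmdline(size_bytes: int, start_bytes: int, device: str = "/dev/pmem0") -> str:
    """Build a cmdline fragment that marks a RAM region as protected memory
    and uses the resulting pmem block device as the root filesystem.

    Both values are emitted as hexadecimal byte counts. The function name and
    the defaults are illustrative assumptions, not part of any real tool.
    """
    if size_bytes <= 0:
        raise ValueError("size must be positive")
    if start_bytes < 0:
        raise ValueError("start address must not be negative")
    # memmap=<size>!<start>: region runs from start to start + size.
    return f"memmap={size_bytes:#x}!{start_bytes:#x} root={device} rootfstype=erofs ro"


if __name__ == "__main__":
    # Hypothetical example: a 64 MiB image placed at the 4 GiB boundary.
    print(pmem_cmdline(64 * 1024**2, 4 * 1024**3))
```

The boot loader would still have to place the image at that address before the kernel starts; the sketch only covers the command-line side.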

Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-11 Thread Demi Marie Obenour
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

On Mon, Dec 11, 2023 at 05:03:13PM +, Eric Curtin wrote:
> On Mon, 11 Dec 2023 at 16:36, Demi Marie Obenour
>  wrote:
> >
> > On Mon, Dec 11, 2023 at 10:57:58AM +0100, Lennart Poettering wrote:
> > > On Fr, 08.12.23 17:59, Eric Curtin (ecur...@redhat.com) wrote:
> > >
> > > > Here is the boot sequence with initoverlayfs integrated, the
> > > > mini-initramfs contains just enough to get storage drivers loaded and
> > > > storage devices initialized. storage-init is a process that is not
> > > > designed to replace init, it does just enough to initialize storage
> > > > (performs a targeted udev trigger on storage), switches to
> > > > initoverlayfs as root and then executes init.
> > > >
> > > > ```
> > > > fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
> > > >
> > > > fw -> bootloader -> kernel -> storage-init   -> init ->
> > > > ```
> > >
> > > I am not sure I follow what these chains are supposed to mean? Why are
> > > there two lines?
> > >
> > > So, I generally would agree that the current initrd scheme is not
> > > ideal, and we have been discussing better approaches. But I am not
> > > sure your approach really is useful on generic systems for two
> > > reasons:
> > >
> > > 1. no security model? you need to authenticate your initrd in
> > >2023. There's no execuse to not doing that anymore these days. Not
> > >in automotive, and not anywhere else really.
> > >
> > > 2. no way to deal with complex storage? i.e. people use FDE, want to
> > >unlock their root disks with TPM2 and similar things. People use
> > >RAID, LVM, and all that mess.
> > >
> > > Actually the above are kinda the same problem in a way: you need
> > > complex storage, but if you need that you kinda need udev, and
> > > services, and then also systemd and all that other stuff, and that's
> > > why the system works like the system works right now.
> > >
> > > Whenever you devise a system like yours by cutting corners, and
> > > declaring that you don't want TPM, you don't want signed initrds, you
> > > don't want to support weird storage, you just solve your problem in a
> > > very specific way, ignoring the big picture. Which is OK, *if* you can
> > > actually really work without all that and are willing to maintain the
> > > solution for your specific problem only.
> > >
> > > As I understand you are trying to solve multiple problems at once
> > > here, and I think one should start with figuring out clearly what
> > > those are before trying to address them, maybe without compromising on
> > > security. So my guess is you want to address the following:
> > >
> > > 1. You don't want the whole big initrd to be read off disk on every
> > >    boot, but only the parts of it that are actually needed.
> > >
> > > 2. You don't want the whole big initrd to be fully decompressed on every
> > >    boot, but only the parts of it that are actually needed.
> > >
> > > 3. You want to share data between root fs and initrd
> > >
> > > 4. You want to save some boot time by not bringing up an init system
> > >    in the initrd once, then tearing it down again, and starting it
> > >    again from the root fs.
> > >
> > > For the items listed above I think you can find different solutions
> > > which do not necessarily compromise security as much.
> > >
> > > So, in the list above you could address the latter three like this:
> > >
> > > 2. Use an erofs rather than a packed cpio as initrd. Make the boot
> > >    loader load the erofs into contiguous memory, then use memmap=X!Y on
> > >    the kernel cmdline to synthesize a block device from that, which
> > >    you then mount directly (without any initrd) via
> > >    root=/dev/pmem0. This means your boot loader will still load the
> > >    whole image into memory, but only decompress the bits actually
> > >    needed. (It also has some other nice benefits I like, such as an
> > >    immutable rootfs, which tmpfs-based initrds don't have.)
> > >
> > > 3. Simply never transition to the root fs, don't mark the initrds in
> > >    systemd's eyes as an initrd (specifically: don't add an
> > >    /etc/initrd-release file to it). Instead, just merge resources of
> > >    the root fs into your initrd fs via overlayfs. systemd has
> > >    infrastructure for this: "systemd-sysext". It takes immutable,
> > >    authenticated erofs images (with verity, we call them "DDIs",
> > >    i.e. "discoverable disk images") that it overlays into /usr/. [You
> > >    could also very nicely combine this approach with systemd's
> > >    portable services, and nspawn containers, which operate on the same
> > >    authenticated images]. At MSFT we have a major product that works
> > >    exactly like this: the OS runs off a rootfs that is loaded as an
> > >    initrd, and everything that runs on top of this are just these
> > >    verity disk images, using overlayfs and portable services.
> > >
> > > 4. The proposal in 

Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-11 Thread Neal Gompa
On Mon, Dec 11, 2023 at 12:30 PM Demi Marie Obenour
 wrote:

Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-11 Thread Eric Curtin
On Mon, 11 Dec 2023 at 16:36, Demi Marie Obenour
 wrote:

Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-11 Thread Demi Marie Obenour
On Mon, Dec 11, 2023 at 10:57:58AM +0100, Lennart Poettering wrote:
> On Fr, 08.12.23 17:59, Eric Curtin (ecur...@redhat.com) wrote:
> 
> > Here is the boot sequence with initoverlayfs integrated, the
> > mini-initramfs contains just enough to get storage drivers loaded and
> > storage devices initialized. storage-init is a process that is not
> > designed to replace init, it does just enough to initialize storage
> > (performs a targeted udev trigger on storage), switches to
> > initoverlayfs as root and then executes init.
> >
> > ```
> > fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
> >
> > fw -> bootloader -> kernel -> storage-init   -> init ->
> > ```
> 
> I am not sure I follow what these chains are supposed to mean? Why are
> there two lines?
> 
> So, I generally would agree that the current initrd scheme is not
> ideal, and we have been discussing better approaches. But I am not
> sure your approach really is useful on generic systems for two
> reasons:
> 
> 1. no security model? you need to authenticate your initrd in
>    2023. There's no excuse for not doing that anymore these days. Not
>    in automotive, and not anywhere else really.
>
> 2. no way to deal with complex storage? i.e. people use FDE, want to
>    unlock their root disks with TPM2 and similar things. People use
>    RAID, LVM, and all that mess.
> 
> Actually the above are kinda the same problem in a way: you need
> complex storage, but if you need that you kinda need udev, and
> services, and then also systemd and all that other stuff, and that's
> why the system works like the system works right now.
> 
> Whenever you devise a system like yours by cutting corners, and
> declaring that you don't want TPM, you don't want signed initrds, you
> don't want to support weird storage, you just solve your problem in a
> very specific way, ignoring the big picture. Which is OK, *if* you can
> actually really work without all that and are willing to maintain the
> solution for your specific problem only.
> 
> As I understand you are trying to solve multiple problems at once
> here, and I think one should start with figuring out clearly what
> those are before trying to address them, maybe without compromising on
> security. So my guess is you want to address the following:
> 
> 1. You don't want the whole big initrd to be read off disk on every
>    boot, but only the parts of it that are actually needed.
>
> 2. You don't want the whole big initrd to be fully decompressed on every
>    boot, but only the parts of it that are actually needed.
>
> 3. You want to share data between root fs and initrd
>
> 4. You want to save some boot time by not bringing up an init system
>    in the initrd once, then tearing it down again, and starting it
>    again from the root fs.
> 
> For the items listed above I think you can find different solutions
> which do not necessarily compromise security as much.
> 
> So, in the list above you could address the latter three like this:
> 
> 2. Use an erofs rather than a packed cpio as initrd. Make the boot
>    loader load the erofs into contiguous memory, then use memmap=X!Y on
>    the kernel cmdline to synthesize a block device from that, which
>    you then mount directly (without any initrd) via
>    root=/dev/pmem0. This means your boot loader will still load the
>    whole image into memory, but only decompress the bits actually
>    needed. (It also has some other nice benefits I like, such as an
>    immutable rootfs, which tmpfs-based initrds don't have.)
>
> 3. Simply never transition to the root fs, don't mark the initrds in
>    systemd's eyes as an initrd (specifically: don't add an
>    /etc/initrd-release file to it). Instead, just merge resources of
>    the root fs into your initrd fs via overlayfs. systemd has
>    infrastructure for this: "systemd-sysext". It takes immutable,
>    authenticated erofs images (with verity, we call them "DDIs",
>    i.e. "discoverable disk images") that it overlays into /usr/. [You
>    could also very nicely combine this approach with systemd's
>    portable services, and nspawn containers, which operate on the same
>    authenticated images]. At MSFT we have a major product that works
>    exactly like this: the OS runs off a rootfs that is loaded as an
>    initrd, and everything that runs on top of this are just these
>    verity disk images, using overlayfs and portable services.
> 
> 4. The proposal in 3 also addresses goal 4.
> 
> Which leaves item 1, which is a bit harder to address. We have been
> discussing this off and on internally too. A generic solution to this
> is hard. My current thinking for this could be something like this,
> covering the UEFI world: support sticking a DDI for the main initrd in
> the ESP. The ESP is by definition unencrypted and unauthenticated,
> but otherwise relatively well defined, i.e. known to be vfat and
> discoverable via UUID on a GPT disk. So: build a minimal
> 

Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-11 Thread Eric Curtin
On Mon, 11 Dec 2023 at 12:48, Eric Curtin  wrote:
>
> On Mon, 11 Dec 2023 at 11:51, Lennart Poettering  
> wrote:
> >
> > On Mo, 11.12.23 11:28, Eric Curtin (ecur...@redhat.com) wrote:
> >
> > > > > For the items listed above I think you can find different solutions
> > > > > which do not necessarily compromise security as much.
> > > > >
> > > > > So, in the list above you could address the latter three like this:
> > > > >
> > > > > > 2. Use an erofs rather than a packed cpio as initrd. Make the boot
> > > > > >    loader load the erofs into contiguous memory, then use memmap=X!Y on
> > > > > >    the kernel cmdline to synthesize a block device from that, which
> > > > > >    you then mount directly (without any initrd) via
> > > > > >    root=/dev/pmem0. This means your boot loader will still load the
> > > > > >    whole image into memory, but only decompress the bits actually
> > > > > >    needed. (It also has some other nice benefits I like, such as an
> > > > > >    immutable rootfs, which tmpfs-based initrds don't have.)
> > >
> > > What I am unsure about here, is the "make the bootloader load the
> > > erofs into contiguous memory" part. I wonder could we try and use the
> > > existing initramfs data as is.
> >
> > Today's initrds are packed cpio archives of an OS file system
> > hierarchy. What I proposed means you'd have to put the OS file system
> > > hierarchy into an erofs image instead. Which is a trivial operation,
> > just unpack and repack.
> >
> > Note that there are two concepts of "initrd" out there.
> >
> > > a) from the kernel perspective an initrd/initramfs (which both are
> > >    badly named, because it's a tmpfs these days) is that packed cpio
> > >    archive that is unpacked into a tmpfs, and then jumped into.
> > >
> > > b) from systemd's perspective an initrd is an OS image that carries an
> > >    /etc/initrd-release file. If that file exists then systemd will not
> > >    boot up the system regularly, but instead just prepare everything
> > >    that it can transition into some other root fs.
> >
> > > Most often in real life, initrds currently qualify under both
> > > definitions, but there's no reason to always do this. You can also
> > have images the kernel would consider an initrd, but systemd does not,
> > which is something we use in the "USI" concept, i.e. "unified system
> > images", which are basically UKIs (large UKIs) with a complete rootfs
> > that is the main system of the OS. And you can also do it the other
> > way round, which is potentially what I am suggesting to you here: use
> > an erofs image that would not be considered an initrd by the kernel,
> > but that systemd would consider one, and transition out of.
> >
> > > > I dunno if
> > > > bootloaders make many assumptions about the format of that data, worst
> > > > case scenario we could encapsulate erofs in the initramfs, cpio-looking
> > > > data.
> >
> > boot loaders generally don't bother with the cpio, it's just "data"
> > for them. Compression algorithms have changed in the past, and it only
> > mattered that the kernel could decompress it, the boot loader doesn't care.
> >
> > > Teach the kernel not to decompress and process the whole
> > > thing and mount it like an erofs alternatively. Does this sound crazy
> > > or reasonable?
> >
> > You are re-inventing the traditional "initrd" logic of the kernel
> > which was a ramdisk (i.e. a block device /dev/ram0), that was filled
> > with some fs of your choice loaded by the boot loader.
>
> Sort of yes, but preferably using that __initramfs_start /
> initrd_start buffer as is without copying any bytes anywhere else and
> without teaching the bootloaders to do things.
>
> The "memmap=" approach you suggested sounds like what we are thinking,
> but do you think we could do this without teaching bootloaders to do
> new things?

For example, could we do that with an "initrd3.0=on" karg, so that the
kernel just uses __initramfs_start and __initramfs_size for the memmap?
(That probably wouldn't be the arg name, it's just for description
purposes here; maybe it would even be a build-time flag, etc.)

>
> Although the nice thing about a storage-init-like approach is that
> there are basically zero copies up front. storage-init is trying to be
> a tool that just calls the systemd storage pieces, without also
> inheriting the whole systemd stack.
>
> >
> > Lennart
> >
> > --
> > Lennart Poettering, Berlin
> >



Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-11 Thread Eric Curtin
On Mon, 11 Dec 2023 at 11:51, Lennart Poettering  wrote:
>
> On Mo, 11.12.23 11:28, Eric Curtin (ecur...@redhat.com) wrote:
>
> > > > For the items listed above I think you can find different solutions
> > > > which do not necessarily compromise security as much.
> > > >
> > > > So, in the list above you could address the latter three like this:
> > > >
> > > > 2. Use an erofs rather than a packed cpio as initrd. Make the boot
> > > >    loader load the erofs into contiguous memory, then use memmap=X!Y on
> > > >    the kernel cmdline to synthesize a block device from that, which
> > > >    you then mount directly (without any initrd) via
> > > >    root=/dev/pmem0. This means your boot loader will still load the
> > > >    whole image into memory, but only decompress the bits actually
> > > >    needed. (It also has some other nice benefits I like, such as an
> > > >    immutable rootfs, which tmpfs-based initrds don't have.)
> >
> > What I am unsure about here, is the "make the bootloader load the
> > erofs into contiguous memory" part. I wonder could we try and use the
> > existing initramfs data as is.
>
> Today's initrds are packed cpio archives of an OS file system
> hierarchy. What I proposed means you'd have to put the OS file system
> hierarchy into an erofs image instead. Which is a trivial operation,
> just unpack and repack.
>
> Note that there are two concepts of "initrd" out there.
>
> a) from the kernel perspective an initrd/initramfs (which both are
>    badly named, because it's a tmpfs these days) is that packed cpio
>    archive that is unpacked into a tmpfs, and then jumped into.
>
> b) from systemd's perspective an initrd is an OS image that carries an
>    /etc/initrd-release file. If that file exists then systemd will not
>    boot up the system regularly, but instead just prepare everything
>    that it can transition into some other root fs.
>
> Most often in real life, initrds currently qualify under both
> definitions, but there's no reason to always do this. You can also
> have images the kernel would consider an initrd, but systemd does not,
> which is something we use in the "USI" concept, i.e. "unified system
> images", which are basically UKIs (large UKIs) with a complete rootfs
> that is the main system of the OS. And you can also do it the other
> way round, which is potentially what I am suggesting to you here: use
> an erofs image that would not be considered an initrd by the kernel,
> but that systemd would consider one, and transition out of.
>
> > I dunno if
> > bootloaders make many assumptions about the format of that data, worst
> > case scenario we could encapsulate erofs in the initramfs, cpio-looking
> > data.
>
> boot loaders generally don't bother with the cpio, it's just "data"
> for them. Compression algorithms have changed in the past, and it only
> mattered that the kernel could decompress it, the boot loader doesn't care.
>
> > Teach the kernel not to decompress and process the whole
> > thing and mount it like an erofs alternatively. Does this sound crazy
> > or reasonable?
>
> You are re-inventing the traditional "initrd" logic of the kernel
> which was a ramdisk (i.e. a block device /dev/ram0), that was filled
> with some fs of your choice loaded by the boot loader.

Sort of yes, but preferably using that __initramfs_start /
initrd_start buffer as is without copying any bytes anywhere else and
without teaching the bootloaders to do things.

The "memmap=" approach you suggested sounds like what we are thinking,
but do you think we could do this without teaching bootloaders to do
new things?
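
For reference, the kernel's memmap=nn!ss syntax marks nn bytes starting at
physical address ss as protected (pmem-style) memory. A minimal sketch of the
resulting command line, assuming a hypothetical 64M image placed at the 4G
mark and a kernel with pmem and erofs support built in; the helper name and
the values are illustrative only:

```python
def pmem_root_cmdline(size: str, start: str, device: str = "/dev/pmem0") -> str:
    """Build kernel arguments that boot from a RAM region exposed as pmem.

    `size` and `start` use the kernel's K/M/G suffixes; both values are
    placeholders that whoever writes the command line would have to know.
    """
    # memmap=<size>!<start>: treat <size> bytes at <start> as protected memory.
    memmap = f"memmap={size}!{start}"
    # Mount the synthesized block device read-only as an erofs root.
    return f"{memmap} root={device} rootfstype=erofs ro"


if __name__ == "__main__":
    print(pmem_root_cmdline("64M", "4G"))
```

With this syntax both numbers have to be known when the command line is
written, which is where the boot loader comes into the picture.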

Although the nice thing about a storage-init-like approach is that
there are basically zero copies up front. storage-init is trying to be
a tool that just calls the systemd storage pieces, without also
inheriting the whole systemd stack.

>
> Lennart
>
> --
> Lennart Poettering, Berlin
>



Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-11 Thread Lennart Poettering
On Mo, 11.12.23 11:42, Eric Curtin (ecur...@redhat.com) wrote:

> I am also thinking, what is the difference between "make the
> bootloader load the erofs into contiguous memory" part and doing
> something like storage-init.

Well, from my PoV there's value in reducing the stages of the boot
process, and reducing the number of storage stacks you need in the
mix. Hence, the boot loader can load stuff from disk into memory
anyway, it always has done that, typically the kernel and the
initrd. Just swapping out the format of the initrd to get better
behaviour is relatively cheap there, means no additional storage
logic, no additional stage of the boot. You basically only have "boot
loader" (which loads kernel and initrd), and the "host os" (which runs
off the final rootfs).

Otoh if you let your storage-init load the initrd, then you basically
have a third step in the middle, which shares a lot of properties with
the last step, but also is distinct. I mean, you probably would reinvent
your own udev and DM stack for that, to get verity in the mix (because
that depends on DM, and udev, to some degree).

In my ideal model, initrds are just part of the UKI btw, so they end
up being loaded together with the rest of the kernel, and need no
verity because they are signed along with the UKI itself.

Lennart

--
Lennart Poettering, Berlin


Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-11 Thread Lennart Poettering
On Mo, 11.12.23 11:28, Eric Curtin (ecur...@redhat.com) wrote:

> > > For the items listed above I think you can find different solutions
> > > which do not necessarily compromise security as much.
> > >
> > > So, in the list above you could address the latter three like this:
> > >
> > > 2. Use an erofs rather than a packed cpio as initrd. Make the boot
> > >    loader load the erofs into contiguous memory, then use memmap=X!Y on
> > >    the kernel cmdline to synthesize a block device from that, which
> > >    you then mount directly (without any initrd) via
> > >    root=/dev/pmem0. This means your boot loader will still load the
> > >    whole image into memory, but only decompress the bits actually
> > >    needed. (It also has some other nice benefits I like, such as an
> > >    immutable rootfs, which tmpfs-based initrds don't have.)
>
> What I am unsure about here, is the "make the bootloader load the
> erofs into contiguous memory" part. I wonder could we try and use the
> existing initramfs data as is.

Today's initrds are packed cpio archives of an OS file system
hierarchy. What I proposed means you'd have to put the OS file system
hierarchy into an erofs image instead. Which is a trivial operation,
just unpack and repack.
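
A minimal sketch of that unpack-and-repack step, assuming the initrd is a
single, already-decompressed cpio archive and that cpio and mkfs.erofs (from
erofs-utils) are installed. The function name and the paths are made up for
illustration, and the sketch only prints the commands rather than running
them:

```python
import shlex


def repack_commands(cpio_archive: str, tree_dir: str, erofs_image: str) -> list[str]:
    """Return shell commands that turn a cpio archive into an erofs image.

    Hypothetical helper; pass absolute paths, since the cpio step changes
    directory before reading the archive.
    """
    q = shlex.quote
    return [
        # Create the directory the archive gets unpacked into.
        f"mkdir -p {q(tree_dir)}",
        # cpio: -i extract, -d create leading directories, -m keep mtimes.
        f"cd {q(tree_dir)} && cpio -idm < {q(cpio_archive)}",
        # mkfs.erofs takes the output image first, then the source directory.
        f"mkfs.erofs {q(erofs_image)} {q(tree_dir)}",
    ]


if __name__ == "__main__":
    for command in repack_commands(
        "/tmp/initrd.cpio", "/tmp/initrd-tree", "/tmp/initrd.erofs"
    ):
        print(command)
```

Real-world initrds are usually compressed and may carry an extra uncompressed
cpio (e.g. CPU microcode) in front, so they would need to be split and
decompressed before this step.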

Note that there are two concepts of "initrd" out there.

a) from the kernel perspective an initrd/initramfs (which both are
   badly named, because it's a tmpfs these days) is that packed cpio
   archive that is unpacked into a tmpfs, and then jumped into.

b) from systemd's perspective an initrd is an OS image that carries an
   /etc/initrd-release file. If that file exists then systemd will not
   boot up the system regularly, but instead just prepare everything
   that it can transition into some other root fs.

Most often in real life, initrds currently qualify under both
definitions, but there's no reason to always do this. You can also
have images the kernel would consider an initrd, but systemd does not,
which is something we use in the "USI" concept, i.e. "unified system
images", which are basically UKIs (large UKIs) with a complete rootfs
that is the main system of the OS. And you can also do it the other
way round, which is potentially what I am suggesting to you here: use
an erofs image that would not be considered an initrd by the kernel,
but that systemd would consider one, and transition out of.
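
The two definitions can be read as two independent yes/no questions. The
snippet below is an illustration only: the /etc/initrd-release check is the
one described above, while the function names and the labels for the four
combinations are shorthand made up here, not systemd or kernel terminology:

```python
from pathlib import Path


def systemd_sees_initrd(root: str = "/") -> bool:
    """True if the OS tree at `root` carries an /etc/initrd-release file."""
    return (Path(root) / "etc" / "initrd-release").exists()


def classify(kernel_unpacks_cpio: bool, has_initrd_release: bool) -> str:
    """Map the two definitions onto the cases mentioned in the discussion."""
    if kernel_unpacks_cpio and has_initrd_release:
        return "classic initrd (both definitions apply)"
    if kernel_unpacks_cpio:
        return "USI-style image (initrd for the kernel, main system for systemd)"
    if has_initrd_release:
        return "erofs image that systemd treats as an initrd and transitions out of"
    return "regular root fs"


if __name__ == "__main__":
    # Assume the running system was not booted from a packed cpio.
    print(classify(kernel_unpacks_cpio=False, has_initrd_release=systemd_sees_initrd()))
```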

> I dunno if
> bootloaders make many assumptions about the format of that data, worst
> case scenario we could encapsulate erofs in the initramfs, cpio-looking
> data.

boot loaders generally don't bother with the cpio, it's just "data"
for them. Compression algorithms have changed in the past, and it only
mattered that the kernel could decompress it, the boot loader doesn't care.

> Teach the kernel not to decompress and process the whole
> thing and mount it like an erofs alternatively. Does this sound crazy
> or reasonable?

You are re-inventing the traditional "initrd" logic of the kernel
which was a ramdisk (i.e. a block device /dev/ram0), that was filled
with some fs of your choice loaded by the boot loader.

Lennart

--
Lennart Poettering, Berlin


Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-11 Thread Eric Curtin
I am also wondering what the difference is between the "make the
bootloader load the erofs into contiguous memory" part and doing
something like storage-init.

They are similar approaches: both introduce something in the middle to
handle the erofs.

Is mise le meas/Regards,

Eric Curtin

On Mon, 11 Dec 2023 at 11:28, Eric Curtin  wrote:
>
> On Mon, 11 Dec 2023 at 11:20, Eric Curtin  wrote:
> >
> > On Mon, 11 Dec 2023 at 10:06, Lennart Poettering  
> > wrote:
> > >
> > > On Fr, 08.12.23 17:59, Eric Curtin (ecur...@redhat.com) wrote:
> > >
> > > > Here is the boot sequence with initoverlayfs integrated, the
> > > > mini-initramfs contains just enough to get storage drivers loaded and
> > > > storage devices initialized. storage-init is a process that is not
> > > > designed to replace init, it does just enough to initialize storage
> > > > (performs a targeted udev trigger on storage), switches to
> > > > initoverlayfs as root and then executes init.
> > > >
> > > > ```
> > > > fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
> > > >
> > > > fw -> bootloader -> kernel -> storage-init   -> init ->
> > > > ```
> > >
> > > I am not sure I follow what these chains are supposed to mean? Why are
> > > there two lines?
> >
> > The top line is the filesystem transition, the bottom is more like a
> > process perspective. Will make this clearer in future.
> >
> > >
> > > So, I generally would agree that the current initrd scheme is not
> > > ideal, and we have been discussing better approaches. But I am not
> > > sure your approach really is useful on generic systems for two
> > > reasons:
> > >
> > > 1. no security model? you need to authenticate your initrd in
> > >    2023. There's no excuse for not doing that anymore these days. Not
> > >    in automotive, and not anywhere else really.
> >
> > Yes you are right, there is no excuse, the plan was to mount using
> > dm-verity most likely with the details from the initramfs, but
> > admittedly we had not looked into that in great detail.
> >
> > >
> > > > 2. no way to deal with complex storage? i.e. people use FDE, want to
> > > >    unlock their root disks with TPM2 and similar things. People use
> > > >    RAID, LVM, and all that mess.
> >
> > We had 3 thoughts on this:
> >
> > 1. Just worry about the common use-cases and let everyone else
> > fall back to the approaches we use today.
> > 2. Try and split up systemd to make it even smaller. We do use
> > systemd-udev in the small initramfs storage-init process so far.
> > 3. Reimplement some things? But as little as possible, on a case by
> > case basis, we certainly don't want to fall into the trap of rewriting
> > systemd that's for sure, systemd does these things very well.
> >
> > Tbh, if we try and implement this in kernelspace a lot of these
> > questions go away. You just teach the kernel to deal with the
> > filesystem image early (say erofs or whatever other filesystem) and
> > have that data where initramfs data currently is. You still pay for
> > the initial read, but you still save a bunch of kernel time.
> >
> > >
> > > Actually the above are kinda the same problem in a way: you need
> > > complex storage, but if you need that you kinda need udev, and
> > > services, and then also systemd and all that other stuff, and that's
> > > why the system works like the system works right now.
> >
> > True, but there is also a bunch of stuff in current initrds today
> > that isn't required to mount basic storage, but is designed around
> > the whole idea of having an early throwaway filesystem.
> >
> > >
> > > Whenever you devise a system like yours by cutting corners, and
> > > declaring that you don't want TPM, you don't want signed initrds, you
> > > don't want to support weird storage, you just solve your problem in a
> > > very specific way, ignoring the big picture. Which is OK, *if* you can
> > > actually really work without all that and are willing to maintain the
> > > solution for your specific problem only.
> > >
> > > As I understand you are trying to solve multiple problems at once
> > > here, and I think one should start with figuring out clearly what
> > > those are before trying to address them, maybe without compromising on
> > > security. So my guess is you want to address the following:
> > >
> > > > 1. You don't want the whole big initrd to be read off disk on every
> > > >    boot, but only the parts of it that are actually needed.
> > > >
> > > > 2. You don't want the whole big initrd to be fully decompressed on every
> > > >    boot, but only the parts of it that are actually needed.
> > > >
> > > > 3. You want to share data between root fs and initrd
> > > >
> > > > 4. You want to save some boot time by not bringing up an init system
> > > >    in the initrd once, then tearing it down again, and starting it
> > > >    again from the root fs.
> >
> > It's mainly the top 3 that were the goals. And that people have the
> > freedom to consider using heavier weight generic libraries, 

Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-11 Thread Eric Curtin
On Mon, 11 Dec 2023 at 11:20, Eric Curtin  wrote:
>
> On Mon, 11 Dec 2023 at 10:06, Lennart Poettering  wrote:
> >
> > On Fr, 08.12.23 17:59, Eric Curtin (ecur...@redhat.com) wrote:
> >
> > > Here is the boot sequence with initoverlayfs integrated, the
> > > mini-initramfs contains just enough to get storage drivers loaded and
> > > storage devices initialized. storage-init is a process that is not
> > > designed to replace init, it does just enough to initialize storage
> > > (performs a targeted udev trigger on storage), switches to
> > > initoverlayfs as root and then executes init.
> > >
> > > ```
> > > fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
> > >
> > > fw -> bootloader -> kernel -> storage-init   -> init ->
> > > ```
> >
> > I am not sure I follow what these chains are supposed to mean? Why are
> > there two lines?
>
> The top line is the filesystem transition, the bottom is more like a
> process perspective. Will make this clearer in future.
>
> >
> > So, I generally would agree that the current initrd scheme is not
> > ideal, and we have been discussing better approaches. But I am not
> > sure your approach really is useful on generic systems for two
> > reasons:
> >
> > > 1. no security model? you need to authenticate your initrd in
> > >    2023. There's no excuse for not doing that anymore these days. Not
> > >    in automotive, and not anywhere else really.
>
> Yes you are right, there is no excuse, the plan was to mount using
> dm-verity most likely with the details from the initramfs, but
> admittedly we had not looked into that in great detail.
>
> >
> > > 2. no way to deal with complex storage? i.e. people use FDE, want to
> > >    unlock their root disks with TPM2 and similar things. People use
> > >    RAID, LVM, and all that mess.
>
> We had 3 thoughts on this:
>
> 1. Just worry about the common use-cases and let everyone else
> fall back to the approaches we use today.
> 2. Try and split up systemd to make it even smaller. We do use
> systemd-udev in the small initramfs storage-init process so far.
> 3. Reimplement some things? But as little as possible, on a case by
> case basis, we certainly don't want to fall into the trap of rewriting
> systemd that's for sure, systemd does these things very well.
>
> Tbh, if we try and implement this in kernelspace a lot of these
> questions go away. You just teach the kernel to deal with the
> filesystem image early (say erofs or whatever other filesystem) and
> have that data where initramfs data currently is. You still pay for
> the initial read, but you still save a bunch of kernel time.
>
> >
> > Actually the above are kinda the same problem in a way: you need
> > complex storage, but if you need that you kinda need udev, and
> > services, and then also systemd and all that other stuff, and that's
> > why the system works like the system works right now.
>
> True, but there is also a bunch of stuff in current initrds today
> that isn't required to mount basic storage, but is designed around
> the whole idea of having an early throwaway filesystem.
>
> >
> > Whenever you devise a system like yours by cutting corners, and
> > declaring that you don't want TPM, you don't want signed initrds, you
> > don't want to support weird storage, you just solve your problem in a
> > very specific way, ignoring the big picture. Which is OK, *if* you can
> > actually really work without all that and are willing to maintain the
> > solution for your specific problem only.
> >
> > As I understand you are trying to solve multiple problems at once
> > here, and I think one should start with figuring out clearly what
> > those are before trying to address them, maybe without compromising on
> > security. So my guess is you want to address the following:
> >
> > > 1. You don't want the whole big initrd to be read off disk on every
> > >    boot, but only the parts of it that are actually needed.
> > >
> > > 2. You don't want the whole big initrd to be fully decompressed on every
> > >    boot, but only the parts of it that are actually needed.
> > >
> > > 3. You want to share data between root fs and initrd
> > >
> > > 4. You want to save some boot time by not bringing up an init system
> > >    in the initrd once, then tearing it down again, and starting it
> > >    again from the root fs.
>
> It's mainly the top 3 that were the goals. And that people have the
> freedom to consider using heavier weight generic libraries, tools,
> etc. if they want. You want to use Rust (or languages X, Y, Z) to
> write something early boot, go ahead! You'll only pay the cost for the
> larger binary if you actually use it. The week I started tinkering at
> this, there was a mini-debate on whether we should include glib or not
> in the initrd. And we are regularly under pressure to reduce boot time
> at the moment.
>
> Number 4 was a convenient way to do an early version of this, stick a
> process in between systemd and the kernel. But it turns out, it works
> 

Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-11 Thread Eric Curtin
On Mon, 11 Dec 2023 at 10:06, Lennart Poettering  wrote:
>
> On Fr, 08.12.23 17:59, Eric Curtin (ecur...@redhat.com) wrote:
>
> > Here is the boot sequence with initoverlayfs integrated, the
> > mini-initramfs contains just enough to get storage drivers loaded and
> > storage devices initialized. storage-init is a process that is not
> > designed to replace init, it does just enough to initialize storage
> > (performs a targeted udev trigger on storage), switches to
> > initoverlayfs as root and then executes init.
> >
> > ```
> > fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
> >
> > fw -> bootloader -> kernel -> storage-init   -> init ->
> > ```
>
> I am not sure I follow what these chains are supposed to mean? Why are
> there two lines?

The top line is the filesystem transition, the bottom is more like a
process perspective. Will make this clearer in future.
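
A minimal sketch of how the two lines pair up, written as plain data. The
pairing of each process with a filesystem stage is my reading of the
description above (storage-init runs from the mini-initramfs, init is
executed once initoverlayfs is the root and keeps running after the switch
to the rootfs); it is an illustration, not code from the project.

```python
# Each entry: (filesystem in use as root, process running from it).
# This pairing is an interpretation of the two chains, not project code.
BOOT_STAGES = [
    ("mini-initramfs", "storage-init"),
    ("initoverlayfs", "init"),
    ("rootfs", "init"),
]


def render(stages):
    """Return one 'filesystem: process' line per boot stage."""
    return "\n".join(f"{fs}: {proc}" for fs, proc in stages)


if __name__ == "__main__":
    print(render(BOOT_STAGES))
```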

>
> So, I generally would agree that the current initrd scheme is not
> ideal, and we have been discussing better approaches. But I am not
> sure your approach really is useful on generic systems for two
> reasons:
>
> 1. no security model? you need to authenticate your initrd in
>    2023. There's no excuse for not doing that anymore these days. Not
>    in automotive, and not anywhere else really.

Yes you are right, there is no excuse, the plan was to mount using
dm-verity most likely with the details from the initramfs, but
admittedly we had not looked into that in great detail.

>
> 2. no way to deal with complex storage? i.e. people use FDE, want to
>    unlock their root disks with TPM2 and similar things. People use
>    RAID, LVM, and all that mess.

We had 3 thoughts on this:

1. Just worry about the common use-cases and let everyone else
fall back to the approaches we use today.
2. Try and split up systemd to make it even smaller. We do use
systemd-udev in the small initramfs storage-init process so far.
3. Reimplement some things? But as little as possible, on a case by
case basis, we certainly don't want to fall into the trap of rewriting
systemd that's for sure, systemd does these things very well.

Tbh, if we try and implement this in kernelspace a lot of these
questions go away. You just teach the kernel to deal with the
filesystem image early (say erofs or whatever other filesystem) and
have that data where initramfs data currently is. You still pay for
the initial read, but you still save a bunch of kernel time.

>
> Actually the above are kinda the same problem in a way: you need
> complex storage, but if you need that you kinda need udev, and
> services, and then also systemd and all that other stuff, and that's
> why the system works like the system works right now.

True, but there is also a bunch of stuff in current initrds today
that isn't required to mount basic storage, but is designed around
the whole idea of having an early throwaway filesystem.

>
> Whenever you devise a system like yours by cutting corners, and
> declaring that you don't want TPM, you don't want signed initrds, you
> don't want to support weird storage, you just solve your problem in a
> very specific way, ignoring the big picture. Which is OK, *if* you can
> actually really work without all that and are willing to maintain the
> solution for your specific problem only.
>
> As I understand you are trying to solve multiple problems at once
> here, and I think one should start with figuring out clearly what
> those are before trying to address them, maybe without compromising on
> security. So my guess is you want to address the following:
>
> 1. You don't want the whole big initrd to be read off disk on every
>    boot, but only the parts of it that are actually needed.
>
> 2. You don't want the whole big initrd to be fully decompressed on every
>    boot, but only the parts of it that are actually needed.
>
> 3. You want to share data between root fs and initrd
>
> 4. You want to save some boot time by not bringing up an init system
>    in the initrd once, then tearing it down again, and starting it
>    again from the root fs.

It's mainly the top 3 that were the goals. And that people have the
freedom to consider using heavier weight generic libraries, tools,
etc. if they want. You want to use Rust (or languages X, Y, Z) to
write something for early boot, go ahead! You'll only pay the cost for
the larger binary if you actually use it. The week I started tinkering
with this, there was a mini-debate on whether we should include glib or
not in the initrd. And we are regularly under pressure to reduce boot
time at the moment.

Number 4 was a convenient way to do an early version of this: stick a
process in between systemd and the kernel. But it turns out it works
very well, the only problem is the reimplementation problem really.

Theoretically this could be systemd-storage-init -> systemd also. Or
systemd and dlopen more libraries as they become available later down
the line.

>
> For the items listed above I 

Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-11 Thread Lennart Poettering
On Mo, 11.12.23 10:57, Lennart Poettering (mzerq...@0pointer.de) wrote:

> Which leaves item 1, which is a bit harder to address. We have been
> discussing this off and on internally too. A generic solution to this
> is hard. My current thinking for this could be something like this,
> covering the UEFI world: support sticking a DDI for the main initrd in
> the ESP. The ESP is per definition unencrypted and unauthenticated,
> but otherwise relatively well defined, i.e. known to be vfat and
> discoverable via UUID on a GPT disk. So: build a minimal
> single-process initrd into the kernel (i.e. UKI) that has exactly the
> storage to find a DDI on the ESP, and set it up. i.e. vfat+erofs fs
> drivers, and dm-verity. Then have a PID 1 that does exactly enough to
> jump into the rootfs stored in the ESP. That latter then has proper
> file system drivers, storage drivers, crypto stack, and can unlock the
> real root. This would still be a pretty specific solution to one set
> of devices though, as it could not cover network boots (i.e. where
> there is just no ESP to boot from), but I think this could be kept
> relatively close, as the logic in that case could just fall back into
> loading the DDI that normally would sit in the ESP fully into
> memory.

BTW, one thing I would like to emphasize though. I think this item is
really the last thing you should focus on. If your OS never
transitions out of the initrd, and gets its payload merged in via
DDIs, then the root fs should be reasonably small enough and "fully
used at boot" (i.e. every sector read anyway) that doing this extra
work of finding a split-out DDI on the ESP is entirely unnecessary and
just a waste of time (both of developer time and boot time).
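
For reference, a rough sketch of what "discoverable via UUID on a GPT disk"
in the quoted paragraph can mean in practice, assuming it refers to matching
the partition type GUID. The ESP type GUID below is the one defined by the
UEFI specification; the device names and the partition table dict are
made-up inputs for illustration only.

```python
import uuid

# Partition type GUID the UEFI specification assigns to the EFI System Partition.
ESP_TYPE_GUID = uuid.UUID("c12a7328-f81f-11d2-ba4b-00a0c93ec93b")


def is_esp(partition_type_guid: str) -> bool:
    """Return True if a GPT partition type GUID string identifies an ESP."""
    return uuid.UUID(partition_type_guid) == ESP_TYPE_GUID


def find_esp(partitions: dict) -> list:
    """Return the device names whose partition type GUID is the ESP type."""
    return [dev for dev, guid in partitions.items() if is_esp(guid)]


if __name__ == "__main__":
    # Hypothetical partition table: device name -> partition type GUID.
    example = {
        "/dev/sda1": "C12A7328-F81F-11D2-BA4B-00A0C93EC93B",
        "/dev/sda2": "0fc63daf-8483-4772-8e79-3d69d8477de4",
    }
    print(find_esp(example))
```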

Lennart

--
Lennart Poettering, Berlin


Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-11 Thread Lennart Poettering
On Fr, 08.12.23 17:59, Eric Curtin (ecur...@redhat.com) wrote:

> Here is the boot sequence with initoverlayfs integrated, the
> mini-initramfs contains just enough to get storage drivers loaded and
> storage devices initialized. storage-init is a process that is not
> designed to replace init, it does just enough to initialize storage
> (performs a targeted udev trigger on storage), switches to
> initoverlayfs as root and then executes init.
>
> ```
> fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
>
> fw -> bootloader -> kernel -> storage-init   -> init ->
> ```

I am not sure I follow what these chains are supposed to mean? Why are
there two lines?

So, I generally would agree that the current initrd scheme is not
ideal, and we have been discussing better approaches. But I am not
sure your approach really is useful on generic systems for two
reasons:

1. no security model? you need to authenticate your initrd in
   2023. There's no excuse for not doing that anymore these days. Not
   in automotive, and not anywhere else really.

2. no way to deal with complex storage? i.e. people use FDE, want to
   unlock their root disks with TPM2 and similar things. People use
   RAID, LVM, and all that mess.

Actually the above are kinda the same problem in a way: you need
complex storage, but if you need that you kinda need udev, and
services, and then also systemd and all that other stuff, and that's
why the system works like the system works right now.

Whenever you devise a system like yours by cutting corners, and
declaring that you don't want TPM, you don't want signed initrds, you
don't want to support weird storage, you just solve your problem in a
very specific way, ignoring the big picture. Which is OK, *if* you can
actually really work without all that and are willing to maintain the
solution for your specific problem only.

As I understand you are trying to solve multiple problems at once
here, and I think one should start with figuring out clearly what
those are before trying to address them, maybe without compromising on
security. So my guess is you want to address the following:

1. You don't want the whole big initrd to be read off disk on every
   boot, but only the parts of it that are actually needed.

2. You don't want the whole big initrd to be fully decompressed on every
   boot, but only the parts of it that are actually needed.

3. You want to share data between root fs and initrd

4. You want to save some boot time by not bringing up an init system
   in the initrd once, then tearing it down again, and starting it
   again from the root fs.

For the items listed above I think you can find different solutions
which do not necessarily compromise security as much.

So, in the list above you could address the latter three like this:

2. Use an erofs rather than a packed cpio as initrd. Make the boot
   loader load the erofs into contiguous memory, then use memmap=X!Y on
   the kernel cmdline to synthesize a block device from that, which
   you then mount directly (without any initrd) via
   root=/dev/pmem0. This means your boot loader will still load the
   whole image into memory, but only decompress the bits actually
   needed. (It also has some other nice benefits I like, such as an
   immutable rootfs, which tmpfs-based initrds don't have.)

3. Simply never transition to the root fs, don't mark the initrd as an
   initrd in systemd's eyes (specifically: don't add an
   /etc/initrd-release file to it). Instead, just merge resources of
   the root fs into your initrd fs via overlayfs. systemd has
   infrastructure for this: "systemd-sysext". It takes immutable,
   authenticated erofs images (with verity, we call them "DDIs",
   i.e. "discoverable disk images") that it overlays into /usr/. [You
   could also very nicely combine this approach with systemd's
   portable services, and nspawn containers, which operate on the same
   authenticated images]. At MSFT we have a major product that works
   exactly like this: the OS runs off a rootfs that is loaded as an
   initrd, and everything that runs on top of this are just these
   verity disk images, using overlayfs and portable services.

4. The proposal in 3 also addresses goal 4.
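
To make item 2 a bit more concrete, here is a small sketch that formats the
kernel command line for it. It assumes the kernel's documented
memmap=nn[KMG]!ss[KMG] form, where nn is the size of the region and ss is
its start address. The 64 MiB size and 4096 MiB start are made-up example
values, and kernel_cmdline() is a hypothetical helper; where the boot loader
actually places the erofs image is platform specific.

```python
def kernel_cmdline(size_mib: int, start_mib: int) -> str:
    """Build the memmap=/root= arguments for an erofs image held in RAM.

    size_mib:  size of the reserved region in MiB (example value, assumption)
    start_mib: physical start address of the region in MiB (assumption)
    """
    if size_mib <= 0 or start_mib < 0:
        raise ValueError("size must be positive and start non-negative")
    # memmap=<size>!<start> marks the region as persistent memory,
    # which then shows up as a pmem block device.
    memmap = f"memmap={size_mib}M!{start_mib}M"
    return f"{memmap} root=/dev/pmem0"


if __name__ == "__main__":
    # Note: '!' usually needs quoting or escaping in shells and boot loader configs.
    print(kernel_cmdline(64, 4096))
```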

Which leaves item 1, which is a bit harder to address. We have been
discussing this off and on internally too. A generic solution to this
is hard. My current thinking for this could be something like this,
covering the UEFI world: support sticking a DDI for the main initrd in
the ESP. The ESP is per definition unencrypted and unauthenticated,
but otherwise relatively well defined, i.e. known to be vfat and
discoverable via UUID on a GPT disk. So: build a minimal
single-process initrd into the kernel (i.e. UKI) that has exactly the
storage to find a DDI on the ESP, and set it up. i.e. vfat+erofs fs
drivers, and dm-verity. Then have a PID 1 that does exactly enough to
jump into the rootfs stored in the ESP. That latter then has 

Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-09 Thread Eric Curtin
On Sat, 9 Dec 2023 at 18:12, Luca Boccassi  wrote:
>
> On Sat, 9 Dec 2023 at 17:58, Eric Curtin  wrote:
> >
> > On Sat, 9 Dec 2023 at 17:46, Luca Boccassi  wrote:
> > >
> > > On Sat, 9 Dec 2023 at 17:25, Eric Curtin  wrote:
> > > >
> > > > On Sat, 9 Dec 2023 at 17:19, Luca Boccassi  wrote:
> > > > >
> > > > > On Sat, 9 Dec 2023 at 15:08, Eric Curtin  wrote:
> > > > > >
> > > > > > > On Sat, 9 Dec 2023 at 14:56, Andrei Borzenkov wrote:
> > > > > > > >
> > > > > > > > On 09.12.2023 17:42, Eric Curtin wrote:
> > > > > > > > > On Sat, 9 Dec 2023 at 12:46, Luca Boccassi wrote:
> > > > > > > > >>
> > > > > > > > >> On Fri, 8 Dec 2023 at 19:00, Eric Curtin wrote:
> > > > > > > > >>>
> > > > > > > > >>> We have been working on a new initial filesystem called initoverlayfs.
> > > > > > > > >>> It is a new filesystem that provides a more scalable approach to
> > > > > > > > >>> initial filesystems as opposed to just using initrds. We are writing
> > > > > > > > >>> this RFC to the systemd and dracut mailing lists (feel free to forward
> > > > > > > > >>> to UAPI group also) because although this solution works without
> > > > > > > > >>> changing the code in these projects, it operates in the same area as
> > > > > > > > >>> systemd, udev, dracut, etc. and uses these tools.
> > > > > > > > >>
> > > > > > > > >> It seems to me everything you described already exists? If you want to
> > > > > > > > >> avoid having an initrd -> rootfs transition, you can already do that -
> > > > > > > > >
> > > > > > > > > You need a initrd -> rootfs transition for generic linux operating
> > > > > > > > > systems right?
> > > > > > > >
> > > > > > > > No, you do not. Nothing stops you from running off initramfs (today you
> > > > > > > > do not really have init*RAM Disk* - the content of initrd is unpacked
> > > > > > > > into initramfs.
> > > > > > >
> > > > > > > Apologies if I am misinterpreting this response, I use terms initrd
> > > > > > > and initramfs interchangeably (not technically correct, but it's common
> > > > > > > to do this). The point is to avoid unpacking as much as possible,
> > > > > > > because in many initrds the majority of the software need not be
> > > > > > > unpacked, but is designed to work with throwaway initial filesystems.
> > > > >
> > > > > sd-stub already supports having a small initrd shipped in the UKI,
> > > > > that is extended via sysexts, and systemd already supports running
> > > > > from it, without any transition to a final rootfs. What else do you
> > > > > need? What problem is this attempting to solve?
> > > >
> > > > > I must give sd-stub a try. The bootloader I most commonly work with
> > > > > (and is one of the target platforms this is intended for) isn't
> > > > > UEFI, we need something more portable.
> > >
> > > Do we, though? All modern hardware platforms (and VMs) that matter are
> > > UEFI. Why would any of this be needed for legacy hardware platforms?
> > > The existing mechanisms can work just fine on those until they reach
> > > EOL, they won't stop working.
> >
> > Respectfully, this is not true. Especially on ARM platforms. I would
> > like it to be true, but it's not true today.
>
> Where any of this would actually matter, they mostly do, and where
> they don't one can put together uboot with uefi mode.

When you are trying to improve boot performance, introducing another
layer of bootloader with uboot doesn't help. You also have to port
every hardware platform you encounter to uboot. And if you can solve
the problem somewhere in the Linux stack rather than in the bootloader,
why would we choose to fix it in the bootloader?

>
> > I should have expanded, we are not trying to avoid transitioning to a
> > final rootfs, the goal is to transition to a final rootfs. But not to
> > decompress and copy all the bytes to a tmpfs up front, rather use
> > something like erofs, overlayfs, etc. sysexts uses erofs+overlayfs, but
> > it's designed with a different goal in mind.
>
> In what way is the goal different?

This project basically builds an initrd, but puts it in an
erofs+overlayfs instead (technically it builds a really small
initrd to initialize some basic storage drivers etc. and builds a
second initrd in an erofs format). All existing software that we've
tested "just works" with this approach, including all the systemd
stuff. And you can also do transparent decompression with lz4hc.
It also means you don't have to be as afraid of bloating your initial
filesystem, because minimizing initrds is tedious work.
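
As a rough illustration of the "second initrd in an erofs format" step, the
sketch below only assembles an mkfs.erofs command line with lz4hc
compression (the -z option, then the output image, then the source
directory). The helper name and both paths are hypothetical, and this is not
how the initoverlayfs tooling itself is implemented.

```python
import shlex


def erofs_build_command(source_dir: str, image_path: str,
                        compressor: str = "lz4hc") -> list:
    """Return the argv for building a compressed erofs image from a directory."""
    # mkfs.erofs usage: mkfs.erofs -z<compressor> <image> <source directory>
    return ["mkfs.erofs", f"-z{compressor}", image_path, source_dir]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only; the command is printed, not run.
    argv = erofs_build_command("/tmp/initoverlayfs-root",
                               "/boot/initoverlayfs-example.img")
    print(shlex.join(argv))
```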

>



Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-09 Thread Luca Boccassi
On Sat, 9 Dec 2023 at 17:58, Eric Curtin  wrote:
>
> On Sat, 9 Dec 2023 at 17:46, Luca Boccassi  wrote:
> >
> > On Sat, 9 Dec 2023 at 17:25, Eric Curtin  wrote:
> > >
> > > On Sat, 9 Dec 2023 at 17:19, Luca Boccassi  wrote:
> > > >
> > > > On Sat, 9 Dec 2023 at 15:08, Eric Curtin  wrote:
> > > > >
> > > > > > On Sat, 9 Dec 2023 at 14:56, Andrei Borzenkov wrote:
> > > > > > >
> > > > > > > On 09.12.2023 17:42, Eric Curtin wrote:
> > > > > > > > On Sat, 9 Dec 2023 at 12:46, Luca Boccassi wrote:
> > > > > > > >>
> > > > > > > >> On Fri, 8 Dec 2023 at 19:00, Eric Curtin wrote:
> > > > > > > >>>
> > > > > > > >>> We have been working on a new initial filesystem called initoverlayfs.
> > > > > > > >>> It is a new filesystem that provides a more scalable approach to
> > > > > > > >>> initial filesystems as opposed to just using initrds. We are writing
> > > > > > > >>> this RFC to the systemd and dracut mailing lists (feel free to forward
> > > > > > > >>> to UAPI group also) because although this solution works without
> > > > > > > >>> changing the code in these projects, it operates in the same area as
> > > > > > > >>> systemd, udev, dracut, etc. and uses these tools.
> > > > > > > >>
> > > > > > > >> It seems to me everything you described already exists? If you want to
> > > > > > > >> avoid having an initrd -> rootfs transition, you can already do that -
> > > > > > > >
> > > > > > > > You need a initrd -> rootfs transition for generic linux operating
> > > > > > > > systems right?
> > > > > > >
> > > > > > > No, you do not. Nothing stops you from running off initramfs (today you
> > > > > > > do not really have init*RAM Disk* - the content of initrd is unpacked
> > > > > > > into initramfs.
> > > > > >
> > > > > > Apologies if I am misinterpreting this response, I use terms initrd
> > > > > > and initramfs interchangeably (not technically correct, but it's common
> > > > > > to do this). The point is to avoid unpacking as much as possible,
> > > > > > because in many initrds the majority of the software need not be
> > > > > > unpacked, but is designed to work with throwaway initial filesystems.
> > > >
> > > > sd-stub already supports having a small initrd shipped in the UKI,
> > > > that is extended via sysexts, and systemd already supports running
> > > > from it, without any transition to a final rootfs. What else do you
> > > > need? What problem is this attempting to solve?
> > >
> > > > I must give sd-stub a try. The bootloader I most commonly work with
> > > > (and is one of the target platforms this is intended for) isn't
> > > > UEFI, we need something more portable.
> >
> > Do we, though? All modern hardware platforms (and VMs) that matter are
> > UEFI. Why would any of this be needed for legacy hardware platforms?
> > The existing mechanisms can work just fine on those until they reach
> > EOL, they won't stop working.
>
> Respectfully, this is not true. Especially on ARM platforms. I would
> like it to be true, but it's not true today.

Where any of this would actually matter, they mostly do, and where
they don't one can put together uboot with uefi mode.

> I should have expanded, we are not trying to avoid transitioning to a
> final rootfs, the goal is to transition to a final rootfs. But not to
> decompress and copy all the bytes to a tmpfs up front, rather use
> something like erofs, overlayfs, etc. sysexts uses erofs+overlayfs, but
> it's designed with a different goal in mind.

In what way is the goal different?


Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-09 Thread Eric Curtin
On Sat, 9 Dec 2023 at 17:46, Luca Boccassi  wrote:
>
> On Sat, 9 Dec 2023 at 17:25, Eric Curtin  wrote:
> >
> > On Sat, 9 Dec 2023 at 17:19, Luca Boccassi  wrote:
> > >
> > > On Sat, 9 Dec 2023 at 15:08, Eric Curtin  wrote:
> > > >
> > > > > On Sat, 9 Dec 2023 at 14:56, Andrei Borzenkov wrote:
> > > > > >
> > > > > > On 09.12.2023 17:42, Eric Curtin wrote:
> > > > > > > On Sat, 9 Dec 2023 at 12:46, Luca Boccassi wrote:
> > > > > > >>
> > > > > > >> On Fri, 8 Dec 2023 at 19:00, Eric Curtin wrote:
> > > > > > >>>
> > > > > > >>> We have been working on a new initial filesystem called initoverlayfs.
> > > > > > >>> It is a new filesystem that provides a more scalable approach to
> > > > > > >>> initial filesystems as opposed to just using initrds. We are writing
> > > > > > >>> this RFC to the systemd and dracut mailing lists (feel free to forward
> > > > > > >>> to UAPI group also) because although this solution works without
> > > > > > >>> changing the code in these projects, it operates in the same area as
> > > > > > >>> systemd, udev, dracut, etc. and uses these tools.
> > > > > > >>
> > > > > > >> It seems to me everything you described already exists? If you want to
> > > > > > >> avoid having an initrd -> rootfs transition, you can already do that -
> > > > > > >
> > > > > > > You need a initrd -> rootfs transition for generic linux operating
> > > > > > > systems right?
> > > > > >
> > > > > > No, you do not. Nothing stops you from running off initramfs (today you
> > > > > > do not really have init*RAM Disk* - the content of initrd is unpacked
> > > > > > into initramfs.
> > > > >
> > > > > Apologies if I am misinterpreting this response, I use terms initrd
> > > > > and initramfs interchangeably (not technically correct, but it's common
> > > > > to do this). The point is to avoid unpacking as much as possible,
> > > > > because in many initrds the majority of the software need not be
> > > > > unpacked, but is designed to work with throwaway initial filesystems.
> > >
> > > sd-stub already supports having a small initrd shipped in the UKI,
> > > that is extended via sysexts, and systemd already supports running
> > > from it, without any transition to a final rootfs. What else do you
> > > need? What problem is this attempting to solve?
> >
> > > I must give sd-stub a try. The bootloader I most commonly work with
> > > (and is one of the target platforms this is intended for) isn't
> > > UEFI, we need something more portable.
>
> Do we, though? All modern hardware platforms (and VMs) that matter are
> UEFI. Why would any of this be needed for legacy hardware platforms?
> The existing mechanisms can work just fine on those until they reach
> EOL, they won't stop working.

Respectfully, this is not true. Especially on ARM platforms. I would
like it to be true, but it's not true today.

I should have expanded: we are not trying to avoid transitioning to a
final rootfs; the goal is still to transition to one. What we want to avoid
is decompressing and copying all the bytes to a tmpfs up front, and instead
use something like erofs, overlayfs, etc. sysexts use erofs+overlayfs too,
but they are designed with a different goal in mind.
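
As a rough illustration of the approach (a sketch only, not the actual
initoverlayfs code): loop-mount a read-only erofs image and put a
tmpfs-backed overlayfs on top, so nothing is unpacked up front. The Python
below only builds the mount(8) command lines and does not run them. The
function name, the image path, and the mount points are all hypothetical.

```python
from pathlib import PurePosixPath


def overlay_mount_plan(image, lower, scratch, target):
    """Return the commands (as argv lists) for an erofs + transient overlay setup.

    All four paths are hypothetical placeholders chosen by the caller.
    Nothing is executed here; running these commands would need root.
    """
    lower = PurePosixPath(lower)
    scratch = PurePosixPath(scratch)
    upper = scratch / "upper"
    work = scratch / "work"
    overlay_opts = f"lowerdir={lower},upperdir={upper},workdir={work}"
    return [
        # Read-only loopback mount of the erofs image file.
        ["mount", "-t", "erofs", "-o", "loop,ro", str(image), str(lower)],
        # RAM-backed scratch area that holds the writable (transient) layer.
        ["mount", "-t", "tmpfs", "tmpfs", str(scratch)],
        ["mkdir", "-p", str(upper), str(work)],
        # Union of the read-only image and the writable tmpfs layer.
        ["mount", "-t", "overlay", "overlay", "-o", overlay_opts, str(target)],
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    for argv in overlay_mount_plan(
        "/boot/initoverlayfs.img", "/run/lower", "/run/scratch", "/run/newroot"
    ):
        print(" ".join(argv))
```

The writable layer lives in tmpfs, so any changes are discarded on the
switch to the final rootfs, much like a conventional initramfs.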




Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-09 Thread Luca Boccassi
On Sat, 9 Dec 2023 at 17:25, Eric Curtin  wrote:
>
> On Sat, 9 Dec 2023 at 17:19, Luca Boccassi  wrote:
> >
> > On Sat, 9 Dec 2023 at 15:08, Eric Curtin  wrote:
> > >
> > > On Sat, 9 Dec 2023 at 14:56, Andrei Borzenkov  wrote:
> > > >
> > > > On 09.12.2023 17:42, Eric Curtin wrote:
> > > > > On Sat, 9 Dec 2023 at 12:46, Luca Boccassi  wrote:
> > > > >>
> > > > >> On Fri, 8 Dec 2023 at 19:00, Eric Curtin  wrote:
> > > > >>>
> > > > > >>> We have been working on a new initial filesystem called initoverlayfs.
> > > > >>> It is a new filesystem that provides a more scalable approach to
> > > > >>> initial filesystems as opposed to just using initrds. We are writing
> > > > > >>> this RFC to the systemd and dracut mailing lists (feel free to forward
> > > > >>> to UAPI group also) because although this solution works without
> > > > >>> changing the code in these projects, it operates in the same area as
> > > > >>> systemd, udev, dracut, etc. and uses these tools.
> > > > >>
> > > > > >> It seems to me everything you described already exists? If you want to
> > > > > >> avoid having an initrd -> rootfs transition, you can already do that -
> > > > >
> > > > > You need a initrd -> rootfs transition for generic linux operating
> > > > > systems right?
> > > >
> > > > No, you do not. Nothing stops you from running off initramfs (today you
> > > > do not really have init*RAM Disk* - the content of initrd is unpacked
> > > > into initramfs.
> > >
> > > Apologies if I am misinterpreting this response, I use terms initrd
> > > and initramfs
> > > interchangeably (not technically correct, but it's common to do this). The
> > > point is to avoid unpacking as much as possible, because in many initrds
> > > the majority of the software need not be unpacked, but is designed to work
> > > with throwaway initial filesystems.
> >
> > sd-stub already supports having a small initrd shipped in the UKI,
> > that is extended via sysexts, and systemd already supports running
> > from it, without any transition to a final rootfs. What else do you
> > need? What problem is this attempting to solve?
>
> > I must give sd-stub a try. The bootloader I most commonly work with (and is one
> > of the target platforms this is intended for) isn't UEFI, we need something more
> > portable.

Do we, though? All modern hardware platforms (and VMs) that matter are
UEFI. Why would any of this be needed for legacy hardware platforms?
The existing mechanisms can work just fine on those until they reach
EOL, they won't stop working.


Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-09 Thread Eric Curtin
On Sat, 9 Dec 2023 at 17:19, Luca Boccassi  wrote:
>
> On Sat, 9 Dec 2023 at 15:08, Eric Curtin  wrote:
> >
> > On Sat, 9 Dec 2023 at 14:56, Andrei Borzenkov  wrote:
> > >
> > > On 09.12.2023 17:42, Eric Curtin wrote:
> > > > On Sat, 9 Dec 2023 at 12:46, Luca Boccassi  wrote:
> > > >>
> > > >> On Fri, 8 Dec 2023 at 19:00, Eric Curtin  wrote:
> > > >>>
> > > >>> We have been working on a new initial filesystem called initoverlayfs.
> > > >>> It is a new filesystem that provides a more scalable approach to
> > > >>> initial filesystems as opposed to just using initrds. We are writing
> > > >>> this RFC to the systemd and dracut mailing lists (feel free to forward
> > > >>> to UAPI group also) because although this solution works without
> > > >>> changing the code in these projects, it operates in the same area as
> > > >>> systemd, udev, dracut, etc. and uses these tools.
> > > >>
> > > >> It seems to me everything you described already exists? If you want to
> > > >> avoid having an initrd -> rootfs transition, you can already do that -
> > > >
> > > > You need a initrd -> rootfs transition for generic linux operating
> > > > systems right?
> > >
> > > No, you do not. Nothing stops you from running off initramfs (today you
> > > do not really have init*RAM Disk* - the content of initrd is unpacked
> > > into initramfs.
> >
> > Apologies if I am misinterpreting this response, I use terms initrd
> > and initramfs
> > interchangeably (not technically correct, but it's common to do this). The
> > point is to avoid unpacking as much as possible, because in many initrds
> > the majority of the software need not be unpacked, but is designed to work
> > with throwaway initial filesystems.
>
> sd-stub already supports having a small initrd shipped in the UKI,
> that is extended via sysexts, and systemd already supports running
> from it, without any transition to a final rootfs. What else do you
> need? What problem is this attempting to solve?

I must give sd-stub a try. The bootloader I most commonly work with (which is
one of the target platforms this is intended for) isn't UEFI; we need something
more portable.

Is mise le meas/Regards,

Eric Curtin




Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-09 Thread Luca Boccassi
On Sat, 9 Dec 2023 at 15:08, Eric Curtin  wrote:
>
> On Sat, 9 Dec 2023 at 14:56, Andrei Borzenkov  wrote:
> >
> > On 09.12.2023 17:42, Eric Curtin wrote:
> > > On Sat, 9 Dec 2023 at 12:46, Luca Boccassi  wrote:
> > >>
> > >> On Fri, 8 Dec 2023 at 19:00, Eric Curtin  wrote:
> > >>>
> > >>> We have been working on a new initial filesystem called initoverlayfs.
> > >>> It is a new filesystem that provides a more scalable approach to
> > >>> initial filesystems as opposed to just using initrds. We are writing
> > >>> this RFC to the systemd and dracut mailing lists (feel free to forward
> > >>> to UAPI group also) because although this solution works without
> > >>> changing the code in these projects, it operates in the same area as
> > >>> systemd, udev, dracut, etc. and uses these tools.
> > >>
> > >> It seems to me everything you described already exists? If you want to
> > >> avoid having an initrd -> rootfs transition, you can already do that -
> > >
> > > You need a initrd -> rootfs transition for generic linux operating
> > > systems right?
> >
> > No, you do not. Nothing stops you from running off initramfs (today you
> > do not really have init*RAM Disk* - the content of initrd is unpacked
> > into initramfs.
>
> Apologies if I am misinterpreting this response, I use terms initrd
> and initramfs
> interchangeably (not technically correct, but it's common to do this). The
> point is to avoid unpacking as much as possible, because in many initrds
> the majority of the software need not be unpacked, but is designed to work
> with throwaway initial filesystems.

sd-stub already supports having a small initrd shipped in the UKI,
that is extended via sysexts, and systemd already supports running
from it, without any transition to a final rootfs. What else do you
need? What problem is this attempting to solve?


Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-09 Thread Eric Curtin
On Sat, 9 Dec 2023 at 15:23, Daan De Meyer  wrote:
>
> > We have been working on a new initial filesystem called initoverlayfs.
> > It is a new filesystem that provides a more scalable approach to
> > initial filesystems as opposed to just using initrds. We are writing
> > this RFC to the systemd and dracut mailing lists (feel free to forward
> > to UAPI group also) because although this solution works without
> > changing the code in these projects, it operates in the same area as
> > systemd, udev, dracut, etc. and uses these tools.
>
> I like the concept of using erofs instead of a compressed cpio and we have
> been discussing doing something similar within systemd. I very much dislike
> the implementation though. I believe this should be implemented natively within
> the Linux kernel instead of hacking around the missing kernel support
> in userspace.
>

I'm not against eventually implementing this in kernelspace, it's
something I've thought about. Implementing in userspace made more
sense to start as a lot of this tooling is much easier to work with in
userspace. It was much faster to write this in userspace to prove the
benefits, test, etc.

It is easier to maintain and develop software in userspace, though. So
we would need to think seriously about why we would push this into
kernelspace, what the benefits are, etc.

> If the kernel would add support for supplying an erofs initramfs
> instead of a cpio
> initramfs, put a writable tmpfs on top of it and would unpack any
> extra cpios provided
> by the bootloader on top of the tmpfs, then there wouldn't be any need
> for initoverlayfs.

Do we have to unpack extra cpios? Could that be optional? Mounting
erofs with a transient overlay is really fast. Of course, if people want
to do that, it's fine :)


>
> Before adopting anything like this I believe there should be a serious
> effort to get
> this implemented within Linux itself. Only if that turns out to be
> impossible should
> we fall back to exploring userspace only solutions.
>
> Cheers,
>
> Daan
>
>
> On Sat, 9 Dec 2023 at 16:08, Eric Curtin  wrote:
> >
> > On Sat, 9 Dec 2023 at 14:56, Andrei Borzenkov  wrote:
> > >
> > > On 09.12.2023 17:42, Eric Curtin wrote:
> > > > On Sat, 9 Dec 2023 at 12:46, Luca Boccassi  wrote:
> > > >>
> > > >> On Fri, 8 Dec 2023 at 19:00, Eric Curtin  wrote:
> > > >>>
> > > >>> We have been working on a new initial filesystem called initoverlayfs.
> > > >>> It is a new filesystem that provides a more scalable approach to
> > > >>> initial filesystems as opposed to just using initrds. We are writing
> > > >>> this RFC to the systemd and dracut mailing lists (feel free to forward
> > > >>> to UAPI group also) because although this solution works without
> > > >>> changing the code in these projects, it operates in the same area as
> > > >>> systemd, udev, dracut, etc. and uses these tools.
> > > >>
> > > >> It seems to me everything you described already exists? If you want to
> > > >> avoid having an initrd -> rootfs transition, you can already do that -
> > > >
> > > > You need a initrd -> rootfs transition for generic linux operating
> > > > systems right?
> > >
> > > No, you do not. Nothing stops you from running off initramfs (today you
> > > do not really have init*RAM Disk* - the content of initrd is unpacked
> > > into initramfs.
> >
> > Apologies if I am misinterpreting this response, I use terms initrd
> > and initramfs
> > interchangeably (not technically correct, but it's common to do this). The
> > point is to avoid unpacking as much as possible, because in many initrds
> > the majority of the software need not be unpacked, but is designed to work
> > with throwaway initial filesystems.
> >
> > >
> > > > Or else you start building all sorts of things directly
> > > > into the kernel which isn't really scalable.
> > > >
> > >
> > > See above.
> > >
> >
>



Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-09 Thread Daan De Meyer
> We have been working on a new initial filesystem called initoverlayfs.
> It is a new filesystem that provides a more scalable approach to
> initial filesystems as opposed to just using initrds. We are writing
> this RFC to the systemd and dracut mailing lists (feel free to forward
> to UAPI group also) because although this solution works without
> changing the code in these projects, it operates in the same area as
> systemd, udev, dracut, etc. and uses these tools.

I like the concept of using erofs instead of a compressed cpio and we have
been discussing doing something similar within systemd. I very much dislike
the implementation though. I believe this should be implemented natively within
the Linux kernel instead of hacking around the missing kernel support
in userspace.

If the kernel would add support for supplying an erofs initramfs instead of a
cpio initramfs, put a writable tmpfs on top of it and would unpack any extra
cpios provided by the bootloader on top of the tmpfs, then there wouldn't be
any need for initoverlayfs.

Before adopting anything like this I believe there should be a serious effort
to get this implemented within Linux itself. Only if that turns out to be
impossible should we fall back to exploring userspace-only solutions.

Cheers,

Daan


On Sat, 9 Dec 2023 at 16:08, Eric Curtin  wrote:
>
> On Sat, 9 Dec 2023 at 14:56, Andrei Borzenkov  wrote:
> >
> > On 09.12.2023 17:42, Eric Curtin wrote:
> > > On Sat, 9 Dec 2023 at 12:46, Luca Boccassi  wrote:
> > >>
> > >> On Fri, 8 Dec 2023 at 19:00, Eric Curtin  wrote:
> > >>>
> > >>> We have been working on a new initial filesystem called initoverlayfs.
> > >>> It is a new filesystem that provides a more scalable approach to
> > >>> initial filesystems as opposed to just using initrds. We are writing
> > >>> this RFC to the systemd and dracut mailing lists (feel free to forward
> > >>> to UAPI group also) because although this solution works without
> > >>> changing the code in these projects, it operates in the same area as
> > >>> systemd, udev, dracut, etc. and uses these tools.
> > >>
> > >> It seems to me everything you described already exists? If you want to
> > >> avoid having an initrd -> rootfs transition, you can already do that -
> > >
> > > You need a initrd -> rootfs transition for generic linux operating
> > > systems right?
> >
> > No, you do not. Nothing stops you from running off initramfs (today you
> > do not really have init*RAM Disk* - the content of initrd is unpacked
> > into initramfs.
>
> Apologies if I am misinterpreting this response, I use terms initrd
> and initramfs
> interchangeably (not technically correct, but it's common to do this). The
> point is to avoid unpacking as much as possible, because in many initrds
> the majority of the software need not be unpacked, but is designed to work
> with throwaway initial filesystems.
>
> >
> > > Or else you start building all sorts of things directly
> > > into the kernel which isn't really scalable.
> > >
> >
> > See above.
> >
>


Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-09 Thread Eric Curtin
On Sat, 9 Dec 2023 at 14:56, Andrei Borzenkov  wrote:
>
> On 09.12.2023 17:42, Eric Curtin wrote:
> > On Sat, 9 Dec 2023 at 12:46, Luca Boccassi  wrote:
> >>
> >> On Fri, 8 Dec 2023 at 19:00, Eric Curtin  wrote:
> >>>
> >>> We have been working on a new initial filesystem called initoverlayfs.
> >>> It is a new filesystem that provides a more scalable approach to
> >>> initial filesystems as opposed to just using initrds. We are writing
> >>> this RFC to the systemd and dracut mailing lists (feel free to forward
> >>> to UAPI group also) because although this solution works without
> >>> changing the code in these projects, it operates in the same area as
> >>> systemd, udev, dracut, etc. and uses these tools.
> >>
> >> It seems to me everything you described already exists? If you want to
> >> avoid having an initrd -> rootfs transition, you can already do that -
> >
> > You need a initrd -> rootfs transition for generic linux operating
> > systems right?
>
> No, you do not. Nothing stops you from running off initramfs (today you
> do not really have init*RAM Disk* - the content of initrd is unpacked
> into initramfs.

Apologies if I am misinterpreting this response; I use the terms initrd and
initramfs interchangeably (not technically correct, but it's common to do
this). The point is to avoid unpacking as much as possible, because in many
initrds the majority of the software need not be unpacked, but is designed to
work with throwaway initial filesystems.

>
> > Or else you start building all sorts of things directly
> > into the kernel which isn't really scalable.
> >
>
> See above.
>



Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-09 Thread Andrei Borzenkov

On 09.12.2023 17:42, Eric Curtin wrote:
> On Sat, 9 Dec 2023 at 12:46, Luca Boccassi  wrote:
>>
>> On Fri, 8 Dec 2023 at 19:00, Eric Curtin  wrote:
>>>
>>> We have been working on a new initial filesystem called initoverlayfs.
>>> It is a new filesystem that provides a more scalable approach to
>>> initial filesystems as opposed to just using initrds. We are writing
>>> this RFC to the systemd and dracut mailing lists (feel free to forward
>>> to UAPI group also) because although this solution works without
>>> changing the code in these projects, it operates in the same area as
>>> systemd, udev, dracut, etc. and uses these tools.
>>
>> It seems to me everything you described already exists? If you want to
>> avoid having an initrd -> rootfs transition, you can already do that -
>
> You need a initrd -> rootfs transition for generic linux operating
> systems right?

No, you do not. Nothing stops you from running off initramfs (today you
do not really have init*RAM Disk* - the content of initrd is unpacked
into initramfs).

> Or else you start building all sorts of things directly
> into the kernel which isn't really scalable.

See above.



Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-09 Thread Eric Curtin
On Sat, 9 Dec 2023 at 12:46, Luca Boccassi  wrote:
>
> On Fri, 8 Dec 2023 at 19:00, Eric Curtin  wrote:
> >
> > We have been working on a new initial filesystem called initoverlayfs.
> > It is a new filesystem that provides a more scalable approach to
> > initial filesystems as opposed to just using initrds. We are writing
> > this RFC to the systemd and dracut mailing lists (feel free to forward
> > to UAPI group also) because although this solution works without
> > changing the code in these projects, it operates in the same area as
> > systemd, udev, dracut, etc. and uses these tools.
>
> It seems to me everything you described already exists? If you want to
> avoid having an initrd -> rootfs transition, you can already do that -

You need an initrd -> rootfs transition for generic Linux operating
systems, right? Otherwise you start building all sorts of things directly
into the kernel, which isn't really scalable.

> the initrd code paths run because there's /etc/initrd-release, omit
> that and the transition/phase is avoided. If you want to have an
> overlay with r/o images, you can already do that with sysexts. You'll
> need to reimplement and maintain separately TPM support, LUKS support,
> fido2, etc etc

This is intended to be something you can use with or without sysexts,
not a competing alternative. There will be some reimplementations, but
our hope is to minimize that and leave as much as possible to systemd,
the initoverlayfs stage, etc., where you don't pay the upfront cost of
decompressing and copying all the bytes.

We are open to executing minified systemd libraries/binaries in the
minified initramfs, we do that in the current version of storage-init
by calling systemd udev binaries.




Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-09 Thread Luca Boccassi
On Fri, 8 Dec 2023 at 19:00, Eric Curtin  wrote:
>
> We have been working on a new initial filesystem called initoverlayfs.
> It is a new filesystem that provides a more scalable approach to
> initial filesystems as opposed to just using initrds. We are writing
> this RFC to the systemd and dracut mailing lists (feel free to forward
> to UAPI group also) because although this solution works without
> changing the code in these projects, it operates in the same area as
> systemd, udev, dracut, etc. and uses these tools.

It seems to me everything you described already exists? If you want to
avoid having an initrd -> rootfs transition, you can already do that -
the initrd code paths run because there's /etc/initrd-release, omit
that and the transition/phase is avoided. If you want to have an
overlay with r/o images, you can already do that with sysexts. You'll
need to reimplement and separately maintain TPM support, LUKS support,
FIDO2, etc.
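
A minimal sketch of the marker-file check described above, assuming the
decision depends only on whether /etc/initrd-release exists. systemd's real
detection logic may consider more than this, and `looks_like_initrd` is a
hypothetical name used for illustration.

```python
import tempfile
from pathlib import Path


def looks_like_initrd(root="/"):
    """Simplified check: treat the system as an initrd if the marker file exists."""
    return (Path(root) / "etc" / "initrd-release").exists()


if __name__ == "__main__":
    # Demonstrate against a throwaway directory tree instead of the real root.
    with tempfile.TemporaryDirectory() as tmp:
        etc = Path(tmp) / "etc"
        etc.mkdir()
        print(looks_like_initrd(tmp))  # marker file absent
        (etc / "initrd-release").touch()
        print(looks_like_initrd(tmp))  # marker file present
```

Under this assumption, an image that omits the marker file would simply
never take the initrd code paths.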