Re: Installation image layout

2019-01-16 Thread Chris Murphy
On Wed, Jan 16, 2019 at 10:42 AM Georg Sauthoff  wrote:
>
> On Fri, Jan 04, 2019 at 09:27:33PM +0100, Georg Sauthoff wrote:
> > On Sun, Oct 14, 2018 at 04:35:53PM -0600, Chris Murphy wrote:
> > [..]
> > > And now it replicates extents from seed to sprout.  The copy is faster
> > > than pvmove, rsync, dd, or rpm-ostree deploy.
> >
> > This sounds great!
> >
> > I just tried it (on Fedora 29), but those steps don't work for me:
> >
> > # cryptsetup --readonly luksOpen /dev/nbd0p4 tmp
> > # mount -o noatime /dev/mapper/tmp /mnt/tmp
> > # mount: /mnt/tmp: WARNING: device write-protected, mounted read-only.
> > # btrfs device add /dev/nbd1 /mnt/tmp
> > Performing full device TRIM /dev/nbd1 (4.00GiB) ...
> > ERROR: error adding device '/dev/nbd1': Read-only file system
> >
> > Am I missing something?
>
> Ok, a necessary condition for creating a sprout is setting the seed
> parameter on the source filesystem (via btrfstune). [1]
>
> (with the seed parameter a mount of that FS is automatically read-only)
>
> Thus, this works for me:
>
> # cryptsetup luksOpen /dev/nbd0p4 tmp
> # btrfstune -S 1 /dev/mapper/tmp
> # mount -o noatime /dev/mapper/tmp /mnt/tmp
> # mount: /mnt/tmp: WARNING: device write-protected, mounted read-only.
> # btrfs device add /dev/nbd1 /mnt/tmp
> Performing full device TRIM /dev/nbd1 (2.80GiB) ...
> # mount -o remount,rw /mnt/tmp
> # time btrfs device remove /dev/mapper/tmp /mnt/tmp
> # umount /mnt/tmp
>

Yeah, sorry, I assumed that "the seed" was already flagged with
btrfstune. If it isn't flagged as a seed and is mounted rw,
replication still happens; however, the first device has its
signature wiped and the second device inherits the same fs UUID. The
use case there is live migration from one device to another.

In the seed/sprout use case the seed is not wiped (so it can be an
on-going source), and the sprout gets a new fs UUID assigned.
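
One way to tell which of the two cases applies before adding a device is
to look at the superblock flags. A minimal sketch, assuming that
'btrfs inspect-internal dump-super' names SEEDING in its flags section
when the seed flag is set (the exact output layout may differ between
btrfs-progs versions). is_seed is a hypothetical helper that only parses
text on stdin, so it can be tried on saved output as well:

```shell
# is_seed: read `btrfs inspect-internal dump-super` output on stdin and
# succeed if it mentions the SEEDING flag. Hypothetical helper; the
# output format is an assumption, not taken from this thread.
is_seed() {
    grep -q 'SEEDING'
}

# Usage on a real device (needs root), device path as in the thread:
#   btrfs inspect-internal dump-super /dev/mapper/tmp | is_seed \
#       && echo "flagged as seed" || echo "not a seed"
```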

-- 
Chris Murphy
___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org


Re: Installation image layout

2019-01-16 Thread Georg Sauthoff
On Fri, Jan 04, 2019 at 09:27:33PM +0100, Georg Sauthoff wrote:
> On Sun, Oct 14, 2018 at 04:35:53PM -0600, Chris Murphy wrote:
> [..]
> > And now it replicates extents from seed to sprout.  The copy is faster
> > than pvmove, rsync, dd, or rpm-ostree deploy.
> 
> This sounds great!
> 
> I just tried it (on Fedora 29), but those steps don't work for me:
> 
> # cryptsetup --readonly luksOpen /dev/nbd0p4 tmp
> # mount -o noatime /dev/mapper/tmp /mnt/tmp
> # mount: /mnt/tmp: WARNING: device write-protected, mounted read-only.
> # btrfs device add /dev/nbd1 /mnt/tmp   
> Performing full device TRIM /dev/nbd1 (4.00GiB) ...
> ERROR: error adding device '/dev/nbd1': Read-only file system
> 
> Am I missing something?

Ok, a necessary condition for creating a sprout is setting the seed
parameter on the source filesystem (via btrfstune). [1]

(with the seed parameter a mount of that FS is automatically read-only)

Thus, this works for me:

# cryptsetup luksOpen /dev/nbd0p4 tmp
# btrfstune -S 1 /dev/mapper/tmp
# mount -o noatime /dev/mapper/tmp /mnt/tmp
# mount: /mnt/tmp: WARNING: device write-protected, mounted read-only.
# btrfs device add /dev/nbd1 /mnt/tmp   
Performing full device TRIM /dev/nbd1 (2.80GiB) ...
# mount -o remount,rw /mnt/tmp
# time btrfs device remove /dev/mapper/tmp /mnt/tmp
# umount /mnt/tmp
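
The same sequence can be wrapped in a small function that stops at the
first failing step. This is only a sketch: make_sprout and run are
hypothetical names, and with DRY_RUN=1 (the default here) the commands
are printed rather than executed, so it is safe to try without root:

```shell
# run: print the command when DRY_RUN=1 (the default), else execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# make_sprout SEED_DEV NEW_DEV MOUNTPOINT
# Hypothetical wrapper around the steps above; stops at the first failure.
make_sprout() {
    seed=$1 new=$2 mnt=$3
    run btrfstune -S 1 "$seed" &&             # flag the source as a seed
    run mount -o noatime "$seed" "$mnt" &&    # a seed mounts read-only
    run btrfs device add "$new" "$mnt" &&     # attach the sprout device
    run mount -o remount,rw "$mnt" &&         # writes now go to the sprout
    run btrfs device remove "$seed" "$mnt" && # replicate extents off the seed
    run umount "$mnt"
}

# Dry run with the devices used above:
#   DRY_RUN=1 make_sprout /dev/mapper/tmp /dev/nbd1 /mnt/tmp
```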

Best regards
Georg Sauthoff

[1]: btrfstune is also mentioned in the previously referenced
https://btrfs.wiki.kernel.org/index.php/Seed-device article


Re: Installation image layout

2019-01-04 Thread Georg Sauthoff
On Sun, Oct 14, 2018 at 04:35:53PM -0600, Chris Murphy wrote:
[..]
> Ha! I just realized after all this time that the Btrfs wiki does not
> make clear how to make a sprout, even though it mentions the more
> esoteric recursive seed.[1] Of course you can mkfs.btrfs, mount it,
> and send/receive. But send requires read only snapshots. Making a
> sprout is easier, you just remove the seed device. This is supported
> since 2009.
> 
> # losetup -r /dev/loop0 root.img
> # mount /dev/loop0 /mnt/
> # btrfs device add /dev/sda3 /mnt
> # mount -o remount,rw /mnt
> # btrfs device remove /dev/loop0 /mnt
> 
> And now it replicates extents from seed to sprout.  The copy is faster
> than pvmove, rsync, dd, or rpm-ostree deploy.

This sounds great!

I just tried it (on Fedora 29), but those steps don't work for me:

# cryptsetup --readonly luksOpen /dev/nbd0p4 tmp
# mount -o noatime /dev/mapper/tmp /mnt/tmp
# mount: /mnt/tmp: WARNING: device write-protected, mounted read-only.
# btrfs device add /dev/nbd1 /mnt/tmp   
Performing full device TRIM /dev/nbd1 (4.00GiB) ...
ERROR: error adding device '/dev/nbd1': Read-only file system

Am I missing something?

Best regards
Georg
-- 
'NEVER USE A GNUPG VERSION YOU JUST DOWNLOADED TO CHECK THE
INTEGRITY OF THE SOURCE - USE AN EXISTING GNUPG INSTALLATION!'
(http://lists.gnupg.org/pipermail/gnupg-announce/2006q4/000239.html)


Re: Installation image layout

2018-10-16 Thread Martin Kolman
On Fri, 2018-10-12 at 15:53 -0700, Adam Williamson wrote:
> On Fri, 2018-10-12 at 15:44 -0600, Chris Murphy wrote:
> > On Fri, Oct 12, 2018 at 4:30 AM, Marek Marczykowski-Górecki
> >  wrote:
> > > On Thu, Oct 11, 2018 at 09:24:08PM -0600, Chris Murphy wrote:
> > > > I'm pretty sure the original reason was the default live install use
> > > > dd to block copy the root file system into the fedora-root LV, and
> > > > then resized the LV and ext4 file system.
> > > 
> > > How is it done now?
> > 
> > On Live media installs, anaconda does:
> > 
> > rsync -pogAXtlHrDx --exclude /dev/ --exclude /proc/ --exclude /sys/
> > --exclude /run/ --exclude /boot/*rescue* --exclude /etc/machine-id
> > /mnt/install/source/ /mnt/sysimage
> > 
> > On DVD and netinstalls, I'm guessing based on packaging.log that it's
> > a dnf+rpm installation even though I never see a dnf or rpm process in
> > either top or ps. In any case, the rpm packages are directly on the
> > iso9660 file system, not baked into the
> 
> anaconda uses dnf's python interface, it does not *run* 'dnf'.
> 
> https://github.com/rhinstaller/anaconda/blob/master/pyanaconda/payload/dnfpayload.py
Yep, but the DNF Python code still actually runs in a Python subprocess.
This is needed because apparently something during the package installation
transaction - most likely RPM - does a chroot. If the DNF code ran directly
in the Anaconda process, Anaconda would get chrooted as well and
BAD THINGS (TM) would happen: anything from missing icons to GTK crashing
because files it uses suddenly vanish.
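
The isolation can be illustrated without root by using the working
directory as a stand-in for the root directory: both are per-process
state that a child inherits but cannot change for its parent. This is
only a sketch of the principle, not Anaconda's actual code:

```shell
# A subshell gets its own copy of process state (a forked child in common
# shells). What it changes, the working directory here or the root
# directory in the chroot case, stays in the child.
before=$(pwd)
(
    cd / || exit 1
    # a package transaction that calls chroot() would run at this point
    pwd > /dev/null
)
after=$(pwd)

if [ "$before" = "$after" ]; then
    echo "parent unaffected"
fi
```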

> -- 
> Adam Williamson
> Fedora QA Community Monkey
> IRC: adamw | Twitter: AdamW_Fedora | XMPP: adamw AT happyassassin . net
> http://www.happyassassin.net


Re: Installation image layout

2018-10-16 Thread Colin Walters
On Thu, Oct 11, 2018, at 8:37 PM, Marek Marczykowski-Górecki wrote:
> Hi all!
> 
> I'm new on this list. I work on Qubes OS, where Fedora is used as a base
> distribution.

Tangentially: Qubes is very cool and I'm glad you find Fedora useful
as a base system.  I work on Fedora CoreOS and have patches in a
lot of OS components; lorax, systemd, etc. If there's something blocking
you feel free to reach out and I may be able to
spend some time to help.  Also, if you decide to investigate using
rpm-ostree for the Qubes dom0 - I'd be very interested in helping. 

> *guess* it's there for historical reason, from before
> aufs/overlayfs being available. Is there any other reason for that?

There is one thing to consider; overlayfs-on-overlayfs is not supported,
so this would break podman/docker out of the box.   But we could probably
have them fall back to the vfs backend; it's not like we really care a lot
about performance in the live media case.

One side note: for OSTree-based systems we have some built-in support for
the read-only media case:
https://github.com/ostreedev/ostree/commit/ff6883ca0655ac8844cd783caf6a7d8815515ba3
Which then changed to use systemd:
https://github.com/ostreedev/ostree/commit/05d0ee5cbecd1287b87d38e969862a5d8b1f2e58
That still has /etc and /sysroot be read-only though.  Extending to overlayfs
for / would allow e.g. `rpm-ostree install` to work as well if the upperdir
is on a writable layer.
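
For reference, an overlayfs mount of this kind takes a read-only lower
directory plus an upper and a work directory that live on the same
writable filesystem. A minimal sketch with hypothetical paths;
overlay_opts only builds the option string, and the mount itself would
need root:

```shell
# overlay_opts LOWER UPPER WORK: build the -o string for an overlay mount.
overlay_opts() {
    printf 'lowerdir=%s,upperdir=%s,workdir=%s\n' "$1" "$2" "$3"
}

# Hypothetical paths: read-only image at /run/rootfsbase, tmpfs at /run/rw.
opts=$(overlay_opts /run/rootfsbase /run/rw/upper /run/rw/work)

# Print the command instead of running it.
echo "mount -t overlay overlay -o $opts /sysroot"
```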


Re: Installation image layout

2018-10-15 Thread Matthew Miller
On Mon, Oct 15, 2018 at 05:52:29PM +0200, Marek Marczykowski-Górecki wrote:
> Anyway, can somebody help me with change proposal? For example I'm not
> sure if this is "Self Contained" or "System Wide" Change, or what should
> specifically be listed in "Scope". If IRC would be more appropriate for
> such discussion, that's fine for me too.

I would suggest system-wide, since every edition and spin relies on the
installer.

"Scope" should cover everything you're changing, and everyone who is
impacted in some way (whether they need to be directly involved, or are
impacted and need to change something, or whether they just might like to be
aware).


-- 
Matthew Miller

Fedora Project Leader


Re: Installation image layout

2018-10-15 Thread Marek Marczykowski-Górecki
On Mon, Oct 15, 2018 at 08:30:39AM -0700, Brian C. Lane wrote:
> On Sun, Oct 14, 2018 at 02:21:47PM -0400, Neal Gompa wrote:
> > > Neal, any ideas who Marek could be a co-owner of the feature and help
> > > navigate the Fedora process? Maybe someone on the Anaconda or releng
> > > teams?
> > >
> > 
> > Brian C. Lane from the Weldr team is probably the guy to work with on
> > this. He is the chief developer of Lorax, which is where
> > livemedia-creator comes from. I've CC'd him to this email.
> 
> 
> Thanks, Marek and I are already in touch :) As long as overlayfs can do
> what we need the bulk of the extra work needs to be done in
> anaconda-dracut.

FWIW the change to anaconda-dracut to support _only_ overlayfs is
quite small:
https://github.com/marmarek/qubes-installer-qubes-os/commit/332be8e1e3e1006013772528078914f491d14c1f#diff-aa15299e3bf81c1e427c9d521d63778f

And it drops the dependency on the dmsquash-live module.

> We may also want to make this switch an option for a bit, while we work
> out the details.

Support for both layouts will be more tricky, because of the split between
anaconda-dracut and dmsquash-live. Integrating (parts of?) the latter in
the former would make it much easier.
But IMO it's worth making it support both layouts, at least for now.

Anyway, can somebody help me with the change proposal? For example, I'm not
sure if this is a "Self Contained" or a "System Wide" Change, or what should
specifically be listed in "Scope". If IRC would be more appropriate for
such a discussion, that's fine with me too.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?




Re: Installation image layout

2018-10-15 Thread Brian C. Lane
On Sun, Oct 14, 2018 at 02:21:47PM -0400, Neal Gompa wrote:
> > Neal, any ideas who Marek could be a co-owner of the feature and help
> > navigate the Fedora process? Maybe someone on the Anaconda or releng
> > teams?
> >
> 
> Brian C. Lane from the Weldr team is probably the guy to work with on
> this. He is the chief developer of Lorax, which is where
> livemedia-creator comes from. I've CC'd him to this email.


Thanks, Marek and I are already in touch :) As long as overlayfs can do
what we need the bulk of the extra work needs to be done in
anaconda-dracut.

We may also want to make this switch an option for a bit, while we work
out the details.

-- 
Brian C. Lane (PST8PDT)


Re: Installation image layout

2018-10-14 Thread Chris Murphy
On Sun, Oct 14, 2018 at 12:21 PM, Neal Gompa  wrote:
> On Sun, Oct 14, 2018 at 1:58 PM Chris Murphy  wrote:
>>
>> On Sat, Oct 13, 2018 at 6:24 PM, Neal Gompa  wrote:

>> > This is a really interesting idea...
>>
>>
>> https://lore.kernel.org/linux-btrfs/CAJCQCtTPwQnzwkpk=4zszxfwtc7hymyetxp-9xuu_tsvotw...@mail.gmail.com/t.atom
>>
>
> I'm interested to see how that thread turns out... It's a tempting
> idea, because it gives you so much more flexibility. Installation onto
> a disk could be a "btrfs send" and overlay changes could be easily
> flattened on top of the target system. It'd also be much cheaper and
> lighter for supporting the live environment.

Ha! I just realized after all this time that the Btrfs wiki does not
make clear how to make a sprout, even though it mentions the more
esoteric recursive seed.[1] Of course you can mkfs.btrfs, mount it,
and send/receive. But send requires read only snapshots. Making a
sprout is easier, you just remove the seed device. This is supported
since 2009.

# losetup -r /dev/loop0 root.img
# mount /dev/loop0 /mnt/
# btrfs device add /dev/sda3 /mnt
# mount -o remount,rw /mnt
# btrfs device remove /dev/loop0 /mnt

And now it replicates extents from seed to sprout.  The copy is faster
than pvmove, rsync, dd, or rpm-ostree deploy.

OK, so let's say you have a USB stick 'sdb' and an internal drive 'sda',
and the stick already has a Fedora LiveOS imaged on it; the only change is
that root.img is a Btrfs seed. The simplistic systemd pre-mount and
mount steps look like:

# losetup -r /dev/loop0 root.img
# mount -t btrfs /dev/loop0 /
# btrfs device add /dev/zram1 /
# mount -t btrfs -o remount,rw /

- now you have a live overlay in RAM; user can start using this LiveOS
environment including making changes like installing software; setting
up non-volatile persistence on the stick looks like:

# btrfs device add /dev/sdb3 /
# btrfs device remove /dev/zram1 /
# echo 1 > /sys/class/zram-control/hot_remove

- now the extents on zram1 are moved from zram1 to sdb3 (the stick);
setting up an installation to the internal drive 'sda' by "flattening"
as you say, merely means adding the internal drive to the mounted
Btrfs volume and removing all others:

# btrfs device add /dev/sda3 /
# btrfs device remove /dev/sdb3 /
# btrfs device remove /dev/loop0 /

- now extents on sdb3 (stick) and loop0 (seed) are copied to sda3
(internal), including any changes the user is making while all of this
is happening. In fact, the user does not even have to reboot because
once the operation finishes, and the loop is torn down, the stick is
not in use by the kernel. The user can just unplug the stick and keep
working. A spin or downstream could very sanely, and straightforwardly
build a no-UI OS installation.

It's not obvious that 'btrfs device add' incorporates a mkfs, and that
you can then just remove the ro seed. Also not obvious is that 'dev add'
on an ro-mounted seed causes a new volume UUID to be generated, which
libblkid discovers immediately. The kernel knows that this new
volume is a two-device (or three-device, whatever the case is) Btrfs
and which devices they are. And this is such basic Btrfs handling code
that the GRUB and extlinux Btrfs code understands it.
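
A quick way to observe the new volume UUID is to compare what libblkid
reports for the seed and for the added device. A minimal sketch, assuming
the blkid utility from util-linux; fs_uuid and same_volume are
hypothetical helper names:

```shell
# fs_uuid DEV: filesystem UUID as reported by libblkid (may need root).
fs_uuid() {
    blkid -s UUID -o value "$1"
}

# same_volume UUID_A UUID_B: succeed if both are non-empty and equal.
same_volume() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Usage with the devices from the example above, after 'btrfs device add':
#   if same_volume "$(fs_uuid /dev/loop0)" "$(fs_uuid /dev/sda3)"; then
#       echo "same fs UUID"
#   else
#       echo "sprout has its own fs UUID"
#   fi
```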

[1]
https://btrfs.wiki.kernel.org/index.php/Seed-device


-- 
Chris Murphy


Re: Installation image layout

2018-10-14 Thread Neal Gompa
On Sun, Oct 14, 2018 at 1:58 PM Chris Murphy  wrote:
>
> On Sat, Oct 13, 2018 at 6:24 PM, Neal Gompa  wrote:
> > On Sat, Oct 13, 2018 at 8:17 PM Chris Murphy  
> > wrote:
> >>
> >> On Fri, Oct 12, 2018 at 5:26 PM, Marek Marczykowski-Górecki
> >>  wrote:
> >> > On Fri, Oct 12, 2018 at 03:44:38PM -0600, Chris Murphy wrote:
> >>
> >> >> mkfs.btrfs has --rootdir and --shrink features to pre-allocate a
> >> >> volume with files at mkfs time; I have no idea to what degree it
> >> >> depends on kernel code.
> >> >
> >> > Probably not at all, given it works as non-root user too.
> >> > I've tried to run it twice on the same directory (and with the same
> >> > --uuid) on 32MB of data and got different images (~2000 lines of hexdump
> >> > diff). Could be some timestamps, could be something else.
> >>
> >> There is volume UUID which is what --uuid affects. But there are other
> >> uuids, including the chunk uuid which gets repeated in every leaf and
> >> node along with the volume uuid, device uuid, each files tree
> >> (subvolume) get its own uuid, etc. Time stamps include atime, otime,
> >> mtime, and ctime. Some objects have all 0's for uuid, and some items
> >> have only 0.0 for times. I'll float the reproducibility question on
> >> the Btrfs list, if it's desirable, useful, and how difficult it is. I
> >> think subsetting Btrfs features to reduce complexity generally, and
> >> therefore increase reproducibility as a consequence of that, has
> >> merit.
> >>
> >
> > This is a really interesting idea...
>
>
> https://lore.kernel.org/linux-btrfs/CAJCQCtTPwQnzwkpk=4zszxfwtc7hymyetxp-9xuu_tsvotw...@mail.gmail.com/t.atom
>

I'm interested to see how that thread turns out... It's a tempting
idea, because it gives you so much more flexibility. Installation onto
a disk could be a "btrfs send" and overlay changes could be easily
flattened on top of the target system. It'd also be much cheaper and
lighter for supporting the live environment.

>
> > squashfs has supported zstd along with btrfs since kernel 4.14. zstd
> > support was mainlined into squashfs-tools a year ago:
> > https://github.com/plougher/squashfs-tools/commit/6113361316d5ce5bfdc118d188e5617a1fcd747c
> >
> > However, there's been no releases since the migration from CVS on SF
> > to Git on GitHub.
>
> Ahh I missed that. And looking at koji, it seems like squashfs-tools
> are currently FTBFS on Fedora 29. I have F29 but
> squashfs-tools-4.3-16.fc28.x86_64.
>
>
> OK, so it sounds to me like the current proposals for this thread as
> it relates to installer images for Fedora 30:
>
> - Drop devicemapper in favor of overlayfs
> - Drop squashfs+ext4 images in favor of squashfs only image
> - Maybe move to zstd in the squashfs image
>
> I think part of the feature/change proposal should be building an
> example LiveOS image in copr so we can get an idea of how to blow it
> up, and ask QA to run it through OpenQA tests and see what sorts of
> things break there.
>
> Neal, any ideas who Marek could be a co-owner of the feature and help
> navigate the Fedora process? Maybe someone on the Anaconda or releng
> teams?
>

Brian C. Lane from the Weldr team is probably the guy to work with on
this. He is the chief developer of Lorax, which is where
livemedia-creator comes from. I've CC'd him to this email.


--
真実はいつも一つ!/ Always, there's only one truth!


Re: Installation image layout

2018-10-14 Thread Chris Murphy
On Sat, Oct 13, 2018 at 6:24 PM, Neal Gompa  wrote:
> On Sat, Oct 13, 2018 at 8:17 PM Chris Murphy  wrote:
>>
>> On Fri, Oct 12, 2018 at 5:26 PM, Marek Marczykowski-Górecki
>>  wrote:
>> > On Fri, Oct 12, 2018 at 03:44:38PM -0600, Chris Murphy wrote:
>>
>> >> mkfs.btrfs has --rootdir and --shrink features to pre-allocate a
>> >> volume with files at mkfs time; I have no idea to what degree it
>> >> depends on kernel code.
>> >
>> > Probably not at all, given it works as non-root user too.
>> > I've tried to run it twice on the same directory (and with the same
>> > --uuid) on 32MB of data and got different images (~2000 lines of hexdump
>> > diff). Could be some timestamps, could be something else.
>>
>> There is volume UUID which is what --uuid affects. But there are other
>> uuids, including the chunk uuid which gets repeated in every leaf and
>> node along with the volume uuid, device uuid, each files tree
>> (subvolume) get its own uuid, etc. Time stamps include atime, otime,
>> mtime, and ctime. Some objects have all 0's for uuid, and some items
>> have only 0.0 for times. I'll float the reproducibility question on
>> the Btrfs list, if it's desirable, useful, and how difficult it is. I
>> think subsetting Btrfs features to reduce complexity generally, and
>> therefore increase reproducibility as a consequence of that, has
>> merit.
>>
>
> This is a really interesting idea...


https://lore.kernel.org/linux-btrfs/CAJCQCtTPwQnzwkpk=4zszxfwtc7hymyetxp-9xuu_tsvotw...@mail.gmail.com/t.atom


> squashfs has supported zstd along with btrfs since kernel 4.14. zstd
> support was mainlined into squashfs-tools a year ago:
> https://github.com/plougher/squashfs-tools/commit/6113361316d5ce5bfdc118d188e5617a1fcd747c
>
> However, there's been no releases since the migration from CVS on SF
> to Git on GitHub.

Ahh I missed that. And looking at koji, it seems like squashfs-tools
are currently FTBFS on Fedora 29. I have F29 but
squashfs-tools-4.3-16.fc28.x86_64.


OK, so it sounds to me like the current proposals for this thread as
it relates to installer images for Fedora 30:

- Drop devicemapper in favor of overlayfs
- Drop squashfs+ext4 images in favor of squashfs only image
- Maybe move to zstd in the squashfs image

I think part of the feature/change proposal should be building an
example LiveOS image in copr so we can get an idea of how to blow it
up, and ask QA to run it through OpenQA tests and see what sorts of
things break there.

Neal, any idea who could co-own the feature with Marek and help
navigate the Fedora process? Maybe someone on the Anaconda or releng
teams?


-- 
Chris Murphy


Re: Installation image layout

2018-10-13 Thread Neal Gompa
On Sat, Oct 13, 2018 at 8:17 PM Chris Murphy  wrote:
>
> On Fri, Oct 12, 2018 at 5:26 PM, Marek Marczykowski-Górecki
>  wrote:
> > On Fri, Oct 12, 2018 at 03:44:38PM -0600, Chris Murphy wrote:
>
> >> mkfs.btrfs has --rootdir and --shrink features to pre-allocate a
> >> volume with files at mkfs time; I have no idea to what degree it
> >> depends on kernel code.
> >
> > Probably not at all, given it works as non-root user too.
> > I've tried to run it twice on the same directory (and with the same
> > --uuid) on 32MB of data and got different images (~2000 lines of hexdump
> > diff). Could be some timestamps, could be something else.
>
> There is volume UUID which is what --uuid affects. But there are other
> uuids, including the chunk uuid which gets repeated in every leaf and
> node along with the volume uuid, device uuid, each files tree
> (subvolume) get its own uuid, etc. Time stamps include atime, otime,
> mtime, and ctime. Some objects have all 0's for uuid, and some items
> have only 0.0 for times. I'll float the reproducibility question on
> the Btrfs list, if it's desirable, useful, and how difficult it is. I
> think subsetting Btrfs features to reduce complexity generally, and
> therefore increase reproducibility as a consequence of that, has
> merit.
>

This is a really interesting idea...

>
> >> It's also
> >> possible with dm-verity or dm-integrity but then that adds back the dm
> >> complexity.
> >
> > Oh, please, no...
>
> Haha...
>

This made me giggle a bit. :)

> >
> > There are two almost separate aspects here:
> >  - image layout (squashfs+ext4, squashfs alone, squashfs+btrfs)
> >  - how copy-on-write is achieved (dm-snapshot, overlay fs)
>
> ext4 alone, and btrfs alone are also viable. But since ext4 has no
> compression, image size grows by maybe a factor of 2. Btrfs supports
> lzo and zlib compression since forever, and zstd since kernel 4.14,
> same as squashfs. What's been missing is mksquashfs with zstd support,
> which I imagine will be in 5.0. The compression ratio compares well
> with xz currently being used by mksquashfs in Fedora composes, but
> with much less CPU to compress and decompress. So I'd say go with zstd
> in any case.
>

squashfs has supported zstd along with btrfs since kernel 4.14. zstd
support was mainlined into squashfs-tools a year ago:
https://github.com/plougher/squashfs-tools/commit/6113361316d5ce5bfdc118d188e5617a1fcd747c

However, there's been no releases since the migration from CVS on SF
to Git on GitHub.

>
> >
> > For reproducibility, squashfs alone is the best option, but does not
> > improve integrity checking (but also doesn't make it worse).
>
> I'm not able to estimate how much work it is to add a files hash
> manifest to squashfs, and to always use it on reads, and then add some
> error handling to EIO upon any mismatch. But yeah it'd need user space
> code in mksquashfs and also kernel code to support it.
>
>
> > As for copy-on-write, dm-snapshot is quite complex to setup and require
> > underlying FS to support write. Also, doesn't allow to write more data
> > than original image size (may be an issue for persistent partition
> > case). Overlay fs on the other hand works with any underlying fs, you
> > can write as much data as you want. And in case of persistent partition,
> > you can access that data even if base image (the lower layer) is
> > unavailable/broken. I think the only downside of overlay fs is when you
> > modify large file it gets copied in full to the upper layer. But I don't
> > think that's an issue in this use case.
> >
> > For me, overlay fs is a clear winner here.
> > But as for image layout, it isn't that simple. For reproducibility,
> > squashfs alone is better. But if the goal of this change would be also
> > improving read errors detection, then it isn't that clear anymore. It
> > may be that it takes a simple mkfs.btrfs patch to make it reproducible,
> > but it isn't obvious for me at this stage. Also, keeping two layers
> > looks like unnecessary complexity.
>
> I agree. Overlayfs works fine with any of the discussed filesystems.
> I'd give a slight edge to Btrfs seed+sprout as the overlay mechanism
> in the case of persistence on a USB stick: a) checksumming b)
> compression helps improve performance of USB flash drives and reduces
> wear c) kernel discovers both seed and sprout in early boot by sprout
> uuid alone, no special mount options needed for setup. But it's a
> really minor point because a) and b) are still possible with overlayfs
> with a new independent btrfs as the upperdir.
>
>
> > What do you think about sidestepping this discussion a little and
> > replacing dm-snapshot with overlay fs regardless of other changes here?
> > That should be doable without any change to image format and will give
> > more flexibility there.
>
> Agreed. What I can't tell you off hand is if livecd-iso-to-disk would
> be affected by this in some way; or whether the change policy applies.
> But I think it's better to file 

Re: Installation image layout

2018-10-13 Thread Chris Murphy
On Fri, Oct 12, 2018 at 5:26 PM, Marek Marczykowski-Górecki
 wrote:
> On Fri, Oct 12, 2018 at 03:44:38PM -0600, Chris Murphy wrote:

>> mkfs.btrfs has --rootdir and --shrink features to pre-allocate a
>> volume with files at mkfs time; I have no idea to what degree it
>> depends on kernel code.
>
> Probably not at all, given it works as non-root user too.
> I've tried to run it twice on the same directory (and with the same
> --uuid) on 32MB of data and got different images (~2000 lines of hexdump
> diff). Could be some timestamps, could be something else.

There is the volume UUID, which is what --uuid affects. But there are other
uuids, including the chunk uuid, which gets repeated in every leaf and
node along with the volume uuid; the device uuid; each file tree
(subvolume) gets its own uuid, etc. Time stamps include atime, otime,
mtime, and ctime. Some objects have all 0's for uuid, and some items
have only 0.0 for times. I'll float the reproducibility question on
the Btrfs list: whether it's desirable, useful, and how difficult it is. I
think subsetting Btrfs features to reduce complexity generally, and
therefore increase reproducibility as a consequence of that, has
merit.
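
The experiment is easy to repeat with a helper that counts the byte
positions at which two images differ. A minimal sketch, assuming POSIX
cmp; count_diff_bytes is a hypothetical name, and the mkfs.btrfs lines in
the comment use only the options mentioned in this thread, with made-up
file names and a UUID variable you would set yourself:

```shell
# count_diff_bytes A B: number of byte positions at which A and B differ.
# cmp -l prints one line per differing byte; if the sizes differ, only
# the common prefix is compared.
count_diff_bytes() {
    cmp -l "$1" "$2" 2>/dev/null | wc -l | tr -d ' '
}

# Usage (no root needed when writing to regular files; depending on the
# btrfs-progs version the image file may need to exist first):
#   mkfs.btrfs --rootdir tree/ --shrink --uuid "$UUID" a.img
#   mkfs.btrfs --rootdir tree/ --shrink --uuid "$UUID" b.img
#   count_diff_bytes a.img b.img   # 0 for equal-sized, identical images
```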


>> It's also
>> possible with dm-verity or dm-integrity but then that adds back the dm
>> complexity.
>
> Oh, please, no...

Haha...

>
> There are two almost separate aspects here:
>  - image layout (squashfs+ext4, squashfs alone, squashfs+btrfs)
>  - how copy-on-write is achieved (dm-snapshot, overlay fs)

ext4 alone and btrfs alone are also viable. But since ext4 has no
compression, image size grows by maybe a factor of 2. Btrfs has supported
lzo and zlib compression since forever, and zstd since kernel 4.14,
same as squashfs. What's been missing is mksquashfs with zstd support,
which I imagine will be in 5.0. The compression ratio compares well
with the xz currently used by mksquashfs in Fedora composes, but
with much less CPU to compress and decompress. So I'd say go with zstd
in any case.


>
> For reproducibility, squashfs alone is the best option, but does not
> improve integrity checking (but also doesn't make it worse).

I'm not able to estimate how much work it is to add a file hash
manifest to squashfs, to always use it on reads, and then to add error
handling that returns EIO upon any mismatch. But yeah it'd need user space
code in mksquashfs and also kernel code to support it.
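
To illustrate the idea of a hash manifest that is checked on every
read, here is a minimal user-space sketch. It is not how squashfs or
the kernel would implement it; the in-memory dict standing in for the
file system, the choice of SHA-256, and the function names are all
assumptions for illustration.

```python
import errno
import hashlib
from typing import Dict


def build_manifest(files: Dict[str, bytes]) -> Dict[str, str]:
    """Record a SHA-256 digest per path at image build time."""
    return {path: hashlib.sha256(data).hexdigest()
            for path, data in files.items()}


def checked_read(files: Dict[str, bytes],
                 manifest: Dict[str, str],
                 path: str) -> bytes:
    """Return file contents, or raise EIO if they no longer match the manifest."""
    data = files[path]
    if hashlib.sha256(data).hexdigest() != manifest[path]:
        # Surface corruption as an I/O error instead of returning bad data.
        raise OSError(errno.EIO, "checksum mismatch", path)
    return data


if __name__ == "__main__":
    image = {"/etc/os-release": b"NAME=Fedora\n"}
    manifest = build_manifest(image)
    print(checked_read(image, manifest, "/etc/os-release"))

    # Simulate a flipped bit on the media after the manifest was written.
    image["/etc/os-release"] = b"NAME=Fedorb\n"
    try:
        checked_read(image, manifest, "/etc/os-release")
    except OSError as err:
        print("read failed:", errno.errorcode[err.errno])
```

The point of the design is that a reader gets an error rather than
silently corrupt data, which is the behavior being asked for above.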


> As for copy-on-write, dm-snapshot is quite complex to setup and require
> underlying FS to support write. Also, doesn't allow to write more data
> than original image size (may be an issue for persistent partition
> case). Overlay fs on the other hand works with any underlying fs, you
> can write as much data as you want. And in case of persistent partition,
> you can access that data even if base image (the lower layer) is
> unavailable/broken. I think the only downside of overlay fs is when you
> modify large file it gets copied in full to the upper layer. But I don't
> think that's an issue in this use case.
>
> For me, overlay fs is a clear winner here.
> But as for image layout, it isn't that simple. For reproducibility,
> squashfs alone is better. But if the goal of this change would be also
> improving read errors detection, then it isn't that clear anymore. It
> may be that it takes a simple mkfs.btrfs patch to make it reproducible,
> but it isn't obvious for me at this stage. Also, keeping two layers
> looks like unnecessary complexity.

I agree. Overlayfs works fine with any of the discussed filesystems.
I'd give a slight edge to Btrfs seed+sprout as the overlay mechanism
in the case of persistence on a USB stick: a) checksumming b)
compression helps improve performance of USB flash drives and reduces
wear c) kernel discovers both seed and sprout in early boot by sprout
uuid alone, no special mount options needed for setup. But it's a
really minor point because a) and b) are still possible with overlayfs
with a new independent btrfs as the upperdir.
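
For reference, the overlayfs setup being discussed boils down to one
mount call with a lower (read-only) directory, an upper (writable)
directory, and a work directory, where upperdir and workdir have to be
on the same file system. The sketch below only builds the command line
and does not run it; the paths are hypothetical placeholders, not what
dracut or lorax actually use.

```python
from typing import List


def overlay_mount_argv(lower: str, upper: str, work: str,
                       target: str) -> List[str]:
    """Build the argv for an overlayfs mount without executing it."""
    options = "lowerdir={},upperdir={},workdir={}".format(lower, upper, work)
    return ["mount", "-t", "overlay", "overlay", "-o", options, target]


if __name__ == "__main__":
    # Hypothetical layout: read-only base image mounted at /run/base,
    # writable file system (for example a fresh btrfs) mounted at /run/rw.
    argv = overlay_mount_argv(
        lower="/run/base",
        upper="/run/rw/upper",
        work="/run/rw/work",
        target="/sysroot",
    )
    print(" ".join(argv))
```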


> What do you think about sidestepping this discussion a little and
> replacing dm-snapshot with overlay fs regardless of other changes here?
> That should be doable without any change to image format and will give
> more flexibility there.

Agreed. What I can't tell you offhand is whether livecd-iso-to-disk would
be affected by this in some way, or whether the change policy applies.
But I think it's better to file the change so there's awareness and
coordination: the installer team would have to sign off on the pull
request for lorax; the releng team probably should know about it
because they define their own compose settings (I guess they often use
upstream's defaults but they don't have to); QA might want a
heads up so if things blow up they know who to ask; and
it's also a good idea to let the SOAS folks know about it. A central
point of filing changes is coordination.

https://fedoraproject.org/wiki/Changes/Policy



-- 
Chris Murphy

Re: Installation image layout

2018-10-12 Thread Chris Murphy
On Fri, Oct 12, 2018 at 3:44 PM, Chris Murphy  wrote:
> On Fri, Oct 12, 2018 at 4:30 AM, Marek Marczykowski-Górecki
>  wrote:
>> On Thu, Oct 11, 2018 at 09:24:08PM -0600, Chris Murphy wrote:
>
>>> I'm pretty sure the original reason was the default live install use
>>> dd to block copy the root file system into the fedora-root LV, and
>>> then resized the LV and ext4 file system.
>>
>> How is it done now?
>
> On Live media installs, anaconda does:
>
> rsync -pogAXtlHrDx --exclude /dev/ --exclude /proc/ --exclude /sys/
> --exclude /run/ --exclude /boot/*rescue* --exclude /etc/machine-id
> /mnt/install/source/ /mnt/sysimage
>
> On DVD and netinstalls, I'm guessing based on packaging.log that it's
> a dnf+rpm installation even though I never see a dnf or rpm process in
> either top or ps. In any case, the rpm packages are directly on the
> iso9660 file system, not baked into the

One other thing that really hogs system resources, for some reason, is
one of the loopback devices: I think it's loop1, which is root.img. It
uses nearly 100% CPU for the duration of the installation from LiveOS
media. I don't know why, but it might be worth benchmarking nbd-based
mounts for comparison. The installation turns my computers into hair
dryers. The installation bottleneck should be reading the compressed
root image, not CPU.



-- 
Chris Murphy
___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org


Re: Installation image layout

2018-10-12 Thread Marek Marczykowski-Górecki
On Fri, Oct 12, 2018 at 03:44:38PM -0600, Chris Murphy wrote:
> On Fri, Oct 12, 2018 at 4:30 AM, Marek Marczykowski-Górecki
>  wrote:
> > On Thu, Oct 11, 2018 at 09:24:08PM -0600, Chris Murphy wrote:
> >> Why does efiboot.img have a 32MiB limit?
> >
> > Because "32MB should be enough for everybody"...
> > Long story short, "El Torito" boot catalog structure have 16-bit field
> > for image size (expressed in 512-bytes sectors). For details see here:
> > https://wiki.osdev.org/El-Torito
> > https://web.archive.org/web/20180112220141/https://download.intel.com/support/motherboards/desktop/sb/specscdrom.pdf
> > (page 10)
> 
> OK. On Fedora 28 media, efiboot.img is ~9.2 MiB and does not contain
> either the kernel or initramfs.

I know, this particular problem was specific to Qubes OS, where
kernel+initramfs needed to be on the ESP because of a Xen+EFI limitation
(basically the kernel needs to be loaded through UEFI instead of
by grub, so it needs to live on something that UEFI understands). And
actually recent Xen versions don't have this limitation anymore (at
least in theory...). This is just a bit of context on how it all
got here, much less relevant today.

(...)

> > Full story:
> > https://github.com/QubesOS/qubes-issues/issues/794#issuecomment-135988806
> >
> > I've spent a lot of time debugging this, because mkisofs doesn't
> > complain about it, just silently overflow higher bits to adjacent field,
> > which results in weird results, depending on where you boot it. Adding
> > isohybrid to the picture doesn't make it easier (there, higher bits are
> > truncated, or actually not copied to the MBR partition table, as wasn't
> > part of the original field).
> 
> 
> I think we're stuck with isohybrid for a while. Having UEFI and BIOS
> bootloaders, along with isohybrid supporting both as well as Macs, all
> on one media image, that can be burned to optical media and written to
> a USB stick - is hugely beneficial.

I have no problem with isohybrid alone. It's a major hack, but definitely
worth it.

> The compose process takes about 12 hours. That every ISO for all the
> editions, and the spins, and the VM images, for all archs. Even having
> separate UEFI and BIOS images, or splitting out Macs with their own
> image, it'll increase compose times and complexity across the board.

And also complexity for the users - which image to download. I totally
understand why it is beneficial.

(...)

> >> I did give all of these things some thought a long time ago when I ran
> >> into a lorax hack by Will Woods who used Btrfs as the root.img file
> >> system, I'm not sure why it was used. But it gave me the idea of using
> >> a few features built into Btrfs specifically for this use case:
> >>
> >> - seed/sprout feature can be used with zram block device for volatile
> >> overlay; and used with a blank partition on the stick for persistent
> >> overlay. Discovery is part of the btrfs kernel code.
> >>
> >> - Since metadata and data is always checksummed on every read, we
> >> wouldn't have to depend on the slow and transient ISO checksum
> >> (rd.live.check which uses checkisomd5) which likewise breaks when
> >> creating a stick with livecd-iso-to-disk.
> >>
> >> - Btrfs supports zstd compression. I did some testing and squashfs is
> >> still a bit more efficient because it compresses fs metadata, whereas
> >> Btrfs only compresses data extents.
> >>
> >> The gotcha here is the resulting image isn't going to be bit for bit
> >> reproducible: UUIDs and time stamps are strewn throughout the file
> >> system (similar to ext4 and XFS), but any sufficiently complex file
> >> system is going to have this problem.
> >
> > I wouldn't worry about _files_ timestamps that much - in most cases this is
> > solvable problem by elaborate enough find+touch[4]. But that's not all
> > obviously, there are various timestamps in superblock, and other
> > metadata. The most problematic part in "normal" filesystems, using
> > kernel driver is inode allocation, block allocation etc. This greatly
> > depends on timing, ordering, specific kernel version etc.
> > See [5] for details.
> 
> 
> mkfs.btrfs has --rootdir and --shrink features to pre-allocate a
> volume with files at mkfs time; I have no idea to what degree it
> depends on kernel code. 

Probably not at all, given it works as non-root user too.
I've tried to run it twice on the same directory (and with the same
--uuid) on 32MB of data and got different images (~2000 lines of hexdump
diff). Could be some timestamps, could be something else.

> The main benefit with this is it's really easy
> to implement full checksum matching for metadata and data on every
> read, and user space ends up with EIO instead of corrupt data, and
> super clear kernel complaints. And such corruption whether on optical
> or USB sticks, is common. Even the more rare case of a stick that
> passes md5 checksum, can later have transient and silent corruption
> that ends up showing up in weird ways.
> 
> It's 

Re: Installation image layout

2018-10-12 Thread Adam Williamson
On Fri, 2018-10-12 at 15:44 -0600, Chris Murphy wrote:
> On Fri, Oct 12, 2018 at 4:30 AM, Marek Marczykowski-Górecki
>  wrote:
> > On Thu, Oct 11, 2018 at 09:24:08PM -0600, Chris Murphy wrote:
> > > I'm pretty sure the original reason was the default live install use
> > > dd to block copy the root file system into the fedora-root LV, and
> > > then resized the LV and ext4 file system.
> > 
> > How is it done now?
> 
> On Live media installs, anaconda does:
> 
> rsync -pogAXtlHrDx --exclude /dev/ --exclude /proc/ --exclude /sys/
> --exclude /run/ --exclude /boot/*rescue* --exclude /etc/machine-id
> /mnt/install/source/ /mnt/sysimage
> 
> On DVD and netinstalls, I'm guessing based on packaging.log that it's
> a dnf+rpm installation even though I never see a dnf or rpm process in
> either top or ps. In any case, the rpm packages are directly on the
> iso9660 file system, not baked into the

anaconda uses dnf's python interface, it does not *run* 'dnf'.

https://github.com/rhinstaller/anaconda/blob/master/pyanaconda/payload/dnfpayload.py
-- 
Adam Williamson
Fedora QA Community Monkey
IRC: adamw | Twitter: AdamW_Fedora | XMPP: adamw AT happyassassin . net
http://www.happyassassin.net


Re: Installation image layout

2018-10-12 Thread Chris Murphy
On Fri, Oct 12, 2018 at 4:30 AM, Marek Marczykowski-Górecki
 wrote:
> On Thu, Oct 11, 2018 at 09:24:08PM -0600, Chris Murphy wrote:

>> I'm pretty sure the original reason was the default live install use
>> dd to block copy the root file system into the fedora-root LV, and
>> then resized the LV and ext4 file system.
>
> How is it done now?

On Live media installs, anaconda does:

rsync -pogAXtlHrDx --exclude /dev/ --exclude /proc/ --exclude /sys/
--exclude /run/ --exclude /boot/*rescue* --exclude /etc/machine-id
/mnt/install/source/ /mnt/sysimage

On DVD and netinstalls, I'm guessing based on packaging.log that it's
a dnf+rpm installation even though I never see a dnf or rpm process in
either top or ps. In any case, the rpm packages are directly on the
iso9660 file system, not baked into the



>> Why does efiboot.img have a 32MiB limit?
>
> Because "32MB should be enough for everybody"...
> Long story short, "El Torito" boot catalog structure have 16-bit field
> for image size (expressed in 512-bytes sectors). For details see here:
> https://wiki.osdev.org/El-Torito
> https://web.archive.org/web/20180112220141/https://download.intel.com/support/motherboards/desktop/sb/specscdrom.pdf
> (page 10)

OK. On Fedora 28 media, efiboot.img is ~9.2 MiB and does not contain
either the kernel or initramfs. The kernel and initramfs are found on
the iso9660 file system at images/pxeboot/ and also at isolinux/ where
GRUB UEFI uses the former, and isolinux BIOS uses the latter. Both
initrds are 65M so they're already too big to go into efiboot.img -
and they kinda need to be because this particular initramfs is built
by dracut with the --no-hostonly flag so that hopefully we can boot
anything. (Curiously, the initramfs is 65M on DVD/netinstall and 50M
on LiveOS - I don't have an explanation for that. I'm looking at
Fedora 28 release images.)

From my understanding, efiboot.img would only need to contain shim,
grubia32, grubx64, and supporting bootloader-only files.

BTW, trivia: Fedora's installer creates EFI System partitions that are
always FAT16. So far as I know, no computer has complained, only
humans. FAT12/16 is OK for removable media but the spec pretty clearly
expects FAT32 for ESPs on permanent installs. The installer team
doesn't want to use mkfs flags, they expect the defaults to work
unless they don't work, and they do work, so FAT16 it is.



> Full story:
> https://github.com/QubesOS/qubes-issues/issues/794#issuecomment-135988806
>
> I've spent a lot of time debugging this, because mkisofs doesn't
> complain about it, just silently overflow higher bits to adjacent field,
> which results in weird results, depending on where you boot it. Adding
> isohybrid to the picture doesn't make it easier (there, higher bits are
> truncated, or actually not copied to the MBR partition table, as wasn't
> part of the original field).


I think we're stuck with isohybrid for a while. Having UEFI and BIOS
bootloaders, along with isohybrid supporting both as well as Macs, all
on one media image, that can be burned to optical media and written to
a USB stick - is hugely beneficial.

The compose process takes about 12 hours. That's every ISO for all the
editions, and the spins, and the VM images, for all archs. Having
separate UEFI and BIOS images, or splitting out Macs with their own
image, would increase compose times and complexity across the board.
I'm not sure which happens first: the end of optical media booting
support; or dropping support for BIOS and/or old Apple EFI Macs (only
this year did they start using UEFI, rather than their own variant of
Intel EFI pre-UEFI, so it'll take some time to see how that shakes out,
which also involves whether and how Secure Boot can ever be supported
on Macs).

This talks a bit about isohybrid and all the very clever hacks
involved to make Fedora boot practically anything with a single ISO
9660 image. (I'm being x86_64 arch specific when I say that.)

https://mjg59.dreamwidth.org/11285.html



>>
>> I did give all of these things some thought a long time ago when I ran
>> into a lorax hack by Will Woods who used Btrfs as the root.img file
>> system, I'm not sure why it was used. But it gave me the idea of using
>> a few features built into Btrfs specifically for this use case:
>>
>> - seed/sprout feature can be used with zram block device for volatile
>> overlay; and used with a blank partition on the stick for persistent
>> overlay. Discovery is part of the btrfs kernel code.
>>
>> - Since metadata and data is always checksummed on every read, we
>> wouldn't have to depend on the slow and transient ISO checksum
>> (rd.live.check which uses checkisomd5) which likewise breaks when
>> creating a stick with livecd-iso-to-disk.
>>
>> - Btrfs supports zstd compression. I did some testing and squashfs is
>> still a bit more efficient because it compresses fs metadata, whereas
>> Btrfs only compresses data extents.
>>
>> The gotcha here is the resulting image isn't going to be bit for bit

Re: Installation image layout

2018-10-12 Thread Marek Marczykowski-Górecki
On Thu, Oct 11, 2018 at 09:24:08PM -0600, Chris Murphy wrote:
> On Thu, Oct 11, 2018 at 6:37 PM, Marek Marczykowski-Górecki
>  wrote:
> > Hi all!
> >
> > I'm new on this list. I work on Qubes OS, where Fedora is used as a base
> > distribution.
> >
> > While trying to build the installation image in reproducible manner[1],
> > I found the current installation image have unusual layout. Quoting
> > dracut.cmdline manual page:
> >
> >squashfs.img  |  Squashfs from LiveCD .iso downloaded via 
> > network
> >   !(mount)
> >   /LiveOS
> >   |- rootfs.img  |  Filesystem image to mount read-only
> >!(mount)
> >/bin  |  Live filesystem
> >/boot |
> >/dev  |
> >...   |
> >
> > This rootfs.img layer makes the image build very much unreproducible.
> > Why is it even there? Bare squashfs.img layer should be enough. Then,
> > mount overlayfs over it (I see there is even some partial support for it
> > in dmsquash-live). Most other Live systems I've seen use just squashfs +
> > overlayfs (or aufs if kernel is older), so it's commonly tested
> > configuration. I *guess* it's there for historical reason, from before
> > aufs/overlayfs being available. Is there any other reason for that?
> 
> I'm pretty sure the original reason was the default live install use
> dd to block copy the root file system into the fedora-root LV, and
> then resized the LV and ext4 file system.

How is it done now?

> There have also been a
> number of squashfs improvements since that decision so there might
> have been limitations with squashfs that ext4 didn't have (I'm
> thinking xattr were long supported in ext4 before squashfs, and maybe
> capabilities?)
> 
> >
> > If there is no other reason, I propose to drop this and have
> > installer/live filesystem directly in squashfs.img. This have multiple
> > benefits:
> >  - it's much easier to make the image build process reproducible (see
> >below)
> >  - less complexity, both in the build and in the boot (the whole
> >dmsquash-live dracut module can be replaced with <20 line
> >function[2]
> >  - smaller initramfs (which is extremely important if needed to be
> >included in efiboot.img, which can't be larger than 32MB)
> >  - slightly faster boot time (device-mapper is slow)
> >
> > What do you think?
> 
> Whatever we do should take into account the persistent root and
> persistent home use cases, specifically:
> https://github.com/livecd-tools/livecd-tools/blob/master/tools/livecd-iso-to-disk.sh
> 
> --overlay-size-mb
> --home-size-mb
> 
> A particular criticism of the device-mapper solution currently being
> used is in that script: it blows up. Literally it's WORM, and deleting
> files simply dereferences them, it doesn't free up pool space, so it
> is inevitable that the pool will fill up, and when it does it blows up
> the file system, and it can't be repaired. All you can do is reset the
> overlay which means deleting all changes and starting over.
> 
> At least one of our spins, SOAS, depends on livecd-iso-to-disk for
> creating their final installation because it's predicated on running
> Fedora SOAS from a stick.
> 
> Why does efiboot.img have a 32MiB limit?

Because "32MB should be enough for everybody"...
Long story short, the "El Torito" boot catalog structure has a 16-bit field
for image size (expressed in 512-byte sectors). For details see here:
https://wiki.osdev.org/El-Torito
https://web.archive.org/web/20180112220141/https://download.intel.com/support/motherboards/desktop/sb/specscdrom.pdf
(page 10)

Full story:
https://github.com/QubesOS/qubes-issues/issues/794#issuecomment-135988806

I've spent a lot of time debugging this, because mkisofs doesn't
complain about it; it just silently overflows the higher bits into the
adjacent field, which gives weird results, depending on where you boot
it. Adding isohybrid to the picture doesn't make it easier (there, the
higher bits are truncated, or actually not copied to the MBR partition
table, as they weren't part of the original field).
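
To make the arithmetic concrete, here is a small sketch using only the
two numbers given above (a 16-bit sector count, 512-byte sectors). The
function names are made up for illustration, and the "low bits only"
case simply shows what remains if the higher bits are dropped; it does
not model what mkisofs or isohybrid do with them.

```python
SECTOR_SIZE = 512                    # bytes per sector, as stated above
FIELD_BITS = 16                      # width of the image size field
MAX_SECTORS = (1 << FIELD_BITS) - 1  # largest value the field can hold


def sectors_needed(image_bytes: int) -> int:
    """Number of 512-byte sectors needed for an image, rounded up."""
    return (image_bytes + SECTOR_SIZE - 1) // SECTOR_SIZE


def fits_in_field(image_bytes: int) -> bool:
    """True if the sector count can be stored in the 16-bit field."""
    return sectors_needed(image_bytes) <= MAX_SECTORS


def low_bits_only(image_bytes: int) -> int:
    """Sector count that remains if only the low 16 bits are kept."""
    return sectors_needed(image_bytes) & MAX_SECTORS


if __name__ == "__main__":
    mib = 1024 * 1024
    print("largest representable size:", MAX_SECTORS * SECTOR_SIZE, "bytes")
    print("32 MiB fits:", fits_in_field(32 * mib))
    print("65 MiB image, low 16 bits:", low_bits_only(65 * mib), "sectors")
```

So the real ceiling is one sector short of 32 MiB, and an image with a
65 MiB initramfs inside would be recorded as something far smaller if
the upper bits are lost.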

> > As for the reproducibility, I've made changes to lorax (including
> > dropping rootfs.img layer), anaconda, pungi and createrepo and this all
> > allows to build bit-by-bit identical image, given the same input (rpm
> > packages, pungi configuration, $SOURCE_DATE_EPOCH variable[3]). Well,
> > almost - there is an issue with efiboot.img, but I already have a
> > solution, just not pushed it yet.
> >
> > You can find all the pull requests collected here:
> > https://github.com/QubesOS/qubes-installer-qubes-os/pull/26
> >
> > I'll work further to make the changes merged upstream.
> >
> > [1] https://reproducible-builds.org/
> > [2] 
> > https://github.com/QubesOS/qubes-installer-qubes-os/pull/26/commits/332be8e1e3e1006013772528078914f491d14c1f
> > [3] https://reproducible-builds.org/specs/source-date-epoch/
> 
> Cool! Well you've already done most of 

Re: Installation image layout

2018-10-11 Thread Chris Murphy
On Thu, Oct 11, 2018 at 6:37 PM, Marek Marczykowski-Górecki
 wrote:
> Hi all!
>
> I'm new on this list. I work on Qubes OS, where Fedora is used as a base
> distribution.
>
> While trying to build the installation image in reproducible manner[1],
> I found the current installation image have unusual layout. Quoting
> dracut.cmdline manual page:
>
>squashfs.img  |  Squashfs from LiveCD .iso downloaded via 
> network
>   !(mount)
>   /LiveOS
>   |- rootfs.img  |  Filesystem image to mount read-only
>!(mount)
>/bin  |  Live filesystem
>/boot |
>/dev  |
>...   |
>
> This rootfs.img layer makes the image build very much unreproducible.
> Why is it even there? Bare squashfs.img layer should be enough. Then,
> mount overlayfs over it (I see there is even some partial support for it
> in dmsquash-live). Most other Live systems I've seen use just squashfs +
> overlayfs (or aufs if kernel is older), so it's commonly tested
> configuration. I *guess* it's there for historical reason, from before
> aufs/overlayfs being available. Is there any other reason for that?

I'm pretty sure the original reason was that the default live install used
dd to block copy the root file system into the fedora-root LV, and
then resized the LV and ext4 file system. There have also been a
number of squashfs improvements since that decision so there might
have been limitations with squashfs that ext4 didn't have (I'm
thinking xattrs were supported in ext4 long before squashfs, and maybe
capabilities?)

>
> If there is no other reason, I propose to drop this and have
> installer/live filesystem directly in squashfs.img. This have multiple
> benefits:
>  - it's much easier to make the image build process reproducible (see
>below)
>  - less complexity, both in the build and in the boot (the whole
>dmsquash-live dracut module can be replaced with <20 line
>function[2]
>  - smaller initramfs (which is extremely important if needed to be
>included in efiboot.img, which can't be larger than 32MB)
>  - slightly faster boot time (device-mapper is slow)
>
> What do you think?

Whatever we do should take into account the persistent root and
persistent home use cases, specifically:
https://github.com/livecd-tools/livecd-tools/blob/master/tools/livecd-iso-to-disk.sh

--overlay-size-mb
--home-size-mb

A particular criticism of the device-mapper solution currently being
used is in that script: it blows up. Literally it's WORM, and deleting
files simply dereferences them, it doesn't free up pool space, so it
is inevitable that the pool will fill up, and when it does it blows up
the file system, and it can't be repaired. All you can do is reset the
overlay which means deleting all changes and starting over.

At least one of our spins, SOAS, depends on livecd-iso-to-disk for
creating their final installation because it's predicated on running
Fedora SOAS from a stick.

Why does efiboot.img have a 32MiB limit?


> As for the reproducibility, I've made changes to lorax (including
> dropping rootfs.img layer), anaconda, pungi and createrepo and this all
> allows to build bit-by-bit identical image, given the same input (rpm
> packages, pungi configuration, $SOURCE_DATE_EPOCH variable[3]). Well,
> almost - there is an issue with efiboot.img, but I already have a
> solution, just not pushed it yet.
>
> You can find all the pull requests collected here:
> https://github.com/QubesOS/qubes-installer-qubes-os/pull/26
>
> I'll work further to make the changes merged upstream.
>
> [1] https://reproducible-builds.org/
> [2] 
> https://github.com/QubesOS/qubes-installer-qubes-os/pull/26/commits/332be8e1e3e1006013772528078914f491d14c1f
> [3] https://reproducible-builds.org/specs/source-date-epoch/

Cool! Well you've already done most of the work and if this has
support elsewhere already then I'm in favor of continuing in that
direction.

I did give all of these things some thought a long time ago when I ran
into a lorax hack by Will Woods who used Btrfs as the root.img file
system, I'm not sure why it was used. But it gave me the idea of using
a few features built into Btrfs specifically for this use case:

- seed/sprout feature can be used with zram block device for volatile
overlay; and used with a blank partition on the stick for persistent
overlay. Discovery is part of the btrfs kernel code.

- Since metadata and data are always checksummed on every read, we
wouldn't have to depend on the slow and transient ISO checksum
(rd.live.check which uses checkisomd5) which likewise breaks when
creating a stick with livecd-iso-to-disk.

- Btrfs supports zstd compression. I did some testing and squashfs is
still a bit more efficient because it compresses fs metadata, whereas
Btrfs only compresses data extents.

The gotcha here is the resulting image isn't going to be bit for bit
reproducible: UUIDs and time