Re: [gentoo-dev] [GLEP78] Updating specification r2
Hi Ulrich,

Sorry, I don't know why the response I sent on September 13 didn't get forwarded by the mailing list, so I am writing it here again.

‐‐‐ Original Message ‐‐‐
On Thursday, September 23rd, 2021 at 06:30, Ulrich Mueller wrote:

> Since you haven't addressed my comments from the first round of review,
> I repeat them here:
>
> | Given that the outer archive is uncompressed tar, every file will be
> | zero-padded to a full block which adds some amount of bloat. So, could
> | the signature be inlined in the Manifest file? That's also what GLEP 74
> | specifies.

Using an inline signature makes sense, but it leads to another problem: we allow user-defined GPG commands, which gives us no control over exactly what format is generated or how to verify it. And I do not feel that hard-coding "--clear-sign" and "--detach-sign" into the commands is good practice. It is also a very limited space saver, probably at most 1 KB per package.

This specification only uses the Manifest DATA tag format from GLEP 74 (DATA ...) and its definition, so the inline signature does not apply here.

> |
> | Also, IIRC one of the goals of the format was to allow partial download
> | of metadata. That will only work if the Manifest file will be the first
> | file in the archive (or at least appear before the image archive).

The metadata signature is strictly required to be the next file after the metadata archive, so it can be used to verify the metadata without needing the Manifest. Although the specification says that non-standard order should be supported, this does not apply to remote fetches.

The biggest problem with moving the Manifest to the head is how to write it, since this file can only be created after all other operations have been completed. To do this, we would either have to store the other files in a temporary area and copy them into the binary package once the Manifest is created, which doubles the free space requirement (a particular concern for those who use tmpfs to get faster IO), or reserve space in the binary package container and overwrite it later. But since both the Manifest and the signature have variable size, how much space to reserve becomes an issue: if too little is reserved, the package manager needs to copy the whole package; if too much, a padding file has to be added.

Thanks,
Sheng Yu
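As a rough check on the padding argument in this message: in an uncompressed tar, each member costs one 512-byte header block plus its data rounded up to a multiple of 512 bytes. The sketch below computes that footprint and confirms it against Python's tarfile module. The helper names are made up for illustration, and it assumes plain ustar headers and short member names (longer names or pax headers would add further blocks).

```python
import io
import tarfile

BLOCK = 512  # tar block size in bytes


def member_footprint(payload_size: int) -> int:
    """Bytes one regular file occupies: header block plus zero-padded data."""
    data_blocks = -(-payload_size // BLOCK)  # ceiling division
    return BLOCK + data_blocks * BLOCK


def measured_footprint(payload: bytes) -> int:
    """Write two identical members and measure the distance between their headers."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name in ("first", "second"):
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    with tarfile.open(fileobj=buf, mode="r") as tar:
        first, second = tar.getmembers()
        return second.offset - first.offset


# A 300-byte payload stored as its own member takes 1024 bytes in the archive.
print(member_footprint(300), measured_footprint(b"x" * 300))
```

Under these assumptions a small detached signature stored as a separate member costs about 1 KiB, which is in line with the "at most 1 KB per package" estimate given in the reply.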
Re: [gentoo-dev] [GLEP78] Updating specification r2
> On Thu, 23 Sep 2021, Sheng Yu wrote:

> Hi,
> I attached second revision of the new draft of GLEP78 "Gentoo Binary
> Package Container Format"
> Please feel free to give any comments and suggestions.

Since you haven't addressed my comments from the first round of review,
I repeat them here:

| Given that the outer archive is uncompressed tar, every file will be
| zero-padded to a full block which adds some amount of bloat. So, could
| the signature be inlined in the Manifest file? That's also what GLEP 74
| specifies.
|
| Also, IIRC one of the goals of the format was to allow partial download
| of metadata. That will only work if the Manifest file will be the first
| file in the archive (or at least appear before the image archive).

Ulrich
[gentoo-dev] [GLEP78] Updating specification r2
Hi,

I have attached the second revision of the new draft of GLEP78 "Gentoo Binary Package Container Format".

Please feel free to give any comments and suggestions.

Thanks,
Sheng Yu

---
GLEP: 78
Title: Gentoo binary package container format
Author: Michał Górny
        Sheng Yu
Type: Standards Track
Status: Draft
Version: 1
Created: 2018-11-15
Last-Modified: 2021-09-22
Post-History: 2018-11-17, 2019-07-08, 2021-09-22
Content-Type: text/x-rst
---

Abstract
========

This GLEP proposes a new binary package container format for Gentoo. The current tbz2/XPAK format is briefly described, and its deficiencies are explained. Accordingly, the requirements for a new format are set and a gpkg format satisfying them is proposed. The rationale for the design decisions is provided.

Motivation
==========

The current Portage binary package format
-----------------------------------------

The historical ``.tbz2`` binary package format used by Portage is a concatenation of two distinct formats: header-oriented compressed .tar format (used to hold package files) and trailer-oriented custom XPAK format (used to hold metadata) [#MAN-XPAK]_. The format has already been extended incompatibly twice.

The first time, support for storing multiple successive builds of binary package for a single ebuild version has been added. This feature relies on appending an additional hyphen, followed by an integer, to the package filename. It is disabled by default (preserving backwards compatibility) and controlled by the ``binpkg-multi-instance`` feature.

The second time, support for additional compression formats has been added. When a format other than bzip2 is used, the ``.tbz2`` suffix is replaced by ``.xpak`` and Portage relies on magic bytes to detect the compression used. For backwards compatibility, Portage still defaults to using bzip2; the compression program can be switched using the ``BINPKG_COMPRESS`` configuration variable.

Additionally, there have been minor changes to the stored metadata and file storage policies. In particular, behavior regarding ``INSTALL_MASK``, controllable file compression and stripping has changed over time.

The advantages of tbz2/XPAK format
----------------------------------

The tbz2/XPAK format used by Portage has three interesting features:

1. **Each binary package is fully contained within a single file.** While this might seem unnecessary, it makes it easier for the user to transfer binary packages without having to be concerned about finding all the necessary files to transfer.

2. **The binary packages are compatible with regular compressed tarballs, most of the time.** With notable exceptions of historical versions of pbzip2 and the recent zstd compressor, tbz2/XPAK packages can be extracted using regular tar utility with a compressor implementation that discards trailing garbage.

3. **The metadata is uncompressed, and can be efficiently accessed without decompressing package contents.** This includes the possibility of rewriting it (e.g. as a result of package moves) without the necessity of repacking the files.

Transparency problem with the current binary package format
-----------------------------------------------------------

Notwithstanding its advantages, the tbz2/XPAK format has a significant design fault that consists of two issues:

1. **The XPAK format is a custom binary format with explicit use of binary-encoded file offsets and field lengths.** As such, it is non-trivial to read or edit without specialized tools. Such tools are currently implemented separately from the package manager, as part of the portage-utils toolkit, written in C [#PORTAGE-UTILS]_.

2. **The tarball compatibility feature relies on the obscure feature of ignoring trailing garbage in compressed files**. While this is implemented consistently in most of the compressors, this feature is not really a part of specification but rather traditional behavior. Given that the original reasons for this no longer apply, new compressor implementations are likely to miss support for this.

Both of the issues make the format hard to use without dedicated tools, or when the tools misbehave. This impacts the following scenarios:

A. **Using binary packages for system recovery.** In case of serious breakage, it is really preferable that the format depends on as few tools as possible, and especially not on Gentoo-specific tools.

B. **Inspecting binary packages in detail exceeding standard package manager facilities.**

C. **Modifying binary packages in ways not predicted by the package manager authors.** A real-life example of this is working around broken ``pkg_*`` phases which prevent the package from being installed.

OpenPGP extensibility problem
-----------------------------

There are at least three obvious ways in which the current format could be extended to support OpenPGP signatures, and each of them has its own distinct problem:

1. **Adding a
Re: [gentoo-dev] [GLEP78] Updating specification
On Monday, September 13th, 2021 at 18:04, Rich Freeman wrote:

> On Mon, Sep 13, 2021 at 5:02 PM Michał Górny wrote:
> >
> > On Mon, 2021-09-13 at 12:08 +0200, Ulrich Mueller wrote:
> > >
> > > Also, IIRC one of the goals of the format was to allow partial
> > > download of metadata. That will only work if the Manifest file will
> > > be the first file in the archive (or at least appear before the
> > > image archive).
> >
> > I disagree. This is solved by having detached metadata signature -- you
> > can do a partial fetch and verify the metadata directly.
>
> Another option I've tossed out there in the past is having a content
> hash of the metadata and putting that in the filename. That obviously
> won't tell you anything about the contents of the file without reading
> it, but if you're looking for a file with specific metadata you could
> predict its filename. This was intended to work with having multiple
> hashes for the same file using subsets of the metadata, using symbolic
> links.
>
> The thinking here is that you'd just hash a subset of metadata useful
> for identifying what file you'd want to download, such as CHOST,
> linked dependency versions, use flags, etc. You'd probably hash it
> with/without stuff like use flags so that you could either take a shot
> at getting the file exactly configured how you want, or accepting a
> version with any set of flags.
>
> Of course, this idea goes in direct opposition to your statement about
> not wanting to specify the filename. I get that argument. The intent
> here was to allow portage to go hunting through trusted repositories
> to find packages it can use without having to sync a lot of data - if
> you know the exact filename then a simple GET tells you if it is there
> or not.

Interesting concept, although this should be handled as part of binpkg-multi-instance: a predictable configuration hash, rather than relying on an index to tell the variants apart. Something like:

bar/foo-1.0-r2-e3b0c44298fc1c149afbf4c8996fb9.gpkg.tar

Thanks,
Sheng Yu
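To make the predictable-filename idea concrete, here is a minimal sketch. Nothing in the thread fixes the details, so the metadata keys (CHOST, USE), the canonical "key=value" serialization, the use of SHA-256, and the truncation to 30 hex digits are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import hashlib


def config_hash(metadata, keys, length=30):
    """Hash a chosen subset of metadata into a short, predictable hex string."""
    # Sort the keys so the result does not depend on the order they are given in.
    canonical = "\n".join(f"{key}={metadata.get(key, '')}" for key in sorted(keys))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]


def predicted_filename(category, pf, metadata, keys):
    """Build a package path whose suffix depends only on the selected metadata."""
    return f"{category}/{pf}-{config_hash(metadata, keys)}.gpkg.tar"


meta = {"CHOST": "x86_64-pc-linux-gnu", "USE": "ssl -gtk"}

# One name for "exactly this configuration", one for "any USE flags".
exact = predicted_filename("bar", "foo-1.0-r2", meta, ["CHOST", "USE"])
any_use = predicted_filename("bar", "foo-1.0-r2", meta, ["CHOST"])
print(exact)
print(any_use)
```

A client that knows the configuration it wants can compute the same name locally and issue a single GET for it. The "with and without USE flags" variants described by Rich would then be separate names (for example symbolic links) pointing at the same file.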
Re: [gentoo-dev] [GLEP78] Updating specification
‐‐‐ Original Message ‐‐‐
On Monday, September 13th, 2021 at 17:02, Michał Górny wrote:

> On Mon, 2021-09-13 at 12:08 +0200, Ulrich Mueller wrote:
> > > On Mon, 13 Sep 2021, Sheng Yu wrote:
> > >
> > > -The archive contains a number of files, stored in a single directory
> > > -whose name should match the basename of the package file. However,
> > > -the implementation must be able to process an archive where
> > > -the directory name is mismatched. There should be no explicit archive
> > > -member entry for the directory.
> > > +The archive contains a number of files. All package-related files
> > > +should be stored in a single directory whose name matches the CPV of
> > > +the package file. However, the implementation must be able to process
> > > +an archive where the directory name is mismatched. There should be no
> > > +explicit archive member entry for the directory.
> >
> > I wonder about CPV here. That's ${CATEGORY}/${P} and contains a slash,
> > so it cannot be the name of a directory. Also, what about the package
> > revision?
>
> Please restore the previous wording. The GLEP deliberately did not
> enforce a specific filename because it's about internal format.

Got it, but maybe we need to add a requirement for human readability, since users should not have to check the data within the metadata.

> > > +6. The package manifest data file ``Manifest`` (required).
> > > +
> > > +7. A signature for the package Manifest file ``Manifest.sig``
> > > +   (optional).
> >
> > Given that the outer archive is uncompressed tar, every file will be
> > zero-padded to a full block which adds some amount of bloat. So, could
> > the signature be inlined in the Manifest file? That's also what GLEP 74
> > specifies.
>
> Using inline signature in Manifest makes sense.

This makes sense but leads to another problem: we allow user-defined GPG commands, which gives us no control over exactly what format is generated. And I do not feel that hard-coding "--clear-sign" and "--detach-sign" is good practice.

> > Also, IIRC one of the goals of the format was to allow partial
> > download of metadata. That will only work if the Manifest file will be
> > the first file in the archive (or at least appear before the image
> > archive).
>
> I disagree. This is solved by having detached metadata signature -- you
> can do a partial fetch and verify the metadata directly.
>
> On the other hand, putting Manifest first would make it impossible to
> create the archive from data stream without using temporary files,
> effectively doubling the needed free space. Well, technically you could
> just reserve space and write Manifest later but that would strongly
> depend on the size of PGP signature and that's not something I'd feel
> comfortable relying on.

Reserving space also wastes extra space and needs a padding file.

Thanks,
Sheng Yu
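The trade-off with reserving space can be sketched as a small decision function. This is only an illustration under stated assumptions: the reserved slot is taken to be a whole number of 512-byte tar blocks, the Manifest is assumed to need one header block plus padded data, and the function names and outcome labels are hypothetical.

```python
BLOCK = 512  # tar block size in bytes


def footprint(size):
    """Header block plus data rounded up to whole blocks."""
    return BLOCK + -(-size // BLOCK) * BLOCK


def plan_reserved_slot(reserved, manifest_size):
    """Decide what happens when the Manifest is written into a reserved slot.

    Returns a (decision, byte_count) pair:
      ("rewrite", n): slot is n bytes too small, the whole package must be copied
      ("fits", 0):    the Manifest fills the slot exactly
      ("pad", n):     n bytes are left over and need a padding member
    """
    needed = footprint(manifest_size)
    if needed > reserved:
        return ("rewrite", needed - reserved)
    if needed == reserved:
        return ("fits", 0)
    return ("pad", reserved - needed)


print(plan_reserved_slot(2048, 1600))  # too small by one block
print(plan_reserved_slot(4096, 1200))  # two kilobytes left to pad
```

Because the final Manifest size (and the size of any signature) is only known after everything else has been written, only the exact-fit case avoids both the copy and the padding member.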
Re: [gentoo-dev] [GLEP78] Updating specification
On Mon, Sep 13, 2021 at 5:02 PM Michał Górny wrote:
>
> On Mon, 2021-09-13 at 12:08 +0200, Ulrich Mueller wrote:
> >
> > Also, IIRC one of the goals of the format was to allow partial
> > download of metadata. That will only work if the Manifest file will be
> > the first file in the archive (or at least appear before the image
> > archive).
>
> I disagree. This is solved by having detached metadata signature -- you
> can do a partial fetch and verify the metadata directly.
>

Another option I've tossed out there in the past is having a content
hash of the metadata and putting that in the filename. That obviously
won't tell you anything about the contents of the file without reading
it, but if you're looking for a file with specific metadata you could
predict its filename. This was intended to work with having multiple
hashes for the same file using subsets of the metadata, using symbolic
links.

The thinking here is that you'd just hash a subset of metadata useful
for identifying what file you'd want to download, such as CHOST,
linked dependency versions, use flags, etc. You'd probably hash it
with/without stuff like use flags so that you could either take a shot
at getting the file exactly configured how you want, or accepting a
version with any set of flags.

Of course, this idea goes in direct opposition to your statement about
not wanting to specify the filename. I get that argument. The intent
here was to allow portage to go hunting through trusted repositories
to find packages it can use without having to sync a lot of data - if
you know the exact filename then a simple GET tells you if it is there
or not.

--
Rich
Re: [gentoo-dev] [GLEP78] Updating specification
On Mon, 2021-09-13 at 12:08 +0200, Ulrich Mueller wrote:
> > On Mon, 13 Sep 2021, Sheng Yu wrote:
> >
> > -The archive contains a number of files, stored in a single directory
> > -whose name should match the basename of the package file. However,
> > -the implementation must be able to process an archive where
> > -the directory name is mismatched. There should be no explicit archive
> > -member entry for the directory.
> > +The archive contains a number of files. All package-related files
> > +should be stored in a single directory whose name matches the CPV of
> > +the package file. However, the implementation must be able to process
> > +an archive where the directory name is mismatched. There should be no
> > +explicit archive member entry for the directory.
>
> I wonder about CPV here. That's ${CATEGORY}/${P} and contains a slash,
> so it cannot be the name of a directory. Also, what about the package
> revision?

Please restore the previous wording. The GLEP deliberately did not
enforce a specific filename because it's about internal format.

> > +6. The package manifest data file ``Manifest`` (required).
> > +
> > +7. A signature for the package Manifest file ``Manifest.sig``
> > +   (optional).
>
> Given that the outer archive is uncompressed tar, every file will be
> zero-padded to a full block which adds some amount of bloat. So, could
> the signature be inlined in the Manifest file? That's also what GLEP 74
> specifies.

Using inline signature in Manifest makes sense.

> Also, IIRC one of the goals of the format was to allow partial
> download of metadata. That will only work if the Manifest file will be
> the first file in the archive (or at least appear before the image
> archive).

I disagree. This is solved by having detached metadata signature -- you
can do a partial fetch and verify the metadata directly.

On the other hand, putting Manifest first would make it impossible to
create the archive from data stream without using temporary files,
effectively doubling the needed free space. Well, technically you could
just reserve space and write Manifest later but that would strongly
depend on the size of PGP signature and that's not something I'd feel
comfortable relying on.

--
Best regards,
Michał Górny
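The partial-fetch argument can be illustrated with a sequential tar reader that stops as soon as it has the metadata archive and the signature that follows it. This is a sketch, not Portage's implementation: the member names (``gpkg-1``, ``metadata.tar``, ``metadata.tar.sig``, ``image.tar``, ``Manifest``) and the helper names are assumptions, and an in-memory buffer stands in for a network stream.

```python
import io
import tarfile


def build_demo_package():
    """Build a stand-in package in memory (member names are assumptions)."""
    members = [
        ("foo-1.0/gpkg-1", b""),
        ("foo-1.0/metadata.tar", b"metadata bytes"),
        ("foo-1.0/metadata.tar.sig", b"signature bytes"),
        ("foo-1.0/image.tar", b"\0" * 200_000),
        ("foo-1.0/Manifest", b"manifest bytes"),
    ]
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name, data in members:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


def read_metadata_prefix(stream):
    """Collect the metadata archive and its signature, then stop reading."""
    found = {}
    # "r|" reads the tar strictly sequentially, the way a download arrives.
    with tarfile.open(fileobj=stream, mode="r|") as tar:
        for member in tar:
            base = member.name.rsplit("/", 1)[-1]
            if base.startswith("metadata.tar"):
                found[base] = tar.extractfile(member).read()
            if len(found) == 2:
                break  # archive and signature both seen; skip the rest
    return found


package = build_demo_package()
prefix = read_metadata_prefix(package)
print(sorted(prefix), package.tell(), len(package.getvalue()))
```

The reader never reaches the image archive or the Manifest, so only the first read buffer of the stream is consumed. Verifying the signature itself is left out here because it depends on the configured GPG command.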
[gentoo-dev] [GLEP78] Updating specification
Hi,

We are updating the draft of GLEP78 "Gentoo Binary Package Container Format" to address some security issues, and have included some new designs for this purpose. The new draft and a diff against the old one are attached.

Please feel free to give any comments and suggestions.

Thanks,
Sheng Yu

---
GLEP: 78
Title: Gentoo binary package container format
Author: Michał Górny
        Sheng Yu
Type: Standards Track
Status: Draft
Version: 1
Created: 2018-11-15
Last-Modified: 2021-09-13
Post-History: 2018-11-17, 2019-07-08, 2021-09-13
Content-Type: text/x-rst
---

Abstract
========

This GLEP proposes a new binary package container format for Gentoo. The current tbz2/XPAK format is briefly described, and its deficiencies are explained. Accordingly, the requirements for a new format are set and a gpkg format satisfying them is proposed. The rationale for the design decisions is provided.

Motivation
==========

The current Portage binary package format
-----------------------------------------

The historical ``.tbz2`` binary package format used by Portage is a concatenation of two distinct formats: header-oriented compressed .tar format (used to hold package files) and trailer-oriented custom XPAK format (used to hold metadata) [#MAN-XPAK]_. The format has already been extended incompatibly twice.

The first time, support for storing multiple successive builds of binary package for a single ebuild version has been added. This feature relies on appending an additional hyphen, followed by an integer, to the package filename. It is disabled by default (preserving backwards compatibility) and controlled by the ``binpkg-multi-instance`` feature.

The second time, support for additional compression formats has been added. When a format other than bzip2 is used, the ``.tbz2`` suffix is replaced by ``.xpak`` and Portage relies on magic bytes to detect the compression used. For backwards compatibility, Portage still defaults to using bzip2; the compression program can be switched using the ``BINPKG_COMPRESS`` configuration variable.

Additionally, there have been minor changes to the stored metadata and file storage policies. In particular, behavior regarding ``INSTALL_MASK``, controllable file compression and stripping has changed over time.

The advantages of tbz2/XPAK format
----------------------------------

The tbz2/XPAK format used by Portage has three interesting features:

1. **Each binary package is fully contained within a single file.** While this might seem unnecessary, it makes it easier for the user to transfer binary packages without having to be concerned about finding all the necessary files to transfer.

2. **The binary packages are compatible with regular compressed tarballs, most of the time.** With notable exceptions of historical versions of pbzip2 and the recent zstd compressor, tbz2/XPAK packages can be extracted using regular tar utility with a compressor implementation that discards trailing garbage.

3. **The metadata is uncompressed, and can be efficiently accessed without decompressing package contents.** This includes the possibility of rewriting it (e.g. as a result of package moves) without the necessity of repacking the files.

Transparency problem with the current binary package format
-----------------------------------------------------------

Notwithstanding its advantages, the tbz2/XPAK format has a significant design fault that consists of two issues:

1. **The XPAK format is a custom binary format with explicit use of binary-encoded file offsets and field lengths.** As such, it is non-trivial to read or edit without specialized tools. Such tools are currently implemented separately from the package manager, as part of the portage-utils toolkit, written in C [#PORTAGE-UTILS]_.

2. **The tarball compatibility feature relies on the obscure feature of ignoring trailing garbage in compressed files**. While this is implemented consistently in most of the compressors, this feature is not really a part of specification but rather traditional behavior. Given that the original reasons for this no longer apply, new compressor implementations are likely to miss support for this.

Both of the issues make the format hard to use without dedicated tools, or when the tools misbehave. This impacts the following scenarios:

A. **Using binary packages for system recovery.** In case of serious breakage, it is really preferable that the format depends on as few tools as possible, and especially not on Gentoo-specific tools.

B. **Inspecting binary packages in detail exceeding standard package manager facilities.**

C. **Modifying binary packages in ways not predicted by the package manager authors.** A real-life example of this is working around broken ``pkg_*`` phases which prevent the package from being installed.

OpenPGP extensibility problem
-----------------------------

There are at least three obvious ways in
Re: [gentoo-dev] [GLEP78] Updating specification
> On Mon, 13 Sep 2021, Sheng Yu wrote:

> -The archive contains a number of files, stored in a single directory
> -whose name should match the basename of the package file. However,
> -the implementation must be able to process an archive where
> -the directory name is mismatched. There should be no explicit archive
> -member entry for the directory.
> +The archive contains a number of files. All package-related files
> +should be stored in a single directory whose name matches the CPV of
> +the package file. However, the implementation must be able to process
> +an archive where the directory name is mismatched. There should be no
> +explicit archive member entry for the directory.

I wonder about CPV here. That's ${CATEGORY}/${P} and contains a slash,
so it cannot be the name of a directory. Also, what about the package
revision?

> +6. The package manifest data file ``Manifest`` (required).
> +
> +7. A signature for the package Manifest file ``Manifest.sig``
> +   (optional).

Given that the outer archive is uncompressed tar, every file will be
zero-padded to a full block which adds some amount of bloat. So, could
the signature be inlined in the Manifest file? That's also what GLEP 74
specifies.

Also, IIRC one of the goals of the format was to allow partial download
of metadata. That will only work if the Manifest file will be the first
file in the archive (or at least appear before the image archive).

> +The implementation follows the Manifest specifications in GLEP 74
> +[#GLEP74]_ and uses the DATA tag for files within the archive.

AFAICS, GLEP 74 specifies an OpenPGP cleartext signature in the file
itself, not a detached signature.

Ulrich
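For readers unfamiliar with the DATA tag mentioned in the quoted draft, the sketch below generates one Manifest line per archive member. It assumes the GLEP 74 entry layout of tag, path, size, then pairs of hash name and hex digest. The choice of BLAKE2B and SHA512, the member name, and the function name are assumptions for illustration, not something the quoted draft text confirms.

```python
import hashlib


def data_entry(path, payload):
    """Format one Manifest line: tag, path, size, then hash name/value pairs."""
    blake2b = hashlib.blake2b(payload).hexdigest()
    sha512 = hashlib.sha512(payload).hexdigest()
    return f"DATA {path} {len(payload)} BLAKE2B {blake2b} SHA512 {sha512}"


# Hypothetical member name and contents, just to show the shape of a line.
line = data_entry("metadata.tar", b"demo")
fields = line.split(" ")
print(fields[0], fields[1], fields[2], fields[3], fields[5])
```

Whether the signature over such a file is detached (a separate ``Manifest.sig`` member) or an OpenPGP cleartext signature wrapped around these lines is exactly the point under discussion in this message. The entries themselves look the same either way.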