Re: [gentoo-dev] [GLEP78] Updating specification r2

2021-09-23 Thread Sheng Yu
Hi Ulrich,

Sorry, I don't know why the response I sent on September 13 wasn't
forwarded by the mailing list, so I am writing it here again.

‐‐‐ Original Message ‐‐‐

On Thursday, September 23rd, 2021 at 06:30, Ulrich Mueller wrote:

> Since you haven't addressed my comments from the first round of review,
> I repeat them here:
>
> | Given that the outer archive is uncompressed tar, every file will be
> | zero-padded to a full block which adds some amount of bloat. So, could
> | the signature be inlined in the Manifest file? That's also what GLEP 74
> | specifies.

Using an inline signature makes sense but leads to another problem: we allow
user-defined GPG commands, which gives us no control over exactly what
format is generated or how to verify it. And I do not think hard-coding
"--clear-sign" or "--detach-sign" into the commands is good practice.

Also, the space saving is very limited, probably 1 KiB per package at most.
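
To make the padding cost concrete, here is a minimal Python sketch of what one extra member costs in an uncompressed tar archive. It assumes the usual 512-byte tar block and a short member name (so a single header block); the 438-byte signature size is only an illustrative value, not a measured one.

```python
import io
import tarfile

BLOCK = 512  # tar block size in bytes


def member_cost(payload_size: int) -> int:
    """Bytes one tar member occupies: one header block plus zero-padded data."""
    data_blocks = -(-payload_size // BLOCK)  # ceiling division
    return BLOCK + data_blocks * BLOCK


def measured_cost(payload_size: int) -> int:
    """Cross-check the arithmetic by writing one member with tarfile in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        info = tarfile.TarInfo("Manifest.sig")
        info.size = payload_size
        tar.addfile(info, io.BytesIO(b"\0" * payload_size))
        cost = buf.tell()  # offset before the end-of-archive blocks are written
    return cost


if __name__ == "__main__":
    # Hypothetical signature size, chosen only for illustration.
    print(member_cost(438), measured_cost(438))
```

Under these assumptions, any signature of up to 512 bytes stored as its own member occupies two blocks, which is where a figure of about 1 KiB comes from.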

This specification only uses the Manifest DATA tag format from GLEP 74:
DATA   ...
and its definition, so the inlined signature does not apply here.


> |
> | Also, IIRC one of the goals of the format was to allow partial download
> | of metadata. That will only work if the Manifest file will be the first
> | file in the archive (or at least appear before the image archive).

The metadata signature is strictly required to be the next file after the
metadata archive, so it can be used to verify the metadata without needing
the Manifest. Although the specification says that non-standard order should
be supported, this does not apply to remote fetches.

The biggest problem with moving the Manifest to the head is how to write it,
since this file can only be created after all other operations have been
completed.

To do this, we either have to store the other files in a temporary area and
copy them into the binary package once the Manifest is created, which doubles
the free space requirement (especially painful for those who use tmpfs to get
faster IO), or reserve space in the binary package container and overwrite it
later. But since both the Manifest and signature sizes are variable, how much
space to reserve becomes an issue: if it is too small, the package manager
needs to copy the whole package; if it is too big, a padding file has to be
added.
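
The ordering constraint can be sketched in Python as below: members are written to the output stream one by one, their checksums are collected along the way, and the Manifest can only be emitted once the last member is known. The line layout (DATA, path, size, then a hash name and value) follows the general shape of GLEP 74 entries as I understand it; treat the exact layout, the single SHA512 hash and the helper names as assumptions.

```python
import hashlib
import io
import tarfile


def add_member(tar, name, payload, manifest_lines):
    """Append one member to the stream and record its Manifest entry."""
    info = tarfile.TarInfo(name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))
    digest = hashlib.sha512(payload).hexdigest()
    manifest_lines.append(f"DATA {name} {len(payload)} SHA512 {digest}")


def build_package(out, members):
    """Write all members to `out` as a tar stream, with the Manifest last."""
    lines = []
    with tarfile.open(fileobj=out, mode="w|") as tar:
        for name, payload in members:
            add_member(tar, name, payload, lines)
        # Only now are all sizes and hashes known, so the Manifest goes last.
        manifest = ("\n".join(lines) + "\n").encode()
        info = tarfile.TarInfo("Manifest")
        info.size = len(manifest)
        tar.addfile(info, io.BytesIO(manifest))
    return lines
```

Writing the Manifest first with this approach would require either buffering every member before output starts or going back to patch a reserved region, which are the two options discussed above.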


Thanks,
Sheng Yu



Re: [gentoo-dev] [GLEP78] Updating specification r2

2021-09-23 Thread Ulrich Mueller
> On Thu, 23 Sep 2021, Sheng Yu wrote:

> Hi,
> I attached second revision of the new draft of GLEP78 "Gentoo Binary
> Package Container Format"

> Please feel free to give any comments and suggestions.

Since you haven't addressed my comments from the first round of review,
I repeat them here:

| Given that the outer archive is uncompressed tar, every file will be
| zero-padded to a full block which adds some amount of bloat. So, could
| the signature be inlined in the Manifest file? That's also what GLEP 74
| specifies.
|
| Also, IIRC one of the goals of the format was to allow partial download
| of metadata. That will only work if the Manifest file will be the first
| file in the archive (or at least appear before the image archive).

Ulrich




[gentoo-dev] [GLEP78] Updating specification r2

2021-09-22 Thread Sheng Yu
Hi,

I attached second revision of the new draft of GLEP78 "Gentoo Binary
Package Container Format"

Please feel free to give any comments and suggestions.

Thanks,
Sheng Yu

---
GLEP: 78
Title: Gentoo binary package container format
Author: Michał Górny,
        Sheng Yu
Type: Standards Track
Status: Draft
Version: 1
Created: 2018-11-15
Last-Modified: 2021-09-22
Post-History: 2018-11-17, 2019-07-08, 2021-09-22
Content-Type: text/x-rst
---

Abstract
========

This GLEP proposes a new binary package container format for Gentoo.
The current tbz2/XPAK format is briefly described, and its deficiencies
are explained.  Accordingly, the requirements for a new format are set
and a gpkg format satisfying them is proposed.  The rationale for
the design decisions is provided.


Motivation
==========

The current Portage binary package format
-----------------------------------------

The historical ``.tbz2`` binary package format used by Portage is
a concatenation of two distinct formats: header-oriented compressed .tar
format (used to hold package files) and trailer-oriented custom XPAK
format (used to hold metadata)  [#MAN-XPAK]_.  The format has already
been extended incompatibly twice.

The first time, support for storing multiple successive builds of a binary
package for a single ebuild version was added.  This feature relies
on appending an additional hyphen followed by an integer to the package
filename.  It is disabled by default (preserving backwards
compatibility) and controlled by the ``binpkg-multi-instance`` feature.

The second time, support for additional compression formats was
added.  When a format other than bzip2 is used, the ``.tbz2`` suffix
is replaced by ``.xpak`` and Portage relies on magic bytes to detect
the compression used.  For backwards compatibility, Portage still defaults
to using bzip2; the compression program can be switched using the
``BINPKG_COMPRESS`` configuration variable.

Additionally, there have been minor changes to the stored metadata
and file storage policies.  In particular, behavior regarding
``INSTALL_MASK``, controllable file compression and stripping has
changed over time.


The advantages of tbz2/XPAK format
----------------------------------

The tbz2/XPAK format used by Portage has three interesting features:

1. **Each binary package is fully contained within a single file.**
   While this might seem unnecessary, it makes it easier for the user
   to transfer binary packages without having to be concerned about
   finding all the necessary files to transfer.

2. **The binary packages are compatible with regular compressed
   tarballs, most of the time.**  With the notable exceptions of historical
   versions of pbzip2 and the recent zstd compressor, tbz2/XPAK packages
   can be extracted using the regular tar utility with a compressor
   implementation that discards trailing garbage.

3. **The metadata is uncompressed, and can be efficiently accessed
   without decompressing package contents.**  This includes
   the possibility of rewriting it (e.g. as a result of package moves)
   without the necessity of repacking the files.


Transparency problem with the current binary package format
-----------------------------------------------------------

Notwithstanding its advantages, the tbz2/XPAK format has a significant
design fault that consists of two issues:

1. **The XPAK format is a custom binary format with explicit use
   of binary-encoded file offsets and field lengths.**  As such, it is
   non-trivial to read or edit without specialized tools.  Such tools
   are currently implemented separately from the package manager,
   as part of the portage-utils toolkit, written in C [#PORTAGE-UTILS]_.

2. **The tarball compatibility feature relies on the obscure feature of
   ignoring trailing garbage in compressed files.**  While this is
   implemented consistently in most compressors, this feature
   is not really part of the specifications but rather traditional
   behavior.  Given that the original reasons for this no longer apply,
   new compressor implementations are likely to miss support for this.

Both of the issues make the format hard to use without dedicated tools,
or when the tools misbehave.  This impacts the following scenarios:

A. **Using binary packages for system recovery.**  In case of serious
   breakage, it is really preferable that the format depends on as few
   tools as possible, and especially not on Gentoo-specific tools.

B. **Inspecting binary packages in detail exceeding standard package
   manager facilities.**

C. **Modifying binary packages in ways not predicted by the package
   manager authors.**  A real-life example of this is working around
   broken ``pkg_*`` phases which prevent the package from being
   installed.


OpenPGP extensibility problem
-----------------------------

There are at least three obvious ways in which the current format could
be extended to support OpenPGP signatures, and each of them has its own
distinct problem:

1. **Adding a 

Re: [gentoo-dev] [GLEP78] Updating specification

2021-09-13 Thread Sheng Yu
On Monday, September 13th, 2021 at 18:04, Rich Freeman  wrote:
>
> On Mon, Sep 13, 2021 at 5:02 PM Michał Górny  wrote:
> >
> > On Mon, 2021-09-13 at 12:08 +0200, Ulrich Mueller wrote:
> > >
> > > Also, IIRC one of the goals of the format was to allow partial
> > > download
> > > of metadata. That will only work if the Manifest file will be the
> > > first
> > > file in the archive (or at least appear before the image archive).
> >
> > I disagree.  This is solved by having detached metadata signature -- you
> > can do a partial fetch and verify the metadata directly.
> >
>
> Another option I've tossed out there in the past is having a content
> hash of the metadata and putting that in the filename.  That obviously
> won't tell you anything about the contents of the file without reading
> it, but if you're looking for a file with specific metadata you could
> predict its filename.  This was intended to work with having multiple
> hashes for the same file using subsets of the metadata, using symbolic
> links.
>
> The thinking here is that you'd just hash a subset of metadata useful
> for identifying what file you'd want to download, such as CHOST,
> linked dependency versions, use flags, etc.  You'd probably hash it
> with/without stuff like use flags so that you could either take a shot
> at getting the file exactly configured how you want, or accepting a
> version with any set of flags.
>
> Of course, this idea goes in direct opposition to your statement about
> not wanting to specify the filename.  I get that argument.  The intent
> here was to allow portage to go hunting through trusted repositories
> to find packages it can use without having to sync a lot of data - if
> you know the exact filename then a simple GET tells you if it is there
> or not.

Interesting concept, although this should be counted as part of
binpkg-multi-instance: a predictable configuration hash in the filename,
rather than relying on the index to tell the variants apart.

Something like:
bar/foo-1.0-r2-e3b0c44298fc1c149afbf4c8996fb9.gpkg.tar
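
As a rough illustration of how such a name could be derived, here is a Python sketch. Everything in it is an assumption for discussion: the choice of SHA-256, the key=value canonical form, the truncation to 30 hex digits (matching the length of the example above), and the function names.

```python
import hashlib


def config_hash(metadata, keys, length=30):
    """Hash a chosen subset of metadata keys in a stable, sorted order."""
    canonical = "\n".join(f"{key}={metadata[key]}" for key in sorted(keys))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]


def binpkg_name(category, pf, metadata, keys):
    """Build a predictable package path from the configuration hash."""
    return f"{category}/{pf}-{config_hash(metadata, keys)}.gpkg.tar"


if __name__ == "__main__":
    meta = {"CHOST": "x86_64-pc-linux-gnu", "USE": "ssl -debug"}
    # One name for the exact configuration, one that ignores USE flags.
    print(binpkg_name("bar", "foo-1.0-r2", meta, ["CHOST", "USE"]))
    print(binpkg_name("bar", "foo-1.0-r2", meta, ["CHOST"]))
```

Hashing different key subsets gives the "with or without USE flags" variants mentioned in the quoted message; a client that knows the configuration it wants can compute the name and issue a single GET.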

Thanks,
Sheng Yu




Re: [gentoo-dev] [GLEP78] Updating specification

2021-09-13 Thread Sheng Yu
‐‐‐ Original Message ‐‐‐

On Monday, September 13th, 2021 at 17:02, Michał Górny wrote:
> On Mon, 2021-09-13 at 12:08 +0200, Ulrich Mueller wrote:
> > > > > > > On Mon, 13 Sep 2021, Sheng Yu wrote:
> >
> > > -The archive contains a number of files, stored in a single
> > > directory
> > > -whose name should match the basename of the package file.  However,
> > > -the implementation must be able to process an archive where
> > > -the directory name is mismatched.  There should be no explicit
> > > archive
> > > -member entry for the directory.
> > > +The archive contains a number of files.  All package-related files
> > > +should be stored in a single directory whose name matches the CPV
> > > of
> > > +the package file.  However, the implementation must be able to
> > > process
> > > +an archive where the directory name is mismatched.  There should be
> > > no
> > > +explicit archive member entry for the directory.
> >
> > I wonder about CPV here. That's ${CATEGORY}/${P} and contains a slash,
> > so it cannot be the name of a directory. Also, what about the package
> > revision?
>
> Please restore the previous wording.  The GLEP deliberately did not
> enforce a specific filename because it's about internal format.

Got it, but maybe we need to add a requirement for human readability,
since users should not have to check the data within the metadata.

> >
> > > +6. The package manifest data file ``Manifest`` (required).
> > > +
> > > +7. A signature for the package Manifest file ``Manifest.sig``
> > > +   (optional).
> >
> > Given that the outer archive is uncompressed tar, every file will be
> > zero-padded to a full block which adds some amount of bloat. So, could
> > the signature be inlined in the Manifest file? That's also what GLEP
> > 74
> > specifies.
>
> Using inline signature in Manifest makes sense.

This makes sense but leads to another problem: we allow user-defined
GPG commands, which gives us no control over exactly what format is
generated. And I do not think hard-coding "--clear-sign" or "--detach-sign"
is good practice.

> >
> > Also, IIRC one of the goals of the format was to allow partial
> > download
> > of metadata. That will only work if the Manifest file will be the
> > first
> > file in the archive (or at least appear before the image archive).
>
> I disagree.  This is solved by having detached metadata signature -- you
> can do a partial fetch and verify the metadata directly.
>
> On the other hand, putting Manifest first would make it impossible to
> create the archive from data stream without using temporary files,
> effectively doubling the needed free space.  Well, technically you could
> just reserve space and write Manifest later but that would strongly
> depend on the size of PGP signature and that's not something I'd feel
> comfortable relying on.
>

Reserving space also wastes extra space and requires a padding file.

Thanks,
Sheng Yu



Re: [gentoo-dev] [GLEP78] Updating specification

2021-09-13 Thread Rich Freeman
On Mon, Sep 13, 2021 at 5:02 PM Michał Górny  wrote:
>
> On Mon, 2021-09-13 at 12:08 +0200, Ulrich Mueller wrote:
> >
> > Also, IIRC one of the goals of the format was to allow partial
> > download
> > of metadata. That will only work if the Manifest file will be the
> > first
> > file in the archive (or at least appear before the image archive).
>
> I disagree.  This is solved by having detached metadata signature -- you
> can do a partial fetch and verify the metadata directly.
>

Another option I've tossed out there in the past is having a content
hash of the metadata and putting that in the filename.  That obviously
won't tell you anything about the contents of the file without reading
it, but if you're looking for a file with specific metadata you could
predict its filename.  This was intended to work with having multiple
hashes for the same file using subsets of the metadata, using symbolic
links.

The thinking here is that you'd just hash a subset of metadata useful
for identifying what file you'd want to download, such as CHOST,
linked dependency versions, use flags, etc.  You'd probably hash it
with/without stuff like use flags so that you could either take a shot
at getting the file exactly configured how you want, or accepting a
version with any set of flags.

Of course, this idea goes in direct opposition to your statement about
not wanting to specify the filename.  I get that argument.  The intent
here was to allow portage to go hunting through trusted repositories
to find packages it can use without having to sync a lot of data - if
you know the exact filename then a simple GET tells you if it is there
or not.

-- 
Rich



Re: [gentoo-dev] [GLEP78] Updating specification

2021-09-13 Thread Michał Górny
On Mon, 2021-09-13 at 12:08 +0200, Ulrich Mueller wrote:
> > > > > > On Mon, 13 Sep 2021, Sheng Yu wrote:
> 
> > -The archive contains a number of files, stored in a single
> > directory
> > -whose name should match the basename of the package file.  However,
> > -the implementation must be able to process an archive where
> > -the directory name is mismatched.  There should be no explicit
> > archive
> > -member entry for the directory.
> > +The archive contains a number of files.  All package-related files
> > +should be stored in a single directory whose name matches the CPV
> > of
> > +the package file.  However, the implementation must be able to
> > process
> > +an archive where the directory name is mismatched.  There should be
> > no
> > +explicit archive member entry for the directory.
> 
> I wonder about CPV here. That's ${CATEGORY}/${P} and contains a slash,
> so it cannot be the name of a directory. Also, what about the package
> revision?

Please restore the previous wording.  The GLEP deliberately did not
enforce a specific filename because it's about internal format.

> 
> > +6. The package manifest data file ``Manifest`` (required).
> > +
> > +7. A signature for the package Manifest file ``Manifest.sig``
> > +   (optional).
> 
> Given that the outer archive is uncompressed tar, every file will be
> zero-padded to a full block which adds some amount of bloat. So, could
> the signature be inlined in the Manifest file? That's also what GLEP
> 74
> specifies.

Using inline signature in Manifest makes sense.

> 
> Also, IIRC one of the goals of the format was to allow partial
> download
> of metadata. That will only work if the Manifest file will be the
> first
> file in the archive (or at least appear before the image archive).

I disagree.  This is solved by having detached metadata signature -- you
can do a partial fetch and verify the metadata directly.

On the other hand, putting Manifest first would make it impossible to
create the archive from data stream without using temporary files,
effectively doubling the needed free space.  Well, technically you could
just reserve space and write Manifest later but that would strongly
depend on the size of PGP signature and that's not something I'd feel
comfortable relying on.

-- 
Best regards,
Michał Górny





[gentoo-dev] [GLEP78] Updating specification

2021-09-13 Thread Sheng Yu
Hi,

We are updating the draft of GLEP78 "Gentoo Binary Package Container
Format" to address some security issues, and have included some new designs
for this purpose.

The new draft and a diff against the old version are attached.

Please feel free to give any comments and suggestions.

Thanks,
Sheng Yu

---
GLEP: 78
Title: Gentoo binary package container format
Author: Michał Górny,
        Sheng Yu
Type: Standards Track
Status: Draft
Version: 1
Created: 2018-11-15
Last-Modified: 2021-09-13
Post-History: 2018-11-17, 2019-07-08, 2021-09-13
Content-Type: text/x-rst
---

Abstract
========

This GLEP proposes a new binary package container format for Gentoo.
The current tbz2/XPAK format is briefly described, and its deficiencies
are explained.  Accordingly, the requirements for a new format are set
and a gpkg format satisfying them is proposed.  The rationale for
the design decisions is provided.


Motivation
==========

The current Portage binary package format
-----------------------------------------

The historical ``.tbz2`` binary package format used by Portage is
a concatenation of two distinct formats: header-oriented compressed .tar
format (used to hold package files) and trailer-oriented custom XPAK
format (used to hold metadata)  [#MAN-XPAK]_.  The format has already
been extended incompatibly twice.

The first time, support for storing multiple successive builds of a binary
package for a single ebuild version was added.  This feature relies
on appending an additional hyphen followed by an integer to the package
filename.  It is disabled by default (preserving backwards
compatibility) and controlled by the ``binpkg-multi-instance`` feature.

The second time, support for additional compression formats was
added.  When a format other than bzip2 is used, the ``.tbz2`` suffix
is replaced by ``.xpak`` and Portage relies on magic bytes to detect
the compression used.  For backwards compatibility, Portage still defaults
to using bzip2; the compression program can be switched using the
``BINPKG_COMPRESS`` configuration variable.

Additionally, there have been minor changes to the stored metadata
and file storage policies.  In particular, behavior regarding
``INSTALL_MASK``, controllable file compression and stripping has
changed over time.


The advantages of tbz2/XPAK format
----------------------------------

The tbz2/XPAK format used by Portage has three interesting features:

1. **Each binary package is fully contained within a single file.**
   While this might seem unnecessary, it makes it easier for the user
   to transfer binary packages without having to be concerned about
   finding all the necessary files to transfer.

2. **The binary packages are compatible with regular compressed
   tarballs, most of the time.**  With the notable exceptions of historical
   versions of pbzip2 and the recent zstd compressor, tbz2/XPAK packages
   can be extracted using the regular tar utility with a compressor
   implementation that discards trailing garbage.

3. **The metadata is uncompressed, and can be efficiently accessed
   without decompressing package contents.**  This includes
   the possibility of rewriting it (e.g. as a result of package moves)
   without the necessity of repacking the files.


Transparency problem with the current binary package format
-----------------------------------------------------------

Notwithstanding its advantages, the tbz2/XPAK format has a significant
design fault that consists of two issues:

1. **The XPAK format is a custom binary format with explicit use
   of binary-encoded file offsets and field lengths.**  As such, it is
   non-trivial to read or edit without specialized tools.  Such tools
   are currently implemented separately from the package manager,
   as part of the portage-utils toolkit, written in C [#PORTAGE-UTILS]_.

2. **The tarball compatibility feature relies on the obscure feature of
   ignoring trailing garbage in compressed files.**  While this is
   implemented consistently in most compressors, this feature
   is not really part of the specifications but rather traditional
   behavior.  Given that the original reasons for this no longer apply,
   new compressor implementations are likely to miss support for this.

Both of the issues make the format hard to use without dedicated tools,
or when the tools misbehave.  This impacts the following scenarios:

A. **Using binary packages for system recovery.**  In case of serious
   breakage, it is really preferable that the format depends on as few
   tools as possible, and especially not on Gentoo-specific tools.

B. **Inspecting binary packages in detail exceeding standard package
   manager facilities.**

C. **Modifying binary packages in ways not predicted by the package
   manager authors.**  A real-life example of this is working around
   broken ``pkg_*`` phases which prevent the package from being
   installed.


OpenPGP extensibility problem
-----------------------------

There are at least three obvious ways in 

Re: [gentoo-dev] [GLEP78] Updating specification

2021-09-13 Thread Ulrich Mueller
> On Mon, 13 Sep 2021, Sheng Yu wrote:

> -The archive contains a number of files, stored in a single directory
> -whose name should match the basename of the package file.  However,
> -the implementation must be able to process an archive where
> -the directory name is mismatched.  There should be no explicit archive
> -member entry for the directory.
> +The archive contains a number of files.  All package-related files
> +should be stored in a single directory whose name matches the CPV of
> +the package file.  However, the implementation must be able to process
> +an archive where the directory name is mismatched.  There should be no
> +explicit archive member entry for the directory.

I wonder about CPV here. That's ${CATEGORY}/${P} and contains a slash,
so it cannot be the name of a directory. Also, what about the package
revision?

> +6. The package manifest data file ``Manifest`` (required).
> +
> +7. A signature for the package Manifest file ``Manifest.sig``
> +   (optional).

Given that the outer archive is uncompressed tar, every file will be
zero-padded to a full block which adds some amount of bloat. So, could
the signature be inlined in the Manifest file? That's also what GLEP 74
specifies.

Also, IIRC one of the goals of the format was to allow partial download
of metadata. That will only work if the Manifest file will be the first
file in the archive (or at least appear before the image archive).

> +The implementation follows the Manifest specifications in GLEP 74
> +[#GLEP74]_ and uses the DATA tag for files within the archive.

AFAICS, GLEP 74 specifies an OpenPGP cleartext signature in the file
itself, not a detached signature.

Ulrich

