Re: [gentoo-dev] [pre-GLEP r2] Gentoo binary package container format

2018-11-21 Thread Michał Górny
On Wed, 2018-11-21 at 14:10 +0100, Fabian Groffen wrote:
> On 20-11-2018 21:33:17 +0100, Michał Górny wrote:
> > The volume label
> > ----------------
> > 
> > The volume label provides an easy way for users to identify the binary
> > package without dedicated tooling or specific format knowledge.
> > 
> > The implementations should include a volume label consisting of fixed
> > string ``gpkg:``, followed by a single space, followed by full package
> > identifier.  However, the implementations must not rely on the volume
> > label being present or attempt to parse its value when it is.
> > 
> > Furthermore, since the volume label is included in the .tar archive
> > as the first member, it provides a magic string at a fixed location
> > that can be used by tools such as file(1) to easily distinguish Gentoo
> > binary packages from regular .tar archives.
> 
> Just for clarity on this point.
> Are you proposing that we patch file(1) to print the Volume Header here?
> file-5.35 seems to not say much but "tar archive" or "POSIX tar archive"
> for tar-files containing a Volume Header as shown by tar -tv.

I'm wondering about that as well, yes.  However, my main idea is to
specifically detect 'gpkg:' there and use it to explicitly identify
the file as a Gentoo binary package (and print the package name).
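
As a rough illustration of that detection idea, here is a minimal Python sketch.  It assumes the label is stored the way GNU tar stores volume labels, i.e. as a first header block with type flag ``V`` and the label text in the name field.  The function names and the sample package identifier are hypothetical, and the demo block skips the header checksum, so this is not how file(1) or any package manager actually does it.

```python
from typing import Optional

BLOCK_SIZE = 512            # tar headers are 512-byte blocks
NAME_END = 100              # the member name occupies bytes 0..99
TYPEFLAG_OFFSET = 156       # single-byte type flag
GNU_VOLUME_HEADER = b"V"    # GNU tar type flag for a volume label
LABEL_PREFIX = b"gpkg: "    # fixed string plus a single space


def gpkg_label(first_block: bytes) -> Optional[str]:
    """Return the identifier from a gpkg volume label, or None."""
    if len(first_block) < BLOCK_SIZE:
        return None
    typeflag = first_block[TYPEFLAG_OFFSET:TYPEFLAG_OFFSET + 1]
    if typeflag != GNU_VOLUME_HEADER:
        return None
    # The name field is NUL-padded; keep only the text before the padding.
    name = first_block[:NAME_END].split(b"\0", 1)[0]
    if not name.startswith(LABEL_PREFIX):
        return None
    return name[len(LABEL_PREFIX):].decode("utf-8", errors="replace")


def make_label_block(label: bytes) -> bytes:
    """Build a bare header block for demonstration (checksum omitted)."""
    block = bytearray(BLOCK_SIZE)
    block[0:len(label)] = label
    block[TYPEFLAG_OFFSET] = ord("V")
    return bytes(block)


# Hypothetical package identifier, used only for the demonstration.
demo = make_label_block(b"gpkg: app-misc/foo-1.0")
identifier = gpkg_label(demo)
```

This kind of check is meant for identification by generic tools only; the draft explicitly says implementations must not rely on the label or parse it.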

> 
> > Container and archive formats
> > -----------------------------
> > 
> > During the debate, the actual archive formats to use were considered.
> > The .tar format seemed an obvious choice for the image archive since
> > it is the only widely deployed archive format that stores all kinds
> > of file metadata on POSIX systems.  However, multiple options for
> > the outer format have been debated.
> 
> You mention POSIX, which triggered me.  I think it would be good to
> specify which tar format to use.
> 
> POSIX.1-2001/pax format doesn't have a 100/256 char filename length
> restriction, which is good but it is not (yet) used by default by GNU
> tar.  busybox tar can read pax tars, it seems.
> 

I think the modern GNU tar format is the obvious choice here.  I think
it doesn't suffer any portability problems these days, and is more
compact than the PAX format.
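
To make the filename-length point concrete, here is a small sketch using Python's standard ``tarfile`` module.  It only shows which formats accept a long member name; it says nothing about relative archive size.  The helper name ``can_store`` is made up for this example.

```python
import io
import tarfile


def can_store(name: str, fmt: int) -> bool:
    """Report whether an empty member called `name` fits the tar format."""
    info = tarfile.TarInfo(name)
    info.size = 0
    buf = io.BytesIO()
    try:
        with tarfile.open(fileobj=buf, mode="w", format=fmt) as tar:
            tar.addfile(info)
    except ValueError:
        # tarfile raises ValueError when a name cannot be represented.
        return False
    return True


# 300 characters without a slash cannot be split into ustar prefix + name.
long_name = "x" * 300

results = {
    "ustar": can_store(long_name, tarfile.USTAR_FORMAT),
    "gnu": can_store(long_name, tarfile.GNU_FORMAT),
    "pax": can_store(long_name, tarfile.PAX_FORMAT),
}
```

With this input the plain ustar format rejects the name, while both the GNU and pax formats store it through their respective extension headers.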


-- 
Best regards,
Michał Górny




Re: [gentoo-dev] [pre-GLEP r2] Gentoo binary package container format

2018-11-21 Thread Fabian Groffen
On 20-11-2018 21:33:17 +0100, Michał Górny wrote:
> The volume label
> ----------------
> 
> The volume label provides an easy way for users to identify the binary
> package without dedicated tooling or specific format knowledge.
> 
> The implementations should include a volume label consisting of fixed
> string ``gpkg:``, followed by a single space, followed by full package
> identifier.  However, the implementations must not rely on the volume
> label being present or attempt to parse its value when it is.
> 
> Furthermore, since the volume label is included in the .tar archive
> as the first member, it provides a magic string at a fixed location
> that can be used by tools such as file(1) to easily distinguish Gentoo
> binary packages from regular .tar archives.

Just for clarity on this point.
Are you proposing that we patch file(1) to print the Volume Header here?
file-5.35 seems to not say much but "tar archive" or "POSIX tar archive"
for tar-files containing a Volume Header as shown by tar -tv.

> Container and archive formats
> -----------------------------
> 
> During the debate, the actual archive formats to use were considered.
> The .tar format seemed an obvious choice for the image archive since
> it is the only widely deployed archive format that stores all kinds
> of file metadata on POSIX systems.  However, multiple options for
> the outer format have been debated.

You mention POSIX, which triggered me.  I think it would be good to
specify which tar format to use.

POSIX.1-2001/pax format doesn't have a 100/256 char filename length
restriction, which is good but it is not (yet) used by default by GNU
tar.  busybox tar can read pax tars, it seems.

Thanks,
Fabian

-- 
Fabian Groffen
Gentoo on a different level




Re: [gentoo-dev] [pre-GLEP r2] Gentoo binary package container format

2018-11-20 Thread Michał Górny
Hi,

On Sat, 2018-11-17 at 12:21 +0100, Michał Górny wrote:
> Here's a pre-GLEP draft based on the earlier discussion on gentoo-
> portage-dev mailing list.  The specification uses GLEP form as it
> provides for cleanly specifying the motivation and rationale.

Here's the third iteration.  Changes since r1:
- removed unnecessary OpenPGP details, made them out of scope,
- added explicit section on (lack of) versioning and how to recognize
packages and their compatibility,
- explained why squashfs is a no-go.


---
GLEP: 
Title: Gentoo binary package container format
Author: Michał Górny 
Type: Standards Track
Status: Draft
Version: 1
Created: 2018-11-15
Last-Modified: 2018-11-20
Post-History: 2018-11-17
Content-Type: text/x-rst
---

Abstract
========

This GLEP proposes a new binary package container format for Gentoo.
The current tbz2/XPAK format is briefly described, and its deficiencies
are explained.  Accordingly, the requirements for a new format are set
and a gpkg format satisfying them is proposed.  The rationale for
the design decisions is provided.


Motivation
==========

The current Portage binary package format
-----------------------------------------

The historical ``.tbz2`` binary package format used by Portage is
a concatenation of two distinct formats: a header-oriented compressed
.tar format (used to hold package files) and a trailer-oriented custom
XPAK format (used to hold metadata) [#MAN-XPAK]_.  The format has
already been extended incompatibly twice.

The first time, support for storing multiple successive builds of
a binary package for a single ebuild version was added.  This feature
relies on appending an additional hyphen, followed by an integer, to
the package filename.  It is disabled by default (preserving backwards
compatibility) and controlled by the ``binpkg-multi-instance`` feature.

The second time, support for additional compression formats was
added.  When a format other than bzip2 is used, the ``.tbz2`` suffix
is replaced by ``.xpak`` and Portage relies on magic bytes to detect
the compression used.  For backwards compatibility, Portage still
defaults to using bzip2; the compression program can be switched using
the ``BINPKG_COMPRESS`` configuration variable.
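
As an illustration of detection by magic bytes, the following is a minimal Python sketch.  The signatures listed are the well-known leading bytes of each stream format; the function name and the set of formats covered are assumptions made for this example and do not reflect Portage's actual code.

```python
import bz2
import gzip
import lzma
from typing import Optional

# Leading bytes identifying each compressed stream format.
MAGIC = (
    (b"BZh", "bzip2"),
    (b"\x1f\x8b", "gzip"),
    (b"\xfd7zXZ\x00", "xz"),
    (b"\x28\xb5\x2f\xfd", "zstd"),
)


def detect_compression(head: bytes) -> Optional[str]:
    """Guess the compressor from the first bytes of a file, or None."""
    for magic, name in MAGIC:
        if head.startswith(magic):
            return name
    return None


# Demonstration using streams produced by the standard library.
samples = {
    "bzip2": bz2.compress(b"data"),
    "gzip": gzip.compress(b"data"),
    "xz": lzma.compress(b"data"),
}
detected = {name: detect_compression(blob) for name, blob in samples.items()}
```

Only the first few bytes of the file are needed, so a reader can pick the right decompressor without relying on the filename suffix.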

Additionally, there have been minor changes to the stored metadata
and file storage policies.  In particular, behavior regarding
``INSTALL_MASK``, controllable file compression and stripping has
changed over time.


The advantages of tbz2/XPAK format
----------------------------------

The tbz2/XPAK format used by Portage has three interesting features:

1. **Each binary package is fully contained within a single file.**
   While this might seem unnecessary, it makes it easier for the user
   to transfer binary packages without having to be concerned about
   finding all the necessary files to transfer.

2. **The binary packages are compatible with regular compressed
   tarballs, most of the time.**  With the notable exceptions of
   historical versions of pbzip2 and the recent zstd compressor,
   tbz2/XPAK packages can be extracted using a regular tar utility with
   a compressor implementation that discards trailing garbage.

3. **The metadata is uncompressed, and can be efficiently accessed
   without decompressing package contents.**  This includes
   the possibility of rewriting it (e.g. as a result of package moves)
   without the necessity of repacking the files.
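
The "trailing garbage" behavior mentioned in item 2 can be sketched with Python's standard ``bz2`` module, whose decompressor stops at the end of the compressed stream and exposes whatever follows.  The trailer bytes here are a stand-in for the appended metadata, not real XPAK data, and the function name is hypothetical.

```python
import bz2
from typing import Tuple


def split_stream_and_trailer(blob: bytes) -> Tuple[bytes, bytes]:
    """Decompress one bzip2 stream and return (payload, trailing bytes)."""
    decomp = bz2.BZ2Decompressor()
    payload = decomp.decompress(blob)
    if not decomp.eof:
        raise ValueError("truncated bzip2 stream")
    # Everything after the end-of-stream marker is left untouched here.
    return payload, decomp.unused_data


# A compressed payload followed by unrelated bytes, as in a tbz2 file.
blob = bz2.compress(b"package files") + b"appended metadata"
payload, trailer = split_stream_and_trailer(blob)
```

A tool that silently discards the trailer extracts the package files without noticing the metadata; a tool that treats the trailer as an error cannot open the file at all.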


Transparency problem with the current binary package format
-----------------------------------------------------------

Notwithstanding its advantages, the tbz2/XPAK format has a significant
design fault that consists of two issues:

1. **The XPAK format is a custom binary format with explicit use
   of binary-encoded file offsets and field lengths.**  As such, it is
   non-trivial to read or edit without specialized tools.  Such tools
   are currently implemented separately from the package manager,
   as part of the portage-utils toolkit, written in C [#PORTAGE-UTILS]_.

2. **The tarball compatibility feature relies on the obscure feature of
   ignoring trailing garbage in compressed files.**  While this is
   implemented consistently in most compressors, it is not really
   a part of any specification but rather traditional behavior.  Given
   that the original reasons for this no longer apply, new compressor
   implementations are likely to lack support for it.
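
To show what the binary-encoded offsets and lengths from the first issue look like in practice, here is a short Python sketch that builds and parses an XPAK segment.  The layout used (``XPAKPACK``, two 32-bit big-endian lengths, an index of name/offset/length records, the data area, ``XPAKSTOP``) follows my reading of the xpak(5) manual page and should be verified against it; the function names and the sample keys are made up for illustration.

```python
import struct
from typing import Dict


def build_xpak(entries: Dict[str, bytes]) -> bytes:
    """Serialize entries into an XPAK segment (assumed layout)."""
    index = b""
    data = b""
    for name, value in entries.items():
        raw = name.encode("utf-8")
        index += struct.pack(">I", len(raw)) + raw
        # Offset of this value inside the data area, then its length.
        index += struct.pack(">II", len(data), len(value))
        data += value
    header = b"XPAKPACK" + struct.pack(">II", len(index), len(data))
    return header + index + data + b"XPAKSTOP"


def parse_xpak(blob: bytes) -> Dict[str, bytes]:
    """Parse an XPAK segment back into a name -> value mapping."""
    if blob[:8] != b"XPAKPACK" or blob[-8:] != b"XPAKSTOP":
        raise ValueError("not an XPAK segment")
    index_len, data_len = struct.unpack_from(">II", blob, 8)
    index = blob[16:16 + index_len]
    data = blob[16 + index_len:16 + index_len + data_len]
    entries = {}
    pos = 0
    while pos < len(index):
        (name_len,) = struct.unpack_from(">I", index, pos)
        pos += 4
        name = index[pos:pos + name_len].decode("utf-8")
        pos += name_len
        offset, length = struct.unpack_from(">II", index, pos)
        pos += 8
        entries[name] = data[offset:offset + length]
    return entries


# Hypothetical metadata keys, used only for the round trip below.
original = {"CATEGORY": b"app-misc\n", "PF": b"foo-1.0\n"}
roundtrip = parse_xpak(build_xpak(original))
```

Changing the length of any value shifts every later offset, which is why the segment cannot be edited safely by hand or with a plain text editor.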

Both of the issues make the format hard to use without dedicated tools,
or when the tools misbehave.  This impacts the following scenarios:

A. **Using binary packages for system recovery.**  In case of serious
   breakage, it is really preferable that the format depends on as few
   tools as possible, and especially not on Gentoo-specific tools.

B. **Inspecting binary packages in detail exceeding standard package
   manager facilities.**

C. **Modifying binary packages in ways not predicted by the package
   manager authors.**  A real-life example of this is working around
   broken ``pkg_*``