Re: [gentoo-dev] [News item review] Portage rsync tree verification (v4)

2018-01-28 Thread Robin H. Johnson
On Sun, Jan 28, 2018 at 09:30:31PM +0100, Andrew Barchuk wrote:
> Hi everyone,
> 
> > three possible solutions for splitting distfiles were listed:
> There's another option to use character ranges for each directory
> computed in a way to have the files distributed evenly. One way to do
> that is to use filename prefix of dynamic length so that each range
> holds the same number of files. E.g. we would have Ab/, Ap/, Ar/ but
> texlive-module-te/, texlive-module-th/, texlive-module-ti/. A similar
> but simpler option is to use file names as range bounds (the same way
> dictionaries use words to demarcate page bounds): each directory will
> have a name of the first file located inside. This way files will be
> distributed evenly and it's still easy to pick a correct directory where
> a file will be located manually.
This was discussed early on, but thank you for the reminder, as it got
dropped from later discussions.

> [snip code]
> Using the approach above the files will distributed evenly among the
> directories keeping the possibility to determine the directory for a
> specific file by hand. It's possible if necessary to keep the directory
> structure unchanged for very long time and it will likely stay
> well-balanced. Picking a directory for a file is very cheap. The only
> obvious downside I see is that it's necessary to know list of
> directories to pick the correct one (can be mitigated by caching the
> list of directories if important). If it's desirable to make directory
> names shorter or to look less like file names it's fairly easy to
> achieve by keeping only unique prefixes of directories. For example:
As for the problem you describe, one of the requirements in the
discussion is that given ONLY the file or filename, and NOTHING ELSE, it
should be possible to determine where in a hierarchy it should go. No
prior knowledge about the hierarchy was permitted. Some parties might
answer that you just need an index file then, but that means you have to
keep the index file in sync often.

It's a superbly readable result (in the general class of perfect hashes
based on lots of well-known input). The class of solution suffers
another problem in addition the one you noted: if input changes
sufficiently, then rebalancing is expensive/hard.

As a concrete example, say we add a new category for something something
with lots of common prefixes in distfiles. 
dev-scratch/ as an example, where all distfiles start with 'scratch-'.
Unless we know up-front that we're going to add a thousand distfiles
here (not unreasonable, dev-python is ~1800 packages), they might start
by going into the 'sc' directory, but later we want them to be in
'scratch', as the tree is unweighted otherwise.

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail   : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136


signature.asc
Description: Digital signature


Re: [gentoo-dev] [News item review] Portage rsync tree verification (v4)

2018-01-28 Thread Andrew Barchuk
Hi everyone,

> three possible solutions for splitting distfiles were listed:
> 
> a. using initial portion of filename,
> 
> b. using initial portion of file hash,
> 
> c. using initial portion of filename hash.
> 
> The significant advantage of the filename option was simplicity.  With
> that solution, the users could easily determine the correct subdirectory
> themselves.  However, it's significant disadvantage was very uneven
> shuffling of data.  In particular, the TeΧ Live packages alone count
> almost 23500 distfiles and all use a common prefix, making it impossible
> to split them further.
> 
> The alternate option of using file hash has the advantage of having
> a more balanced split.

There's another option to use character ranges for each directory
computed in a way to have the files distributed evenly. One way to do
that is to use filename prefix of dynamic length so that each range
holds the same number of files. E.g. we would have Ab/, Ap/, Ar/ but
texlive-module-te/, texlive-module-th/, texlive-module-ti/. A similar
but simpler option is to use file names as range bounds (the same way
dictionaries use words to demarcate page bounds): each directory will
have a name of the first file located inside. This way files will be
distributed evenly and it's still easy to pick a correct directory where
a file will be located manually.

I have implemented a sketch of distfiles splitting that's using file
names as bounds in Python to demonstrate the idea (excuse possibly
non-idiomatic code, I'm not very versed in Python):

$ cat distfile-dirs.py
#!/usr/bin/env python3

import sys

"""
Builds list of dictionary directories to split the list of input files
into evenly. Each directory has name of the first file that is located
in the directory. Takes number of directories as an argument and reads
list of files from stdin. The resulting list or directories is printed
to stdout.
"""

dir_num = int(sys.argv[1])
distfiles = sys.stdin.read().splitlines()
distfile_num = len(distfiles)
dir_size = distfile_num / dir_num
# allows adding files in the beginning without repartitioning
dirs = ["0"]
next_dir = dir_size
while next_dir < distfile_num:
dirs.append(distfiles[round(next_dir)])
next_dir += dir_size
print("/\n".join(dirs) + "/")

$ cat pick-distfiles-dir.py
#!/usr/bin/env python3

"""
Picks the directory for a given file name. Takes a distfile name as an
argument. Reads sorted list of directories from stdin, name of each
directory is assumed to be the name of first file that's located inside.
"""

import sys

distfile = sys.argv[1]
dirs = sys.stdin.read().splitlines()
left = 0
right = len(dirs) - 1
while left < right:
pivot = round((left + right) / 2)
if (dirs[pivot] <= distfile):
left = pivot + 1
else:
right = pivot - 1

if distfile < dirs[right]:
print(dirs[right-1])
else:
print(dirs[right])

$ # distfiles.txt contains all the distfile names
$ head -n5 distfiles.txt
0CD9CDDE3F56BB5250D87C54592F04CBC24F03BF-wagon-provider-api-2.10.jar
0CE1EDB914C94EBC388F086C6827E8BDEEC71AC2-commons-lang-2.6.jar
0DCC973606CBD9737541AA5F3E76DED6E3F4D0D0-iri.jar
0ad-0.0.22-alpha-unix-build.tar.xz
0ad-0.0.22-alpha-unix-data.tar.xz

$ # calculate 500 directories to split distfiles into evenly
$ cat distfiles.txt | ./distfile-dirs.py 500 > dirs.txt
$ tail -n5 dirs.txt
xrmap-2.29.tar.bz2/
xview-3.2p1.4-18c.tar.gz/
yasat-700.tar.gz/
yubikey-manager-qt-0.4.0.tar.gz/
zimg-2.5.1.tar.gz

$ # pick a directory for xvinfo-1.0.1.tar.bz2
$ cat dirs.txt | ./pick-distfiles-dir.py xvinfo-1.0.1.tar.bz2
xview-3.2p1.4-18c.tar.gz/

Using the approach above the files will distributed evenly among the
directories keeping the possibility to determine the directory for a
specific file by hand. It's possible if necessary to keep the directory
structure unchanged for very long time and it will likely stay
well-balanced. Picking a directory for a file is very cheap. The only
obvious downside I see is that it's necessary to know list of
directories to pick the correct one (can be mitigated by caching the
list of directories if important). If it's desirable to make directory
names shorter or to look less like file names it's fairly easy to
achieve by keeping only unique prefixes of directories. For example:

xrmap-2.29.tar.bz2/
xview-3.2p1.4-18c.tar.gz/
yasat-700.tar.gz/
yubikey-manager-qt-0.4.0.tar.gz/
zimg-2.5.1.tar.gz/

will become

xr/
xv/
ya/
yu/
z/

Thanks for taking time to consider the suggestion.

---
Andrew



Re: [gentoo-dev] [News item review] Portage rsync tree verification (v4)

2018-01-28 Thread Michał Górny
Hopefully the final version.

---
Title: Portage rsync tree verification
Author: Michał Górny 
Posted: 2018-01-xx
Revision: 1
News-Item-Format: 2.0
Display-If-Installed: sys-apps/portage

Starting with sys-apps/portage-2.3.22, Portage will verify the Gentoo
repository after rsync by default.

The new verification is intended for users who are syncing via rsync.
Verification mechanisms for other methods of sync will be provided
in the future.

This does not affect users syncing using git and other methods.
Appropriate verification mechanisms for them will be provided
in the future.

The verification is implemented via app-portage/gemato. Currently,
the whole repository is verified after syncing. On systems with slow
hard drives, this could take around 2 minutes. If you wish to disable
it, you can disable the 'rsync-verify' USE flag on sys-apps/portage
or set 'sync-rsync-verify-metamanifest = no' in your repos.conf.

Please note that the verification currently does not prevent Portage
from using the repository after syncing. If 'emerge --sync' fails,
do not install any packages and retry syncing. In case of prolonged
or frequent verification failures, please make sure to report a bug
including the failing mirror addresses (found in emerge.log).

The verification uses information from the binary keyring provided
by the app-crypt/gentoo-keys package. The keys are refreshed
from the keyserver before every use in order to check for revocation.
The post-sync verification ensures that the key package is verified
itself. However, manual verification is required before the first use.

On Gentoo installations created using installation media that included
portage-2.3.22, the keys will already be covered by the installation
media signatures. On existing installations, you need to manually
compare the primary key fingerprint (reported by gemato on every sync)
against the official Gentoo keys [1]. An example gemato output is:

  INFO:root:Valid OpenPGP signature found:
  INFO:root:- primary key: 1234567890ABCDEF1234567890ABCDEF12345678
  INFO:root:- subkey: FEDCBA0987654321FEDCBA0987654321FEDCBA09

Please note that the above snippet does not include the real key id
on purpose. The primary key actually printed by gemato must match
the 'Gentoo Portage Snapshot Signing Key' on the website. Please make
sure to also check the certificate used for the secure connection
to the site!

[1]:https://www.gentoo.org/downloads/signatures/

-- 
Best regards,
Michał Górny