Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure

2018-01-29 Thread Andrew Barchuk
> I don't really like the idea of having to reshuffle all the mirrors all
> of a sudden.

If keeping invariable directory structure is the goal (even though it
should be possible to shuffle files among the directories without having
to retransfer the whole tree with hard links) in addition to be able to
determine the target directory without knowledge of directory layout as
Robin mentioned in another thread[1]:

> As for the problem you describe, one of the requirements in the
> discussion is that given ONLY the file or filename, and NOTHING ELSE, it
> should be possible to determine where in a hierarchy it should go. No
> prior knowledge about the hierarchy was permitted.

then hashing is indeed a better option.

1. 
https://archives.gentoo.org/gentoo-dev/message/1d69d65b47a18fdbe36f5edac58c7591

---
Andrew



Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure

2018-01-28 Thread Andrew Barchuk

> In order to use that for distfiles mirrors, such that clients could know
> where to fetch the files from, you'd need the mirror's http server to
> redirect the request to the appropriate location (since the location
> would not be predictable from the client side).

That's not necessary: the client can fetch listing of the top level distfiles/
which now would be hundreds of directories, not tens of thousands
of files. That would allow the client to calculate the correct directory to
fetch a distfile from locally. List of directories can be cached until an
attempt to fetch a distfile will result in a miss (meaning that distfiles were
redistributed).
-- 
Andrew



Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure

2018-01-28 Thread Andrew Barchuk
> To the contrary, that would not remain balanced, because your
> boundaries are entirely dependent on exactly what is in the tree at
> the moment you run your script. Now the package manager has to perform
> directory listing, sort and find the file name that's closest, open
> that directory, find the next closest filename (assuming multiple
> levels of hierarchy), and so on, or you have to store yet another
> index that duplicates information and takes additional space. Locating
> by distfile name hash is effectively free.

Sure, the tree won't be perfectly balanced but it will be pretty close.
E.g. if texlive-* dominates the tree today it will likely continue
dominating it for another 5 years. Statistical distribution of distfile
names will likely be changing very slowly.

Doing a binary search through a list of couple of hundred of directories
is really cheap. I don't see a reason to organize distfiles in a
multi-level hierarchy: e.g. if the goal is to keep no more than 1000
files in a folder than the limit of single level hierarchy is a million
which is more than enough for foreseeable future. The list of 500
directories takes 15kB when using full file names and will be couple of
times smaller when using only unique prefixes.

---
Andrew



Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure

2018-01-28 Thread Andrew Barchuk
[my apologies for posting the message to a wrong thread before]

Hi everyone,

> three possible solutions for splitting distfiles were listed:
>
> a. using initial portion of filename,
>
> b. using initial portion of file hash,
>
> c. using initial portion of filename hash.
>
> The significant advantage of the filename option was simplicity.  With
> that solution, the users could easily determine the correct subdirectory
> themselves.  However, it's significant disadvantage was very uneven
> shuffling of data.  In particular, the TeΧ Live packages alone count
> almost 23500 distfiles and all use a common prefix, making it impossible
> to split them further.
>
> The alternate option of using file hash has the advantage of having
> a more balanced split.


There's another option to use character ranges for each directory
computed in a way to have the files distributed evenly. One way to do
that is to use filename prefix of dynamic length so that each range
holds the same number of files. E.g. we would have Ab/, Ap/, Ar/ but
texlive-module-te/, texlive-module-th/, texlive-module-ti/. A similar
but simpler option is to use file names as range bounds (the same way
dictionaries use words to demarcate page bounds): each directory will
have a name of the first file located inside. This way files will be
distributed evenly and it's still easy to pick a correct directory where
a file will be located manually.

I have implemented a sketch of distfiles splitting that's using file
names as bounds in Python to demonstrate the idea (excuse possibly
non-idiomatic code, I'm not very versed in Python):

$ cat distfile-dirs.py
#!/usr/bin/env python3

import sys

"""
Builds list of dictionary directories to split the list of input files
into evenly. Each directory has name of the first file that is located
in the directory. Takes number of directories as an argument and reads
list of files from stdin. The resulting list or directories is printed
to stdout.
"""

dir_num = int(sys.argv[1])
distfiles = sys.stdin.read().splitlines()
distfile_num = len(distfiles)
dir_size = distfile_num / dir_num
# allows adding files in the beginning without repartitioning
dirs = ["0"]
next_dir = dir_size
while next_dir < distfile_num:
dirs.append(distfiles[round(next_dir)])
next_dir += dir_size
print("/\n".join(dirs) + "/")

$ cat pick-distfiles-dir.py
#!/usr/bin/env python3

"""
Picks the directory for a given file name. Takes a distfile name as an
argument. Reads sorted list of directories from stdin, name of each
directory is assumed to be the name of first file that's located inside.
"""

import sys

distfile = sys.argv[1]
dirs = sys.stdin.read().splitlines()
left = 0
right = len(dirs) - 1
while left < right:
pivot = round((left + right) / 2)
if (dirs[pivot] <= distfile):
left = pivot + 1
else:
right = pivot - 1

if distfile < dirs[right]:
print(dirs[right-1])
else:
print(dirs[right])

$ # distfiles.txt contains all the distfile names
$ head -n5 distfiles.txt
0CD9CDDE3F56BB5250D87C54592F04CBC24F03BF-wagon-provider-api-2.10.jar
0CE1EDB914C94EBC388F086C6827E8BDEEC71AC2-commons-lang-2.6.jar
0DCC973606CBD9737541AA5F3E76DED6E3F4D0D0-iri.jar
0ad-0.0.22-alpha-unix-build.tar.xz
0ad-0.0.22-alpha-unix-data.tar.xz

$ # calculate 500 directories to split distfiles into evenly
$ cat distfiles.txt | ./distfile-dirs.py 500 > dirs.txt
$ tail -n5 dirs.txt
xrmap-2.29.tar.bz2/
xview-3.2p1.4-18c.tar.gz/
yasat-700.tar.gz/
yubikey-manager-qt-0.4.0.tar.gz/
zimg-2.5.1.tar.gz

$ # pick a directory for xvinfo-1.0.1.tar.bz2
$ cat dirs.txt | ./pick-distfiles-dir.py xvinfo-1.0.1.tar.bz2
xview-3.2p1.4-18c.tar.gz/

Using the approach above the files will distributed evenly among the
directories keeping the possibility to determine the directory for a
specific file by hand. It's possible if necessary to keep the directory
structure unchanged for very long time and it will likely stay
well-balanced. Picking a directory for a file is very cheap. The only
obvious downside I see is that it's necessary to know list of
directories to pick the correct one (can be mitigated by caching the
list of directories if important). If it's desirable to make directory
names shorter or to look less like file names it's fairly easy to
achieve by keeping only unique prefixes of directories. For example:

xrmap-2.29.tar.bz2/
xview-3.2p1.4-18c.tar.gz/
yasat-700.tar.gz/
yubikey-manager-qt-0.4.0.tar.gz/
zimg-2.5.1.tar.gz/

will become

xr/
xv/
ya/
yu/
z/

Thanks for taking time to consider the suggestion.

---
Andrew



Re: [gentoo-dev] [News item review] Portage rsync tree verification (v4)

2018-01-28 Thread Andrew Barchuk
Hi everyone,

> three possible solutions for splitting distfiles were listed:
> 
> a. using initial portion of filename,
> 
> b. using initial portion of file hash,
> 
> c. using initial portion of filename hash.
> 
> The significant advantage of the filename option was simplicity.  With
> that solution, the users could easily determine the correct subdirectory
> themselves.  However, it's significant disadvantage was very uneven
> shuffling of data.  In particular, the TeΧ Live packages alone count
> almost 23500 distfiles and all use a common prefix, making it impossible
> to split them further.
> 
> The alternate option of using file hash has the advantage of having
> a more balanced split.

There's another option to use character ranges for each directory
computed in a way to have the files distributed evenly. One way to do
that is to use filename prefix of dynamic length so that each range
holds the same number of files. E.g. we would have Ab/, Ap/, Ar/ but
texlive-module-te/, texlive-module-th/, texlive-module-ti/. A similar
but simpler option is to use file names as range bounds (the same way
dictionaries use words to demarcate page bounds): each directory will
have a name of the first file located inside. This way files will be
distributed evenly and it's still easy to pick a correct directory where
a file will be located manually.

I have implemented a sketch of distfiles splitting that's using file
names as bounds in Python to demonstrate the idea (excuse possibly
non-idiomatic code, I'm not very versed in Python):

$ cat distfile-dirs.py
#!/usr/bin/env python3

import sys

"""
Builds list of dictionary directories to split the list of input files
into evenly. Each directory has name of the first file that is located
in the directory. Takes number of directories as an argument and reads
list of files from stdin. The resulting list or directories is printed
to stdout.
"""

dir_num = int(sys.argv[1])
distfiles = sys.stdin.read().splitlines()
distfile_num = len(distfiles)
dir_size = distfile_num / dir_num
# allows adding files in the beginning without repartitioning
dirs = ["0"]
next_dir = dir_size
while next_dir < distfile_num:
dirs.append(distfiles[round(next_dir)])
next_dir += dir_size
print("/\n".join(dirs) + "/")

$ cat pick-distfiles-dir.py
#!/usr/bin/env python3

"""
Picks the directory for a given file name. Takes a distfile name as an
argument. Reads sorted list of directories from stdin, name of each
directory is assumed to be the name of first file that's located inside.
"""

import sys

distfile = sys.argv[1]
dirs = sys.stdin.read().splitlines()
left = 0
right = len(dirs) - 1
while left < right:
pivot = round((left + right) / 2)
if (dirs[pivot] <= distfile):
left = pivot + 1
else:
right = pivot - 1

if distfile < dirs[right]:
print(dirs[right-1])
else:
print(dirs[right])

$ # distfiles.txt contains all the distfile names
$ head -n5 distfiles.txt
0CD9CDDE3F56BB5250D87C54592F04CBC24F03BF-wagon-provider-api-2.10.jar
0CE1EDB914C94EBC388F086C6827E8BDEEC71AC2-commons-lang-2.6.jar
0DCC973606CBD9737541AA5F3E76DED6E3F4D0D0-iri.jar
0ad-0.0.22-alpha-unix-build.tar.xz
0ad-0.0.22-alpha-unix-data.tar.xz

$ # calculate 500 directories to split distfiles into evenly
$ cat distfiles.txt | ./distfile-dirs.py 500 > dirs.txt
$ tail -n5 dirs.txt
xrmap-2.29.tar.bz2/
xview-3.2p1.4-18c.tar.gz/
yasat-700.tar.gz/
yubikey-manager-qt-0.4.0.tar.gz/
zimg-2.5.1.tar.gz

$ # pick a directory for xvinfo-1.0.1.tar.bz2
$ cat dirs.txt | ./pick-distfiles-dir.py xvinfo-1.0.1.tar.bz2
xview-3.2p1.4-18c.tar.gz/

Using the approach above the files will distributed evenly among the
directories keeping the possibility to determine the directory for a
specific file by hand. It's possible if necessary to keep the directory
structure unchanged for very long time and it will likely stay
well-balanced. Picking a directory for a file is very cheap. The only
obvious downside I see is that it's necessary to know list of
directories to pick the correct one (can be mitigated by caching the
list of directories if important). If it's desirable to make directory
names shorter or to look less like file names it's fairly easy to
achieve by keeping only unique prefixes of directories. For example:

xrmap-2.29.tar.bz2/
xview-3.2p1.4-18c.tar.gz/
yasat-700.tar.gz/
yubikey-manager-qt-0.4.0.tar.gz/
zimg-2.5.1.tar.gz/

will become

xr/
xv/
ya/
yu/
z/

Thanks for taking time to consider the suggestion.

---
Andrew