Re: [gentoo-portage-dev] Changing the VDB format

2022-04-11 Thread Sid Spry
On Mon, Apr 11, 2022, at 3:02 PM, Joshua Kinard wrote:
> On 3/13/2022 21:06, Matt Turner wrote:
>> The VDB uses a one-file-per-variable format. This has some
>> inefficiencies, with many file systems. For example the 'EAPI' file
>> that contains a single character will consume a 4K block on disk.
>> 
>> $ cd /var/db/pkg/sys-apps/portage-3.0.30-r1/
>> $ ls -lh --block-size=1 | awk 'BEGIN { sum = 0; } { sum += $5; } END {
>> print sum }'
>> 418517
>> $ du -sh --apparent-size .
>> 413K.
>> $ du -sh .
>> 556K.
>> 
>> During normal operations, portage has to read each of these 35+
>> files/package individually.
>> 
>> I suggest that we change the VDB format to a commonly used format that
>> can be quickly read by portage and any other tools. Combining these
>> 35+ files into a single file with a commonly used format should:
>> 
>> - speed up vdb access
>> - improve disk usage
>> - allow external tools to access VDB data more easily
>> 
>> I've attached a program that prints the VDB contents of a specified
>> package in different formats: json, toml, and yaml (and also Python
>> PrettyPrinter, just because). I think it's important to keep the VDB
>> format as plain-text for ease of manipulation, so I have not
>> considered anything like sqlite.
>> 
>> I expected to prefer toml, but I actually find it to be rather gross looking.
>
> Agreed, the toml output is rather "cluttered" looking.
>
>
>> I recommend json and think it is the best choice because:
>> 
>> - json provides the smallest on-disk footprint
>> - json is part of Python's standard library (so is yaml, and toml will
>> be in Python 3.11)
>> - Every programming language has multiple json parsers
>> -- lots of effort has been spent making them extremely fast.
>> 
>> I think we would have a significant time period for the transition. I
>> think I would include support for the new format in Portage, and ship
>> a tool with portage to switch back and forth between old and new
>> formats on-disk. Maybe after a year, drop the code from Portage to
>> support the old format?
>> 
>> Thoughts?
>
> I think json is the best format for storing the data on-disk.  It's intended
> to be a data serialization format to convert data from a non-specific memory
> format to a storable on-disk format and back again, so this is a perfect use
> for it.

Can we avoid adding another format? I find json very hard to edit by hand. It's
good at storing lots of data in a quasi-textual format, but is strict enough to
be obnoxious to work with.

Can the files not be concatenated? Doing so is similar to the tar suggestion,
but would keep everything very portage-like. Have the contents assigned to
variables. I am betting someone tried this at the start but settled on the
current scheme. Does anyone know why? (This would have to be done in bash
syntax, I assume.)
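
As a rough illustration of the concatenation idea, here is a minimal sketch that folds a package's one-file-per-variable directory into a single block of shell-style assignments. The function name `concat_vdb_entry` is hypothetical, and the sketch assumes that only files whose names are valid identifiers (EAPI, SLOT, and so on) are folded in; binary files such as environment.bz2 would need separate handling.

```python
import os
import shlex


def concat_vdb_entry(pkg_dir: str) -> str:
    """Return one shell-style NAME=value line per text file in pkg_dir."""
    lines = []
    for name in sorted(os.listdir(pkg_dir)):
        path = os.path.join(pkg_dir, name)
        # Assumption: skip anything that is not a plain file or whose name
        # could not be used as a variable name (e.g. environment.bz2).
        if not os.path.isfile(path) or not name.isidentifier():
            continue
        with open(path, encoding="utf-8") as f:
            value = f.read().rstrip("\n")
        # shlex.quote keeps spaces, quotes and newlines intact for a shell.
        lines.append("{0}={1}".format(name, shlex.quote(value)))
    return "\n".join(lines) + "\n"
```

The result can be sourced by a shell or parsed back with `shlex`, which keeps the format hand-editable.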

Alternatively, I think the tar suggestion is quite elegant. There are streaming
decompressors you can use from python. It adds an extra step to modify, but
that could be handled transparently by a dev mode. In dev mode, leave the files
after extraction and do not re-extract; for release mode, replace the archive
with what is on disk.
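
To make the tar variant concrete, a minimal sketch of reading a package's entries straight out of a compressed archive, without extracting anything to disk, might look like this. The function name `read_vdb_tar` and the idea of one archive per package are assumptions for illustration only.

```python
import tarfile


def read_vdb_tar(archive_path: str) -> dict:
    """Map member name -> raw bytes for every regular file in the archive."""
    entries = {}
    # Mode "r:*" lets tarfile detect the compression (gz, bz2, xz) itself.
    with tarfile.open(archive_path, mode="r:*") as tar:
        for member in tar:
            if member.isfile():
                entries[member.name] = tar.extractfile(member).read()
    return entries
```

A dev mode as described above would extract once and then read the plain files instead of calling something like this.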

Sid.



Re: [gentoo-portage-dev] [PATCH] portage.output: Replace darkblue colors with teal

2021-12-04 Thread Sid Spry
On Sat, Dec 4, 2021, at 3:47 AM, Fabian Groffen wrote:
> On 04-12-2021 10:24:23 +0100, Michał Górny wrote:
> Now, if you would make a supported claim that all terminals we install
> use a black background by default, your change becomes more valid.
>

I've wanted this change for many years, actually. It's annoying booting
into the admincd/installcd and having a lot of stuff be hard to see. When
I finally get a graphical environment up it goes away but for a different
reason.

Ages ago there were people complaining about colored terminals and
their eyesight -- part of it, I think, is the fact that some things display in
blue.

Sid.



Re: [gentoo-portage-dev] Speeding up Tree Verification

2020-07-01 Thread Sid Spry
On Wed, Jul 1, 2020, at 1:40 AM, Fabian Groffen wrote:
> On 30-06-2020 13:13:29 -0500, Sid Spry wrote:
> > On Tue, Jun 30, 2020, at 1:20 AM, Fabian Groffen wrote:
> > > Hi,
> > > 
> > > On 29-06-2020 21:13:43 -0500, Sid Spry wrote:
> > > > Hello,
> > > > 
> > > > I have some runnable pseudocode outlining a faster tree verification
> > > > algorithm. Before I create patches I'd like to see if there is any
> > > > guidance on making the changes as unobtrusive as possible. If the
> > > > radical change in algorithm is acceptable I can work on adding the
> > > > changes.
> > > > 
> > > > Instead of composing any kind of structured data out of the portage
> > > > tree my algorithm just lists all files and then optionally batches them
> > > > out to threads. There is a noticeable speedup by eliding the tree
> > > > traversal operations which can be seen when running the algorithm with
> > > > a single thread and comparing it to the current algorithm in gemato
> > > > (which should still be discussed here?).
> > > 
> > > I remember something that gemato used to use multiple threads, but
> > > because it totally saturated disk-IO, it was brought back to a single
> > > thread.  People were complaining about unusable systems.
> > > 
> > 
> > I think this is an argument for cgroups limits support on the portage
> > process or account as opposed to an argument against picking a better
> > algorithm. That is something I have been working towards, but I am only
> > one man.
> 
> But this requires a) cgroups support, and b) the privileges to use it.
> Shouldn't be a problem in the normal case, but just saying.
> 

cgroups kernel support is a fairly common dependency. It can obviously be
optional. I am thinking of something related to MAKEOPTS or
EMERGE_DEFAULT_OPTS (see: rustc/cargo not respecting or being passed -j/-l as
another use for cgroups) and supported best-effort, but is there any reason to
expect it to not be enabled?

If the user isn't either root or portage I think it reasonable to leave
resource management to the machine's administrator.



Re: [gentoo-portage-dev] Speeding up Tree Verification

2020-07-01 Thread Sid Spry
On Wed, Jul 1, 2020, at 1:40 AM, Fabian Groffen wrote:
> On 30-06-2020 13:13:29 -0500, Sid Spry wrote:
> > On Tue, Jun 30, 2020, at 1:20 AM, Fabian Groffen wrote:
> > > Hi,
> > > 
> > > On 29-06-2020 21:13:43 -0500, Sid Spry wrote:
> > > > Hello,
> > > > 
> > > > I have some runnable pseudocode outlining a faster tree verification
> > > > algorithm. Before I create patches I'd like to see if there is any
> > > > guidance on making the changes as unobtrusive as possible. If the
> > > > radical change in algorithm is acceptable I can work on adding the
> > > > changes.
> > > > 
> > > > Instead of composing any kind of structured data out of the portage
> > > > tree my algorithm just lists all files and then optionally batches them
> > > > out to threads. There is a noticeable speedup by eliding the tree
> > > > traversal operations which can be seen when running the algorithm with
> > > > a single thread and comparing it to the current algorithm in gemato
> > > > (which should still be discussed here?).
> > > 
> > > I remember something that gemato used to use multiple threads, but
> > > because it totally saturated disk-IO, it was brought back to a single
> > > thread.  People were complaining about unusable systems.
> > > 
> > 
> > I think this is an argument for cgroups limits support on the portage
> > process or account as opposed to an argument against picking a better
> > algorithm. That is something I have been working towards, but I am only
> > one man.
> 
> But this requires a) cgroups support, and b) the privileges to use it.
> Shouldn't be a problem in the normal case, but just saying.
> 
> > > In any case, can you share your performance results?  What speedup did
> > > you see, on warm and hot FS caches?  Which type of disk do you use?
> > > 
> > 
> > I ran all tests multiple times to make them warm off of a Samsung SSD, but
> > nothing very precise yet.
> > 
> > % gemato verify --openpgp-key signkey.asc /var/db/repos/gentoo
> > [...]
> > INFO:root:Verifying /var/db/repos/gentoo...
> > INFO:root:/var/db/repos/gentoo verified in 16.45 seconds
> > 
> > sometimes going higher, closer to 18s, vs.
> > 
> > % ./veriftree.py
> > 4.763171965983929
> > 
> > So roughly an order of magnitude speedup without batching to threads.
> 
> That is kind of a change.  Makes one wonder if you really did the same
> work.
> 

That was my initial reaction. I attempted to ensure I was processing all of
the files that gemato processed. The full output of my script is something
closer to:

% ./veriftree.py
x.xx
192157
126237

The first number is the time, the second the total number of manifest
directives, and the third the number of real files in the tree. If you prune
the directives that correspond to no file you end up with an exact match IIRC.
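
The pruning step can be sketched as follows. This is only an illustration under assumptions: `prune_missing` is a hypothetical helper, and the manifest paths are taken to be absolute already (for example, DIST entries describe distfiles that live outside the tree and so never match a file on disk).

```python
import os


def prune_missing(record_paths, tree_root):
    """Split manifest paths into those present in the tree and those absent."""
    on_disk = set()
    for path, _dirs, files in os.walk(tree_root):
        for name in files:
            on_disk.add(os.path.join(path, name))
    present = [p for p in record_paths if p in on_disk]
    missing = [p for p in record_paths if p not in on_disk]
    return present, missing, on_disk
```

Comparing `len(present)` with `len(on_disk)` gives the kind of sanity check described above.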

However, you are right, and I think this is old code. gemato times the
manifest file parsing as well as the verification. It seems this change is not
in the code I provided. If I do that instead, I get:

% ./veriftree.py
11.708862617029808
192157
126237

With corresponding times for gemato (at same system state, etc.) being ~20s. So
it is a halving at worst with assured n-core speedup for 1/2 of that time, and
I am fairly confident I can speed up the manifest parsing even more as well.
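
The gap between the two timings comes down to which phases sit inside the timed region. A minimal sketch of timing the phases separately, assuming hypothetical `parse` and `verify` callables (they could be bound methods such as `parse_manifests` and `verify_manifests`):

```python
from timeit import default_timer


def time_phases(parse, verify):
    """Run both phases and report how long each one took, in seconds."""
    t0 = default_timer()
    parse()
    t1 = default_timer()
    verify()
    t2 = default_timer()
    return {"parse": t1 - t0, "verify": t2 - t1, "total": t2 - t0}
```

Reporting the phases separately makes it clearer which part of the run would benefit from batching to threads.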

> > > You could compare against qmanifest, which uses OpenMP-based
> > > parallelism while verifying the tree.  On SSDs this does help.
> > > 
> > 
> > I lost my notes -- how do I specify to either gemato or qmanifest the GnuPG
> > directory? My code is partially structured as it is because I had problems
> > doing this. I rediscovered -K/--openpgp-key in gemato but am unsure for
> > qmanifest.
> 
> qmanifest doesn't do much magic out of the standard gnupg practices.
> (It is using gpgme.)  If you want it to use a different gnupg dir, you
> may change HOME, or GNUPGHOME.
> 

Alright, I will attempt to set that. I think I like the interface of gemato a
little more but will look at qmanifest and see how it performs.
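
Setting GNUPGHOME for a child process can be sketched like this. The helper names are hypothetical, and the command to run (qmanifest or anything else built on gpgme) is left to the caller since its options are not covered here.

```python
import os
import subprocess


def gnupg_env(home: str) -> dict:
    """Copy the current environment and point GNUPGHOME at `home`."""
    env = dict(os.environ)
    env["GNUPGHOME"] = home
    return env


def run_with_gnupg_home(cmd, home):
    """Run `cmd` (a list of arguments) with an isolated GnuPG directory."""
    return subprocess.run(cmd, env=gnupg_env(home), check=False)
```

A directory from `tempfile.TemporaryDirectory()` works as `home`; it is created with mode 0700, so GnuPG should not complain about unsafe permissions.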



Re: [gentoo-portage-dev] Speeding up Tree Verification

2020-06-30 Thread Sid Spry
On Tue, Jun 30, 2020, at 2:29 PM, Michał Górny wrote:
> On Tue, 2020-06-30 at 12:50 -0500, Sid Spry wrote:
> > On Tue, Jun 30, 2020, at 2:28 AM, Michał Górny wrote:
> > > On June 30, 2020 2:13:43 AM UTC, Sid Spry wrote:
> > > > Hello,
> > > > 
> > > > I have some runnable pseudocode outlining a faster tree verification
> > > > algorithm. Before I create patches I'd like to see if there is any
> > > > guidance on making the changes as unobtrusive as possible. If the
> > > > radical change in algorithm is acceptable I can work on adding the
> > > > changes.
> > > > 
> > > > Instead of composing any kind of structured data out of the portage
> > > > tree my algorithm just lists all files and then optionally batches them
> > > > out to threads. There is a noticeable speedup by eliding the tree
> > > > traversal operations which can be seen when running the algorithm with
> > > > a single thread and comparing it to the current algorithm in gemato
> > > > (which should still be discussed here?).
> > > 
> > > Without reading the code: does your algorithm correctly detect extraneous 
> > > files?
> > > 
> > 
> > Yes and no.
> > 
> > I am not sure why this is necessary. If the file does not appear in a
> > manifest it is ignored. It makes the most sense to me to put the burden of
> > not including untracked files on the publisher. If the user puts an
> > untracked file into the tree it will be ignored to no consequence; the
> > authored files don't refer to it, after all.
> 
> This is necessary because a malicious third party can MITM you an rsync
> tree with extraneous files (say, -r1 baselayout ebuild) that do horrible
> things on your system.  If you don't reject files not in Manifest, you
> open a huge security hole.
> 

Ok, I will refer to https://www.gentoo.org/glep/glep-0074.html and implement
the checks in detail, but will still need to spend some time looking for the
best place to insert the code.

I think it best to address this from two fronts. On one hand rejecting extra
files seems to have immediate benefit, but the larger issue is portage exposing
untracked, potentially malicious files to the user.

Has anything like a verity loopback filesystem been explored? It might reduce
duplication of work.

> > But it would be easy enough to build a second list of all files and compare
> > it to the list of files built from the manifests. If there are extras an
> > error can be generated. This is actually the first test I did on my
> > manifest parsing code. I tried to see if my tracked files roughly matched
> > the total files in tree. That can be repurposed for this check.
> > 
> > > > Some simple tests like counting all objects traversed and verified
> > > > returns the same(ish). Once it is put into portage it could be tested
> > > > in detail.
> > > > 
> > > > There is also my partial attempt at removing the brittle interface to
> > > > GnuPG (it's not as if the current code is badly designed, just that
> > > > parsing the output of GnuPG directly is likely not the best idea).
> > > 
> > > The 'brittle interface' is well-defined machine-readable output.
> > > 
> > 
> > Ok. I was aware there was a machine interface, but the classes that
> > manipulate a temporary GPG home seemed like not the best solution. I guess
> > that is all due to GPG assuming everything is in ~/.gnupg and keeping its
> > state as a directory structure.
> 
> A temporary home directory guarantees that user configuration does not
> affect the verification result.
> 

Yes, I know why it is there. The temporary construction of the directory is what
stood out to me as messy but I guess there is no way around it.

> > 
> > > > Needs gemato, dnspython, and requests. Slightly better than random code
> > > > because I took inspiration from the existing gemato classes.
> > > 
> > > The code makes a lot of brittle assumptions about the structure. The 
> > > GLEP was specifically designed to avoid that and let us adjust the 
> > > structure in the future to meet our needs.
> > > 
> > 
> > These same assumptions are built into the code that operates on the
> > tree structure. If the GLEP were changed the existing code would al

Re: [gentoo-portage-dev] Speeding up Tree Verification

2020-06-30 Thread Sid Spry
On Tue, Jun 30, 2020, at 1:20 AM, Fabian Groffen wrote:
> Hi,
> 
> On 29-06-2020 21:13:43 -0500, Sid Spry wrote:
> > Hello,
> > 
> > I have some runnable pseudocode outlining a faster tree verification
> > algorithm. Before I create patches I'd like to see if there is any guidance
> > on making the changes as unobtrusive as possible. If the radical change in
> > algorithm is acceptable I can work on adding the changes.
> > 
> > Instead of composing any kind of structured data out of the portage tree my
> > algorithm just lists all files and then optionally batches them out to
> > threads. There is a noticeable speedup by eliding the tree traversal
> > operations which can be seen when running the algorithm with a single
> > thread and comparing it to the current algorithm in gemato (which should
> > still be discussed here?).
> 
> I remember something that gemato used to use multiple threads, but
> because it totally saturated disk-IO, it was brought back to a single
> thread.  People were complaining about unusable systems.
> 

I think this is an argument for cgroups limits support on the portage process or
account as opposed to an argument against picking a better algorithm. That is
something I have been working towards, but I am only one man.

> In any case, can you share your performance results?  What speedup did
> you see, on warm and hot FS caches?  Which type of disk do you use?
> 

I ran all tests multiple times to make them warm off of a Samsung SSD, but
nothing very precise yet.

% gemato verify --openpgp-key signkey.asc /var/db/repos/gentoo
[...]
INFO:root:Verifying /var/db/repos/gentoo...
INFO:root:/var/db/repos/gentoo verified in 16.45 seconds

sometimes going higher, closer to 18s, vs.

% ./veriftree.py
4.763171965983929

So roughly an order of magnitude speedup without batching to threads.

> You could compare against qmanifest, which uses OpenMP-based
> parallelism while verifying the tree.  On SSDs this does help.
> 

I lost my notes -- how do I specify to either gemato or qmanifest the GnuPG
directory? My code is partially structured as it is because I had problems doing
this. I rediscovered -K/--openpgp-key in gemato but am unsure for qmanifest.



Re: [gentoo-portage-dev] Speeding up Tree Verification

2020-06-30 Thread Sid Spry
On Tue, Jun 30, 2020, at 2:28 AM, Michał Górny wrote:
> On June 30, 2020 2:13:43 AM UTC, Sid Spry wrote:
> >Hello,
> >
> >I have some runnable pseudocode outlining a faster tree verification
> >algorithm. Before I create patches I'd like to see if there is any
> >guidance on making the changes as unobtrusive as possible. If the radical
> >change in algorithm is acceptable I can work on adding the changes.
> >
> >Instead of composing any kind of structured data out of the portage tree
> >my algorithm just lists all files and then optionally batches them out to
> >threads. There is a noticeable speedup by eliding the tree traversal
> >operations which can be seen when running the algorithm with a single
> >thread and comparing it to the current algorithm in gemato (which should
> >still be discussed here?).
> 
> Without reading the code: does your algorithm correctly detect extraneous 
> files?
> 

Yes and no.

I am not sure why this is necessary. If the file does not appear in a manifest
it is ignored. It makes the most sense to me to put the burden of not including
untracked files on the publisher. If the user puts an untracked file into the
tree it will be ignored to no consequence; the authored files don't refer to
it, after all.

But it would be easy enough to build a second list of all files and compare it
to the list of files built from the manifests. If there are extras an error can
be generated. This is actually the first test I did on my manifest parsing
code. I tried to see if my tracked files roughly matched the total files in
tree. That can be repurposed for this check.
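
A minimal sketch of that second-list comparison follows. The names `find_extraneous`, `check_tree`, and `ExtraneousFilesError` are made up for illustration, and the sketch assumes the caller has already added every file that is legitimately untracked (the top-level Manifest, anything covered by an ignore rule) to `tracked_paths`.

```python
import os


class ExtraneousFilesError(Exception):
    """Raised when the tree contains files no manifest accounts for."""


def find_extraneous(tree_root, tracked_paths):
    """Return the sorted list of files on disk that are not tracked."""
    tracked = {os.path.normpath(p) for p in tracked_paths}
    extras = []
    for path, _dirs, files in os.walk(tree_root):
        for name in files:
            full = os.path.normpath(os.path.join(path, name))
            if full not in tracked:
                extras.append(full)
    return sorted(extras)


def check_tree(tree_root, tracked_paths):
    """Fail loudly instead of silently ignoring untracked files."""
    extras = find_extraneous(tree_root, tracked_paths)
    if extras:
        raise ExtraneousFilesError("untracked files: " + ", ".join(extras))
```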

> >Some simple tests like counting all objects traversed and verified returns
> >the same(ish). Once it is put into portage it could be tested in detail.
> >
> >There is also my partial attempt at removing the brittle interface to
> >GnuPG (it's not as if the current code is badly designed, just that
> >parsing the output of GnuPG directly is likely not the best idea).
> 
> The 'brittle interface' is well-defined machine-readable output.
>

Ok. I was aware there was a machine interface, but the classes that manipulate
a temporary GPG home seemed like not the best solution. I guess that is all
due to GPG assuming everything is in ~/.gnupg and keeping its state as a
directory structure.

> >
> >Needs gemato, dnspython, and requests. Slightly better than random code
> >because I took inspiration from the existing gemato classes.
> 
> The code makes a lot of brittle assumptions about the structure. The 
> GLEP was specifically designed to avoid that and let us adjust the 
> structure in the future to meet our needs.
> 

These same assumptions are built into the code that operates on the
tree structure. If the GLEP were changed the existing code would also
potentially need changing. This code just uses the structure in a different
way.

I will admit my partial understanding of the entire GLEP. I made some
simplifications just to get something demonstrable done. However, please
consider removing or putting some of the checks elsewhere. I don't have
full suggestions right now, but there is the possibility of saving an
appreciable amount of time.



Re: [gentoo-portage-dev] Re: Speeding up Tree Verification

2020-06-30 Thread Sid Spry
On Mon, Jun 29, 2020, at 9:34 PM, Zac Medico wrote:
> On 6/29/20 7:15 PM, Sid Spry wrote:
> > On Mon, Jun 29, 2020, at 9:13 PM, Sid Spry wrote:
> >> Hello,
> >>
> >> I have some runnable pseudocode outlining a faster tree verification 
> >> algorithm.
> > 
> > Ah, right. It's worth noting that even faster than this algorithm is simply
> > verifying a .tar.xz. Is that totally off the table? I realize it doesn't
> > fit every usecase, but it seems to be faster in both sync and verification
> > time.
> 
> We've already got support for that with sync-type = webrsync. However, I
> imagine sync-type = git is even better. All of the types are covered here:
> 
> https://wiki.gentoo.org/wiki/Portage_Security

I'm being warned right now that webrsync-gpg is being deprecated; I've been
using it. It is, amazingly, faster than a typical rsync and may be faster than
a git pull though.

The issue with git is there are some analyses that indicate you shouldn't rely
on git for integrity, so you are back to verifying the tree on-disk, which is
slower than verifying the .tar.xz.

(To clarify: Even with signed commits the commit hashes could be attacked and
this is considered somewhat feasible.)



[gentoo-portage-dev] Re: Speeding up Tree Verification

2020-06-29 Thread Sid Spry
On Mon, Jun 29, 2020, at 9:13 PM, Sid Spry wrote:
> Hello,
> 
> I have some runnable pseudocode outlining a faster tree verification 
> algorithm.

Ah, right. It's worth noting that even faster than this algorithm is simply
verifying a .tar.xz. Is that totally off the table? I realize it doesn't fit
every usecase, but it seems to be faster in both sync and verification time.



[gentoo-portage-dev] Speeding up Tree Verification

2020-06-29 Thread Sid Spry
Hello,

I have some runnable pseudocode outlining a faster tree verification algorithm.
Before I create patches I'd like to see if there is any guidance on making the
changes as unobtrusive as possible. If the radical change in algorithm is
acceptable I can work on adding the changes.

Instead of composing any kind of structured data out of the portage tree my
algorithm just lists all files and then optionally batches them out to threads.
There is a noticeable speedup by eliding the tree traversal operations which
can be seen when running the algorithm with a single thread and comparing it to
the current algorithm in gemato (which should still be discussed here?).
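
The core idea, a flat list of (path, expected hash) pairs handed to a thread pool, can be sketched roughly as below. This is a simplified illustration, not the attached script: `verify_one` and `verify_all` are hypothetical names, and only SHA-512 is checked here.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def verify_one(item):
    """Hash one file in chunks and compare against the expected digest."""
    path, expected_sha512 = item
    h = hashlib.sha512()
    try:
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
    except OSError:
        # A missing or unreadable file counts as a failed verification.
        return path, False
    return path, h.hexdigest() == expected_sha512


def verify_all(items, workers=4):
    """Distribute the flat list over worker threads; return path -> bool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(verify_one, items))
```

Threads are a reasonable fit because hashlib releases the GIL while hashing larger buffers, so the work is mostly bound by disk I/O.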

Some simple tests like counting all objects traversed and verified returns the
same(ish). Once it is put into portage it could be tested in detail.

There is also my partial attempt at removing the brittle interface to GnuPG
(it's not as if the current code is badly designed, just that parsing the
output of GnuPG directly is likely not the best idea).

Needs gemato, dnspython, and requests. Slightly better than random code because
I took inspiration from the existing gemato classes.

```python (veriftree.py)
#!/usr/bin/env python3
import os, sys, zlib, hashlib, tempfile, shutil, timeit
import subprocess
from typing import List
from pprint import pprint

from gemato.manifest import (
    ManifestFile,
    ManifestFileEntry,
)
from wkd import (
    check_domain_signature,
    hash_localpart,
    build_web_key_uri,
    stream_to_file
)
from fetchmedia import (
    OpenPGPEnvironment,
    setup_verification_environment
)

# 0. Top level directory (repository) contains Manifest, a PGP signature of
#    blake2b and sha512 hashes of Manifest.files.gz.
# 1. Manifest.files contains hashes of each category Manifest.gz.
# 2. The category Manifest contains hashes of each package Manifest.
# 3. The package Manifest contains hashes of each package file.
#    Must be aware of PMS, e.g. aux tag specifies a file in files/.

# 0. Check signature of repo Manifest.
# 1. Merge items in Manifest.files, each category Manifest, and each package
#    Manifest into one big list. The path must be made absolute.
# 2. Distribute items to threads.

# To check operation compare directory tree to files appearing in all
# ManifestRecords.

class ManifestTree(object):
    __slots__ = ['_directory', '_manifest_list', '_manifest_records',
        '_manifest_results']

    def __init__(self, directory: str):
        self._directory = directory
        # Tuples of (base_path, full_path).
        self._manifest_list = []
        self._manifest_records = []
        self._manifest_results = []

    def build_manifest_list(self):
        for path, dirs, files in os.walk(self._directory):
            #if 'glsa' in path or 'news' in path:
            #if 'metadata' in path:
            #    continue # Skip the metadata directory for now.
            # It contains a repository. Current algo barfs on Manifest
            # containing only sig.

            if 'Manifest.files.gz' in files:
                self._manifest_list += [(path, path + '/Manifest.files.gz')]
            if 'Manifest.gz' in files:
                self._manifest_list += [(path, path + '/Manifest.gz')]

            if path == self._directory:
                continue # Skip the repo manifest. Order matters, fix eventually.
            if 'Manifest' in files:
                self._manifest_list += [(path, path + '/Manifest')]

    def parse_manifests(self):
        td = tempfile.TemporaryDirectory(dir='./')
        for manifest in self._manifest_list:
            def inner():
                if manifest[1].endswith('.gz'):
                    name = 'Manifest.files' # Need to also handle Manifest.gz.
                    path = '{0}/{1}'.format(td.name, name)
                    subprocess.run(['sh', '-c', 'gunzip -c {0} > {1}'
                        .format(manifest[1], path)])
                    for line in open(path):
                        mr = ManifestRecord(line)
                        mr.make_absolute(manifest[0])
                        self._manifest_records += [mr]
                else:
                    for line in open(manifest[1]):
                        if line.startswith('-'):
                            return # Skip the signed manifest.
                        mr = ManifestRecord(line)
                        mr.make_absolute(manifest[0])
                        self._manifest_records += [mr]
            inner()

    def verify_manifests(self):
        for record in self._manifest_records:
            self._manifest_results += [record.verify()]


class ManifestRecord(object):
    __slots__ = ['_tag', '_abs_path', '_path', '_size', '_hashes']

    def __init__(self, line: str=None):
        self._tag = None
        self._abs_path = None
        self._path = None
        self._size = None
        self._hashes = []
        if line:
            self.from_string(line)

    def 
```

Re: [gentoo-portage-dev] [PATCH 1/3] Add caching to catpkgsplit function

2020-06-28 Thread Sid Spry
On Sat, Jun 27, 2020, at 1:34 AM, Chun-Yu Shei wrote:
> According to cProfile, catpkgsplit is called up to 1-5.5 million times
> during "emerge -uDvpU --with-bdeps=y @world". Adding a dict to cache its
> results reduces the time for this command from 43.53 -> 41.53 seconds --
> a 4.8% speedup.
> ---
>  lib/portage/versions.py | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/lib/portage/versions.py b/lib/portage/versions.py
> index 0c21373cc..ffec316ce 100644
> --- a/lib/portage/versions.py
> +++ b/lib/portage/versions.py
> @@ -312,6 +312,7 @@ def _pkgsplit(mypkg, eapi=None):
>  
>  _cat_re = re.compile('^%s$' % _cat, re.UNICODE)
>  _missing_cat = 'null'
> +_catpkgsplit_cache = {}
>  
>  def catpkgsplit(mydata, silent=1, eapi=None):
>   """
> @@ -331,6 +332,11 @@ def catpkgsplit(mydata, silent=1, eapi=None):
>   return mydata.cpv_split
>   except AttributeError:
>   pass
> +
> + cache_entry = _catpkgsplit_cache.get(mydata)
> + if cache_entry is not None:
> + return cache_entry
> +
>   mysplit = mydata.split('/', 1)
>   p_split = None
>   if len(mysplit) == 1:
> @@ -343,6 +349,7 @@ def catpkgsplit(mydata, silent=1, eapi=None):
>   if not p_split:
>   return None
>   retval = (cat, p_split[0], p_split[1], p_split[2])
> + _catpkgsplit_cache[mydata] = retval
>   return retval
>  
>  class _pkg_str(_unicode):
> -- 
> 2.27.0.212.ge8ba1cc988-goog
> 

There are libraries that provide decorators, etc, for caching and memoization.
Have you evaluated any of those? One is available in the standard library:
https://docs.python.org/dev/library/functools.html#functools.lru_cache

I comment as this would increase code clarity.
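
For comparison, a minimal sketch of the decorator approach is shown below. `split_cat_pkg` is a toy stand-in invented for this example, not the real catpkgsplit, which returns a four-element tuple and takes more arguments.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def split_cat_pkg(cpv: str):
    """Toy stand-in for catpkgsplit: split 'cat/pkg-ver' at the first slash."""
    parts = cpv.split("/", 1)
    if len(parts) == 1:
        # Mirrors the patch's _missing_cat placeholder for a missing category.
        return ("null", parts[0])
    return (parts[0], parts[1])
```

One caveat: `lru_cache` keys on every argument and requires all of them to be hashable, so with the real signature the `silent` and `eapi` parameters would become part of the cache key.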



Re: [gentoo-portage-dev] [PATCH] Use env to find python

2020-06-16 Thread Sid Spry
On Tue, Jun 16, 2020, at 3:57 PM, Michał Górny wrote:
> On Tue, 2020-06-16 at 15:19 -0400, Mike Gilbert wrote:
> > On Tue, Jun 16, 2020 at 1:55 PM Zac Medico  wrote:
> > > On 6/16/20 10:46 AM, Mike Gilbert wrote:
> > > > On Tue, Jun 16, 2020 at 1:45 PM Mike Gilbert  wrote:
> > > > > On Mon, Jun 15, 2020 at 9:39 AM Sid Spry  wrote:
> > > > > > On Mon, Jun 15, 2020, at 2:36 AM, Ulrich Mueller wrote:
> > > > > > > But we know that it is in /usr/bin, so why add yet another 
> > > > > > > indirection?
> > > > > > > 
> > > > > > > Attachments:
> > > > > > > * signature.asc
> > > > > > 
> > > > > > Ah, sorry -- I forgot to note this here. If you wish to support 
> > > > > > prefix it is possible it may not be in /usr/bin. Granted I am not 
> > > > > > sure if the prefix stage3 I was using is old enough to be broken in 
> > > > > > some way, but adding this would prevent future breakage.
> > > > > 
> > > > > The portage ebuild and the python distutils module already take care
> > > > > of updating shebangs at install time.
> > > > 
> > > > I suppose your patch might be useful if you are trying to run portage
> > > > from a git checkout on a prefix system.
> > > > 
> > > 
> > > So, given that the ebuild updates shebangs automatically, shouldn't we
> > > optimize the default shebangs to be as flexible as possible?
> > 
> > Yes, that makes sense.
> > 
> > However, we should test to make sure that distutils is smart enough to
> > parse that "/usr/bin/env -S python" string and replace it with
> > version-specific python shebang.
> > 
> 
> '/usr/bin/env python' (with no extra options) is the portable shebang.
> 

I added `-S` to preserve the options passed via the shebang line. It seems they 
can be left off; does anyone know otherwise?
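
The reason for `-S` can be shown with a small model of how Linux handles a shebang line: everything after the interpreter path is passed as one single argument. The two helpers below are hypothetical and only approximate that behaviour and the splitting that `env -S` performs; note that not every `env` implementation supports `-S`.

```python
import shlex


def kernel_shebang_argv(line: str) -> list:
    """Model Linux shebang handling: interpreter plus at most one argument."""
    body = line[2:].strip()
    # Split once: the whole remainder stays together as a single argument.
    return body.split(None, 1)


def env_split_string(arg: str) -> list:
    """Approximate `env -S`: split its single argument into separate words."""
    if not arg.startswith("-S"):
        raise ValueError("expected an argument starting with -S")
    return shlex.split(arg[2:])
```

Under this model, `#!/usr/bin/env python -b` makes env look for a program literally named "python -b", while the `-S` form lets env split the string into `python` and `-b` itself.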



Re: [gentoo-portage-dev] [PATCH] Use env to find python

2020-06-15 Thread Sid Spry
On Mon, Jun 15, 2020, at 2:36 AM, Ulrich Mueller wrote:
> But we know that it is in /usr/bin, so why add yet another indirection?
> 
> Attachments:
> * signature.asc

Ah, sorry -- I forgot to note this here. If you wish to support prefix it is 
possible it may not be in /usr/bin. Granted I am not sure if the prefix stage3 
I was using is old enough to be broken in some way, but adding this would 
prevent future breakage.

I understand your concern but are these really in a hot path? Whatever the 
scripts are doing will beat the time it takes to invoke env by orders of 
magnitude. I can run benchmarks if you'd like, would the tests serve as such?



[gentoo-portage-dev] [PATCH] Use env to find python

2020-06-14 Thread Sid Spry
From b3854bd9791bb21d7284ef6284a3fb7d4b585412 Mon Sep 17 00:00:00 2001
From: Sid Spry 
Date: Sun, 14 Jun 2020 23:29:46 -0500
Subject: [PATCH] Use env to find python
To: gentoo-portage-dev@lists.gentoo.org

---
bin/archive-conf | 2 +-
bin/binhost-snapshot | 2 +-
bin/check-implicit-pointer-usage.py | 2 +-
bin/chmod-lite.py | 2 +-
bin/chpathtool.py | 2 +-
bin/clean_locks | 2 +-
bin/dispatch-conf | 2 +-
bin/dohtml.py | 2 +-
bin/doins.py | 2 +-
bin/ebuild | 2 +-
bin/ebuild-ipc.py | 2 +-
bin/egencache | 2 +-
bin/emaint | 2 +-
bin/emerge | 2 +-
bin/emirrordist | 2 +-
bin/env-update | 2 +-
bin/filter-bash-environment.py | 2 +-
bin/fixpackages | 2 +-
bin/glsa-check | 2 +-
bin/install.py | 2 +-
bin/lock-helper.py | 2 +-
bin/portageq | 2 +-
bin/quickpkg | 2 +-
bin/regenworld | 2 +-
bin/xattr-helper.py | 2 +-
bin/xpak-helper.py | 2 +-
lib/portage/tests/runTests.py | 2 +-
lib/portage/util/changelog.py | 2 +-
runtests | 2 +-
tabcheck.py | 2 +-
30 files changed, 30 insertions(+), 30 deletions(-)

diff --git a/bin/archive-conf b/bin/archive-conf
index 8341ffe73..36a4da07a 100755
--- a/bin/archive-conf
+++ b/bin/archive-conf
@@ -1,4 +1,4 @@
-#!/usr/bin/python -b
+#!/usr/bin/env -S python -b
# Copyright 1999-2014 Gentoo Foundation
# Distributed under the terms of the GNU General Public License v2

diff --git a/bin/binhost-snapshot b/bin/binhost-snapshot
index d677e7568..3726bb20a 100755
--- a/bin/binhost-snapshot
+++ b/bin/binhost-snapshot
@@ -1,4 +1,4 @@
-#!/usr/bin/python -b
+#!/usr/bin/env -S python -b
# Copyright 2010-2014 Gentoo Foundation
# Distributed under the terms of the GNU General Public License v2

diff --git a/bin/check-implicit-pointer-usage.py b/bin/check-implicit-pointer-usage.py
index a49db8107..5b3cec019 100755
--- a/bin/check-implicit-pointer-usage.py
+++ b/bin/check-implicit-pointer-usage.py
@@ -1,4 +1,4 @@
-#!/usr/bin/python -b
+#!/usr/bin/env -S python -b

# Ripped from HP and updated from Debian
# Update by Gentoo to support unicode output
diff --git a/bin/chmod-lite.py b/bin/chmod-lite.py
index 177be7eab..c34c68912 100755
--- a/bin/chmod-lite.py
+++ b/bin/chmod-lite.py
@@ -1,4 +1,4 @@
-#!/usr/bin/python -b
+#!/usr/bin/env -S python -b
# Copyright 2015 Gentoo Foundation
# Distributed under the terms of the GNU General Public License v2

diff --git a/bin/chpathtool.py b/bin/chpathtool.py
index fbd18b987..fb438e5ba 100755
--- a/bin/chpathtool.py
+++ b/bin/chpathtool.py
@@ -1,4 +1,4 @@
-#!/usr/bin/python -b
+#!/usr/bin/env -S python -b
# Copyright 2011-2014 Gentoo Foundation
# Distributed under the terms of the GNU General Public License v2

diff --git a/bin/clean_locks b/bin/clean_locks
index 94ba4c606..c62d10b94 100755
--- a/bin/clean_locks
+++ b/bin/clean_locks
@@ -1,4 +1,4 @@
-#!/usr/bin/python -b
+#!/usr/bin/env -S python -b
# Copyright 1999-2014 Gentoo Foundation
# Distributed under the terms of the GNU General Public License v2

diff --git a/bin/dispatch-conf b/bin/dispatch-conf
index 62ab3f6cc..9d22aae72 100755
--- a/bin/dispatch-conf
+++ b/bin/dispatch-conf
@@ -1,4 +1,4 @@
-#!/usr/bin/python -b
+#!/usr/bin/env -S python -b
# Copyright 1999-2019 Gentoo Authors
# Distributed under the terms of the GNU General Public License v2

diff --git a/bin/dohtml.py b/bin/dohtml.py
index dfcaa6026..8505134c5 100755
--- a/bin/dohtml.py
+++ b/bin/dohtml.py
@@ -1,4 +1,4 @@
-#!/usr/bin/python -b
+#!/usr/bin/env -S python -b
# Copyright 1999-2014 Gentoo Foundation
# Distributed under the terms of the GNU General Public License v2

diff --git a/bin/doins.py b/bin/doins.py
index 6bc30c90b..8de480e81 100644
--- a/bin/doins.py
+++ b/bin/doins.py
@@ -1,4 +1,4 @@
-#!/usr/bin/python -b
+#!/usr/bin/env -S python -b
# Copyright 2017 Gentoo Foundation
# Distributed under the terms of the GNU General Public License v2
#
diff --git a/bin/ebuild b/bin/ebuild
index 460aa0fd1..fbc6ad177 100755
--- a/bin/ebuild
+++ b/bin/ebuild
@@ -1,4 +1,4 @@
-#!/usr/bin/python -b
+#!/usr/bin/env -S python -b
# Copyright 1999-2019 Gentoo Authors
# Distributed under the terms of the GNU General Public License v2

diff --git a/bin/ebuild-ipc.py b/bin/ebuild-ipc.py
index d68d3f05e..02b59f5ef 100755
--- a/bin/ebuild-ipc.py
+++ b/bin/ebuild-ipc.py
@@ -1,4 +1,4 @@
-#!/usr/bin/python -b
+#!/usr/bin/env -S python -b
# Copyright 2010-2018 Gentoo Foundation
# Distributed under the terms of the GNU General Public License v2
#
diff --git a/bin/egencache b/bin/egencache
index d172319f8..6fb2fe0fe 100755
--- a/bin/egencache
+++ b/bin/egencache
@@ -1,4 +1,4 @@
-#!/usr/bin/python -b
+#!/usr/bin/env -S python -b
# Copyright 2009-2015 Gentoo Foundation
# Distributed under the terms of the GNU General Public License v2

diff --git a/bin/emaint b/bin/emaint
index df904f7c0..d07e0d022 100755
--- a/bin/emaint
+++ b/bin/emaint
@@ -1,4 +1,4 @@
-#!/usr/bin/python -b
+#!/usr/bin/env -S python -b
# Copyright 2005-2014 Gentoo Foundation
# Distributed under the terms of the GNU General Public License v2

diff --git a/bin/emerge b/bin/emerge
index e372f5e9e..08f92b