Re: [gentoo-portage-dev] Speeding up Tree Verification

2020-07-01 Thread Sid Spry
On Wed, Jul 1, 2020, at 1:40 AM, Fabian Groffen wrote:
> On 30-06-2020 13:13:29 -0500, Sid Spry wrote:
> > On Tue, Jun 30, 2020, at 1:20 AM, Fabian Groffen wrote:
> > > Hi,
> > > 
> > > On 29-06-2020 21:13:43 -0500, Sid Spry wrote:
> > > > Hello,
> > > > 
> > > > I have some runnable pseudocode outlining a faster tree verification 
> > > > algorithm.
> > > > Before I create patches I'd like to see if there is any guidance on 
> > > > making the
> > > > changes as unobtrusive as possible. If the radical change in algorithm 
> > > > is
> > > > acceptable I can work on adding the changes.
> > > > 
> > > > Instead of composing any kind of structured data out of the portage 
> > > > tree my
> > > > algorithm just lists all files and then optionally batches them out to 
> > > > threads.
> > > > There is a noticeable speedup by eliding the tree traversal operations 
> > > > which
> > > > can be seen when running the algorithm with a single thread and 
> > > > comparing it to
> > > > the current algorithm in gemato (which should still be discussed here?).
> > > 
> > > I remember something that gemato used to use multiple threads, but
> > > because it totally saturated disk-IO, it was brought back to a single
> > > thread.  People were complaining about unusable systems.
> > > 
> > 
> > I think this is an argument for cgroups limits support on the portage 
> > process or
> > account as opposed to an argument against picking a better algorithm. That 
> > is
> > something I have been working towards, but I am only one man.
> 
> But this requires a) cgroups support, and b) the privileges to use it.
> Shouldn't be a problem in the normal case, but just saying.
> 

cgroups kernel support is a fairly common dependency. It can obviously be 
optional,
I am thinking related to MAKEOPTS or EMERGE_DEFAULT_OPTS (see: rustc/cargo not
respecting or being passed -j/-l as another use for cgroups) and supported 
best-effort,
but is there any reason to expect it to not be enabled?

If the user isn't either root or portage I think it reasonable to leave 
resource management
to the machine's administrator.



Re: [gentoo-portage-dev] Speeding up Tree Verification

2020-07-01 Thread Sid Spry
On Wed, Jul 1, 2020, at 1:40 AM, Fabian Groffen wrote:
> On 30-06-2020 13:13:29 -0500, Sid Spry wrote:
> > On Tue, Jun 30, 2020, at 1:20 AM, Fabian Groffen wrote:
> > > Hi,
> > > 
> > > On 29-06-2020 21:13:43 -0500, Sid Spry wrote:
> > > > Hello,
> > > > 
> > > > I have some runnable pseudocode outlining a faster tree verification 
> > > > algorithm.
> > > > Before I create patches I'd like to see if there is any guidance on 
> > > > making the
> > > > changes as unobtrusive as possible. If the radical change in algorithm 
> > > > is
> > > > acceptable I can work on adding the changes.
> > > > 
> > > > Instead of composing any kind of structured data out of the portage 
> > > > tree my
> > > > algorithm just lists all files and then optionally batches them out to 
> > > > threads.
> > > > There is a noticeable speedup by eliding the tree traversal operations 
> > > > which
> > > > can be seen when running the algorithm with a single thread and 
> > > > comparing it to
> > > > the current algorithm in gemato (which should still be discussed here?).
> > > 
> > > I remember something that gemato used to use multiple threads, but
> > > because it totally saturated disk-IO, it was brought back to a single
> > > thread.  People were complaining about unusable systems.
> > > 
> > 
> > I think this is an argument for cgroups limits support on the portage 
> > process or
> > account as opposed to an argument against picking a better algorithm. That 
> > is
> > something I have been working towards, but I am only one man.
> 
> But this requires a) cgroups support, and b) the privileges to use it.
> Shouldn't be a problem in the normal case, but just saying.
> 
> > > In any case, can you share your performance results?  What speedup did
> > > you see, on warm and hot FS caches?  Which type of disk do you use?
> > > 
> > 
> > I ran all tests multiple times to make them warm off of a Samsung SSD, but
> > nothing very precise yet.
> > 
> > % gemato verify --openpgp-key signkey.asc /var/db/repos/gentoo
> > [...]
> > INFO:root:Verifying /var/db/repos/gentoo...
> > INFO:root:/var/db/repos/gentoo verified in 16.45 seconds
> > 
> > sometimes going higher, closer to 18s, vs.
> > 
> > % ./veriftree.py
> > 4.763171965983929
> > 
> > So roughly an order of magnitude speedup without batching to threads.
> 
> That is kind of a change.  Makes one wonder if you really did the same
> work.
> 

That was my initial reaction. I attempted to ensure I was processing all of
the files that gemato processed. The full output of my script is something
closer to:

% ./veriftree.py
x.xx
192157
126237

The first number being the time, the second the total number of manifest 
directives, 
and the third being the number of real files in the tree. If you prune the 
directives
that correspond to no file you end up with an exact match IIRC.

However, you are right, and I think this is old code. gemato times the manifest 
file
parsing as well as the verification. It seems this change is not in the code I
provided. If I do that instead, I get:

% ./veriftree.py
11.708862617029808
192157
126237

With corresponding times for gemato (at same system state, etc) being ~20s. So 
it
is a halving at worst with assured n-core speedup for 1/2 of that time, and I am
fairly confident I can speed up the manifest parsing even more as well.

> > > You could compare against qmanifest, which uses OpenMP-based
> > > paralllelism while verifying the tree.  On SSDs this does help.
> > > 
> > 
> > I lost my notes -- how do I specify to either gemato or qmanifest the GnuPG
> > directory? My code is partially structured as it is because I had problems 
> > doing
> > this. I rediscovered -K/--openpgp-key in gemato but am unsure for qmanifest.
> 
> qmanifest doesn't do much magic out of the standard gnupg practices.
> (It is using gpgme.)  If you want it to use a different gnupg dir, you
> may change HOME, or GNUPGHOME.
> 

Alright, I will attempt to set that. I think I like the interface of gemato a 
little more
but will look at qmanifest and see how it performs.



Re: [gentoo-portage-dev] Speeding up Tree Verification

2020-07-01 Thread Michał Górny
On Tue, 2020-06-30 at 16:51 -0500, Sid Spry wrote:
> On Tue, Jun 30, 2020, at 2:29 PM, Michał Górny wrote:
> > On Tue, 2020-06-30 at 12:50 -0500, Sid Spry wrote:
> > > On Tue, Jun 30, 2020, at 2:28 AM, Michał Górny wrote:
> > > > Dnia June 30, 2020 2:13:43 AM UTC, Sid Spry  napisał(a):
> > > > > Hello,
> > > > > 
> > > > > I have some runnable pseudocode outlining a faster tree verification
> > > > > algorithm.
> > > > > Before I create patches I'd like to see if there is any guidance on
> > > > > making the
> > > > > changes as unobtrusive as possible. If the radical change in algorithm
> > > > > is
> > > > > acceptable I can work on adding the changes.
> > > > > 
> > > > > Instead of composing any kind of structured data out of the portage
> > > > > tree my
> > > > > algorithm just lists all files and then optionally batches them out to
> > > > > threads.
> > > > > There is a noticeable speedup by eliding the tree traversal operations
> > > > > which
> > > > > can be seen when running the algorithm with a single thread and
> > > > > comparing it to
> > > > > the current algorithm in gemato (which should still be discussed
> > > > > here?).
> > > > 
> > > > Without reading the code: does your algorithm correctly detect 
> > > > extraneous files?
> > > > 
> > > 
> > > Yes and no.
> > > 
> > > I am not sure why this is necessary. If the file does not appear in a 
> > > manifest it is
> > > ignored. It makes the most sense to me to put the burden of not including
> > > untracked files on the publisher. If the user puts an untracked file into 
> > > the tree it
> > > will be ignored to no consequence; the authored files don't refer to it, 
> > > after all.
> > 
> > This is necessary because a malicious third party can MITM you an rsync
> > tree with extraneous files (say, -r1 baselayout ebuild) that do horrible
> > things on your system.  If you don't reject files not in Manifest, you
> > open a huge security hole.
> > 
> 
> Ok, I will refer to https://www.gentoo.org/glep/glep-0074.html and implement 
> the
> checks in detail, but will still need to spend some time looking for the best 
> place
> to insert the code.
> 
> I think it best to address this from two fronts. On one hand rejecting extra 
> files
> seems to have immediate benefit but the larger issue is portage exposing
> untracked potentially malicious files to the user.

I've pushed two branches with two exploits I could think of.  Please
test your tools against them.  In both cases, the directory should be
rejected without any ill effects.

https://github.com/mgorny/glep74-test-cases

> 
> Has anything like a verity loopback filesystem been explored? It might reduce
> duplication of work.

I don't understand what a 'verity loopback filesystem' is.  Could you elaborate?

-- 
Best regards,
Michał Górny



signature.asc
Description: This is a digitally signed message part


Re: [gentoo-portage-dev] Speeding up Tree Verification

2020-07-01 Thread Fabian Groffen
On 30-06-2020 13:13:29 -0500, Sid Spry wrote:
> On Tue, Jun 30, 2020, at 1:20 AM, Fabian Groffen wrote:
> > Hi,
> > 
> > On 29-06-2020 21:13:43 -0500, Sid Spry wrote:
> > > Hello,
> > > 
> > > I have some runnable pseudocode outlining a faster tree verification 
> > > algorithm.
> > > Before I create patches I'd like to see if there is any guidance on 
> > > making the
> > > changes as unobtrusive as possible. If the radical change in algorithm is
> > > acceptable I can work on adding the changes.
> > > 
> > > Instead of composing any kind of structured data out of the portage tree 
> > > my
> > > algorithm just lists all files and then optionally batches them out to 
> > > threads.
> > > There is a noticeable speedup by eliding the tree traversal operations 
> > > which
> > > can be seen when running the algorithm with a single thread and comparing 
> > > it to
> > > the current algorithm in gemato (which should still be discussed here?).
> > 
> > I remember something that gemato used to use multiple threads, but
> > because it totally saturated disk-IO, it was brought back to a single
> > thread.  People were complaining about unusable systems.
> > 
> 
> I think this is an argument for cgroups limits support on the portage process 
> or
> account as opposed to an argument against picking a better algorithm. That is
> something I have been working towards, but I am only one man.

But this requires a) cgroups support, and b) the privileges to use it.
Shouldn't be a problem in the normal case, but just saying.

> > In any case, can you share your performance results?  What speedup did
> > you see, on warm and hot FS caches?  Which type of disk do you use?
> > 
> 
> I ran all tests multiple times to make them warm off of a Samsung SSD, but
> nothing very precise yet.
> 
> % gemato verify --openpgp-key signkey.asc /var/db/repos/gentoo
> [...]
> INFO:root:Verifying /var/db/repos/gentoo...
> INFO:root:/var/db/repos/gentoo verified in 16.45 seconds
> 
> sometimes going higher, closer to 18s, vs.
> 
> % ./veriftree.py
> 4.763171965983929
> 
> So roughly an order of magnitude speedup without batching to threads.

That is kind of a change.  Makes one wonder if you really did the same
work.

> > You could compare against qmanifest, which uses OpenMP-based
> > paralllelism while verifying the tree.  On SSDs this does help.
> > 
> 
> I lost my notes -- how do I specify to either gemato or qmanifest the GnuPG
> directory? My code is partially structured as it is because I had problems 
> doing
> this. I rediscovered -K/--openpgp-key in gemato but am unsure for qmanifest.

qmanifest doesn't do much magic out of the standard gnupg practices.
(It is using gpgme.)  If you want it to use a different gnupg dir, you
may change HOME, or GNUPGHOME.

Thanks,
Fabian

-- 
Fabian Groffen
Gentoo on a different level


signature.asc
Description: PGP signature


Re: [gentoo-portage-dev] Speeding up Tree Verification

2020-06-30 Thread Sid Spry
On Tue, Jun 30, 2020, at 2:29 PM, Michał Górny wrote:
> On Tue, 2020-06-30 at 12:50 -0500, Sid Spry wrote:
> > On Tue, Jun 30, 2020, at 2:28 AM, Michał Górny wrote:
> > > Dnia June 30, 2020 2:13:43 AM UTC, Sid Spry  napisał(a):
> > > > Hello,
> > > > 
> > > > I have some runnable pseudocode outlining a faster tree verification
> > > > algorithm.
> > > > Before I create patches I'd like to see if there is any guidance on
> > > > making the
> > > > changes as unobtrusive as possible. If the radical change in algorithm
> > > > is
> > > > acceptable I can work on adding the changes.
> > > > 
> > > > Instead of composing any kind of structured data out of the portage
> > > > tree my
> > > > algorithm just lists all files and then optionally batches them out to
> > > > threads.
> > > > There is a noticeable speedup by eliding the tree traversal operations
> > > > which
> > > > can be seen when running the algorithm with a single thread and
> > > > comparing it to
> > > > the current algorithm in gemato (which should still be discussed
> > > > here?).
> > > 
> > > Without reading the code: does your algorithm correctly detect extraneous 
> > > files?
> > > 
> > 
> > Yes and no.
> > 
> > I am not sure why this is necessary. If the file does not appear in a 
> > manifest it is
> > ignored. It makes the most sense to me to put the burden of not including
> > untracked files on the publisher. If the user puts an untracked file into 
> > the tree it
> > will be ignored to no consequence; the authored files don't refer to it, 
> > after all.
> 
> This is necessary because a malicious third party can MITM you an rsync
> tree with extraneous files (say, -r1 baselayout ebuild) that do horrible
> things on your system.  If you don't reject files not in Manifest, you
> open a huge security hole.
> 

Ok, I will refer to https://www.gentoo.org/glep/glep-0074.html and implement the
checks in detail, but will still need to spend some time looking for the best 
place
to insert the code.

I think it best to address this from two fronts. On one hand rejecting extra 
files
seems to have immediate benefit but the larger issue is portage exposing
untracked potentially malicious files to the user.

Has anything like a verity loopback filesystem been explored? It might reduce
duplication of work.

> > But it would be easy enough to build a second list of all files and compare 
> > it to
> > the list of files built from the manifests. If there are extras an error 
> > can be
> > generated. This is actually the first test I did on my manifest parsing 
> > code. I tried
> > to see if my tracked files roughly matched the total files in tree. That 
> > can be
> > repurposed for this check.
> > 
> > > > Some simple tests like counting all objects traversed and verified
> > > > returns the
> > > > same(ish). Once it is put into portage it could be tested in detail.
> > > > 
> > > > There is also my partial attempt at removing the brittle interface to
> > > > GnuPG
> > > > (it's not as if the current code is badly designed, just that parsing
> > > > the
> > > > output of GnuPG directly is likely not the best idea).
> > > 
> > > The 'brittle interface' is well-defined machine-readable output.
> > > 
> > 
> > Ok. I was aware there was a machine interface, but the classes that 
> > manipulate
> > a temporary GPG home seemed like not the best solution. I guess that is all
> > due to GPG assuming everything is in ~/.gnupg and keeping its state as a
> > directory structure.
> 
> A temporary home directory guarantees that user configuration does not
> affect the verification result.
> 

Yes, I know why it is there. The temporary construction of the directory is what
stood out to me as messy but I guess there is no way around it.

> > 
> > > > Needs gemato, dnspython, and requests. Slightly better than random code
> > > > because
> > > > I took inspiration from the existing gemato classes.
> > > 
> > > The code makes a lot of brittle assumptions about the structure. The 
> > > GLEP was specifically designed to avoid that and let us adjust the 
> > > structure in the future to meet our needs.
> > > 
> > 
> > These same assumptions are built into the code that operates on the
> > tree structure. If the GLEP were changed the existing code would also
> > potentially need changing. This code just uses the structure in a different
> > way.
> > 
> 
> The code that predates the GLEP, yes.  It will eventually be changed to
> be more flexible, especially when we can assume that we start removing
> backwards compatibility.
> 

My comment was more on the general nature of the code having to track
what is intended. Even if parsing code (involving the filesystem?) was
generated from a specification now the specification is "code" and still may
not reflect what is intended.

Taking liberties with the programmatic representation of the manifests
seems appropriate. I also don't see anything in the GLEP that dictates an
in-memory representation (and such a thing would 

Re: [gentoo-portage-dev] Speeding up Tree Verification

2020-06-30 Thread Michał Górny
On Tue, 2020-06-30 at 12:50 -0500, Sid Spry wrote:
> On Tue, Jun 30, 2020, at 2:28 AM, Michał Górny wrote:
> > Dnia June 30, 2020 2:13:43 AM UTC, Sid Spry  napisał(a):
> > > Hello,
> > > 
> > > I have some runnable pseudocode outlining a faster tree verification
> > > algorithm.
> > > Before I create patches I'd like to see if there is any guidance on
> > > making the
> > > changes as unobtrusive as possible. If the radical change in algorithm
> > > is
> > > acceptable I can work on adding the changes.
> > > 
> > > Instead of composing any kind of structured data out of the portage
> > > tree my
> > > algorithm just lists all files and then optionally batches them out to
> > > threads.
> > > There is a noticeable speedup by eliding the tree traversal operations
> > > which
> > > can be seen when running the algorithm with a single thread and
> > > comparing it to
> > > the current algorithm in gemato (which should still be discussed
> > > here?).
> > 
> > Without reading the code: does your algorithm correctly detect extraneous 
> > files?
> > 
> 
> Yes and no.
> 
> I am not sure why this is necessary. If the file does not appear in a 
> manifest it is
> ignored. It makes the most sense to me to put the burden of not including
> untracked files on the publisher. If the user puts an untracked file into the 
> tree it
> will be ignored to no consequence; the authored files don't refer to it, 
> after all.

This is necessary because a malicious third party can MITM you an rsync
tree with extraneous files (say, -r1 baselayout ebuild) that do horrible
things on your system.  If you don't reject files not in Manifest, you
open a huge security hole.

> But it would be easy enough to build a second list of all files and compare 
> it to
> the list of files built from the manifests. If there are extras an error can 
> be
> generated. This is actually the first test I did on my manifest parsing code. 
> I tried
> to see if my tracked files roughly matched the total files in tree. That can 
> be
> repurposed for this check.
> 
> > > Some simple tests like counting all objects traversed and verified
> > > returns the
> > > same(ish). Once it is put into portage it could be tested in detail.
> > > 
> > > There is also my partial attempt at removing the brittle interface to
> > > GnuPG
> > > (it's not as if the current code is badly designed, just that parsing
> > > the
> > > output of GnuPG directly is likely not the best idea).
> > 
> > The 'brittle interface' is well-defined machine-readable output.
> > 
> 
> Ok. I was aware there was a machine interface, but the classes that manipulate
> a temporary GPG home seemed like not the best solution. I guess that is all
> due to GPG assuming everything is in ~/.gnupg and keeping its state as a
> directory structure.

A temporary home directory guarantees that user configuration does not
affect the verification result.

> 
> > > Needs gemato, dnspython, and requests. Slightly better than random code
> > > because
> > > I took inspiration from the existing gemato classes.
> > 
> > The code makes a lot of brittle assumptions about the structure. The 
> > GLEP was specifically designed to avoid that and let us adjust the 
> > structure in the future to meet our needs.
> > 
> 
> These same assumptions are built into the code that operates on the
> tree structure. If the GLEP were changed the existing code would also
> potentially need changing. This code just uses the structure in a different
> way.
> 

The code that predates the GLEP, yes.  It will eventually be changed to
be more flexible, especially when we can assume that we start removing
backwards compatibility.

-- 
Best regards,
Michał Górny



signature.asc
Description: This is a digitally signed message part


Re: [gentoo-portage-dev] Speeding up Tree Verification

2020-06-30 Thread Sid Spry
On Tue, Jun 30, 2020, at 1:20 AM, Fabian Groffen wrote:
> Hi,
> 
> On 29-06-2020 21:13:43 -0500, Sid Spry wrote:
> > Hello,
> > 
> > I have some runnable pseudocode outlining a faster tree verification 
> > algorithm.
> > Before I create patches I'd like to see if there is any guidance on making 
> > the
> > changes as unobtrusive as possible. If the radical change in algorithm is
> > acceptable I can work on adding the changes.
> > 
> > Instead of composing any kind of structured data out of the portage tree my
> > algorithm just lists all files and then optionally batches them out to 
> > threads.
> > There is a noticeable speedup by eliding the tree traversal operations which
> > can be seen when running the algorithm with a single thread and comparing 
> > it to
> > the current algorithm in gemato (which should still be discussed here?).
> 
> I remember something that gemato used to use multiple threads, but
> because it totally saturated disk-IO, it was brought back to a single
> thread.  People were complaining about unusable systems.
> 

I think this is an argument for cgroups limits support on the portage process or
account as opposed to an argument against picking a better algorithm. That is
something I have been working towards, but I am only one man.

> In any case, can you share your performance results?  What speedup did
> you see, on warm and hot FS caches?  Which type of disk do you use?
> 

I ran all tests multiple times to make them warm off of a Samsung SSD, but
nothing very precise yet.

% gemato verify --openpgp-key signkey.asc /var/db/repos/gentoo
[...]
INFO:root:Verifying /var/db/repos/gentoo...
INFO:root:/var/db/repos/gentoo verified in 16.45 seconds

sometimes going higher, closer to 18s, vs.

% ./veriftree.py
4.763171965983929

So roughly an order of magnitude speedup without batching to threads.

> You could compare against qmanifest, which uses OpenMP-based
> paralllelism while verifying the tree.  On SSDs this does help.
> 

I lost my notes -- how do I specify to either gemato or qmanifest the GnuPG
directory? My code is partially structured as it is because I had problems doing
this. I rediscovered -K/--openpgp-key in gemato but am unsure for qmanifest.



Re: [gentoo-portage-dev] Speeding up Tree Verification

2020-06-30 Thread Sid Spry
On Tue, Jun 30, 2020, at 2:28 AM, Michał Górny wrote:
> Dnia June 30, 2020 2:13:43 AM UTC, Sid Spry  napisał(a):
> >Hello,
> >
> >I have some runnable pseudocode outlining a faster tree verification
> >algorithm.
> >Before I create patches I'd like to see if there is any guidance on
> >making the
> >changes as unobtrusive as possible. If the radical change in algorithm
> >is
> >acceptable I can work on adding the changes.
> >
> >Instead of composing any kind of structured data out of the portage
> >tree my
> >algorithm just lists all files and then optionally batches them out to
> >threads.
> >There is a noticeable speedup by eliding the tree traversal operations
> >which
> >can be seen when running the algorithm with a single thread and
> >comparing it to
> >the current algorithm in gemato (which should still be discussed
> >here?).
> 
> Without reading the code: does your algorithm correctly detect extraneous 
> files?
> 

Yes and no.

I am not sure why this is necessary. If the file does not appear in a manifest 
it is
ignored. It makes the most sense to me to put the burden of not including
untracked files on the publisher. If the user puts an untracked file into the 
tree it
will be ignored to no consequence; the authored files don't refer to it, after 
all.

But it would be easy enough to build a second list of all files and compare it 
to
the list of files built from the manifests. If there are extras an error can be
generated. This is actually the first test I did on my manifest parsing code. I 
tried
to see if my tracked files roughly matched the total files in tree. That can be
repurposed for this check.

> >Some simple tests like counting all objects traversed and verified
> >returns the
> >same(ish). Once it is put into portage it could be tested in detail.
> >
> >There is also my partial attempt at removing the brittle interface to
> >GnuPG
> >(it's not as if the current code is badly designed, just that parsing
> >the
> >output of GnuPG directly is likely not the best idea).
> 
> The 'brittle interface' is well-defined machine-readable output.
>

Ok. I was aware there was a machine interface, but the classes that manipulate
a temporary GPG home seemed like not the best solution. I guess that is all
due to GPG assuming everything is in ~/.gnupg and keeping its state as a
directory structure.

> >
> >Needs gemato, dnspython, and requests. Slightly better than random code
> >because
> >I took inspiration from the existing gemato classes.
> 
> The code makes a lot of brittle assumptions about the structure. The 
> GLEP was specifically designed to avoid that and let us adjust the 
> structure in the future to meet our needs.
> 

These same assumptions are built into the code that operates on the
tree structure. If the GLEP were changed the existing code would also
potentially need changing. This code just uses the structure in a different
way.

I will admit my partial understanding of the entire GLEP. I made some
simplifications just to get something demonstrable done. However, please
consider removing or putting some of the checks elsewhere. I don't have
full suggestions right now, but there is the possibility of saving an
appreciable amount of time.



Re: [gentoo-portage-dev] Speeding up Tree Verification

2020-06-30 Thread Pacho Ramos
El mar, 30-06-2020 a las 08:20 +0200, Fabian Groffen escribió:
> Hi,
> 
> On 29-06-2020 21:13:43 -0500, Sid Spry wrote:
> > Hello,
> > 
> > I have some runnable pseudocode outlining a faster tree verification
> > algorithm.
> > Before I create patches I'd like to see if there is any guidance on making
> > the
> > changes as unobtrusive as possible. If the radical change in algorithm is
> > acceptable I can work on adding the changes.
> > 
> > Instead of composing any kind of structured data out of the portage tree my
> > algorithm just lists all files and then optionally batches them out to
> > threads.
> > There is a noticeable speedup by eliding the tree traversal operations which
> > can be seen when running the algorithm with a single thread and comparing it
> > to
> > the current algorithm in gemato (which should still be discussed here?).
> 
> I remember something that gemato used to use multiple threads, but
> because it totally saturated disk-IO, it was brought back to a single
> thread.  People were complaining about unusable systems.
> 
> In any case, can you share your performance results?  What speedup did
> you see, on warm and hot FS caches?  Which type of disk do you use?
> 
> You could compare against qmanifest, which uses OpenMP-based
> paralllelism while verifying the tree.  On SSDs this does help.
> 
> Thanks,
> Fabian

I am talking only based on my own experience but, at least on my case, there are
noticeable differences between I/O schedulers. That is something that maybe
could affect that tests... with latest kernels we only have noop, deadline
(default) and bfq... in my case bfq works far better for keeping the system
responsible (on both, rotational and SSDs). But you can test different
combinations to see the one that fits better for you.



signature.asc
Description: This is a digitally signed message part


Re: [gentoo-portage-dev] Speeding up Tree Verification

2020-06-30 Thread Michał Górny
Dnia June 30, 2020 6:20:37 AM UTC, Fabian Groffen  
napisał(a):
>Hi,
>
>On 29-06-2020 21:13:43 -0500, Sid Spry wrote:
>> Hello,
>> 
>> I have some runnable pseudocode outlining a faster tree verification
>algorithm.
>> Before I create patches I'd like to see if there is any guidance on
>making the
>> changes as unobtrusive as possible. If the radical change in
>algorithm is
>> acceptable I can work on adding the changes.
>> 
>> Instead of composing any kind of structured data out of the portage
>tree my
>> algorithm just lists all files and then optionally batches them out
>to threads.
>> There is a noticeable speedup by eliding the tree traversal
>operations which
>> can be seen when running the algorithm with a single thread and
>comparing it to
>> the current algorithm in gemato (which should still be discussed
>here?).
>
>I remember something that gemato used to use multiple threads, but
>because it totally saturated disk-IO, it was brought back to a single
>thread.  People were complaining about unusable systems.

No, it gave significant speedup even on spinning HDDs. However, it hang on some 
people due to some bug that I couldn't reproduce.

>
>In any case, can you share your performance results?  What speedup did
>you see, on warm and hot FS caches?  Which type of disk do you use?
>
>You could compare against qmanifest, which uses OpenMP-based
>paralllelism while verifying the tree.  On SSDs this does help.
>
>Thanks,
>Fabian
>
>> 
>> Some simple tests like counting all objects traversed and verified
>returns the
>> same(ish). Once it is put into portage it could be tested in detail.
>> 
>> There is also my partial attempt at removing the brittle interface to
>GnuPG
>> (it's not as if the current code is badly designed, just that parsing
>the
>> output of GnuPG directly is likely not the best idea).
>> 
>> Needs gemato, dnspython, and requests. Slightly better than random
>code because
>> I took inspiration from the existing gemato classes.
>> 
>> ```python (veriftree.py)
>> #!/usr/bin/env python3
>> import os, sys, zlib, hashlib, tempfile, shutil, timeit
>> import subprocess
>> from typing import List
>> from pprint import pprint
>> 
>> from gemato.manifest import (
>> ManifestFile,
>> ManifestFileEntry,
>> )
>> from wkd import (
>> check_domain_signature,
>> hash_localpart,
>> build_web_key_uri,
>> stream_to_file
>> )
>> from fetchmedia import (
>> OpenPGPEnvironment,
>> setup_verification_environment
>> )
>> 
>> # 0. Top level directory (repository) contains Manifest, a PGP
>signature of
>> #blake2b and sha512 hashes of Manifest.files.gz.
>> # 1. Manifest.files contains hashes of each category Manifest.gz.
>> # 2. The category Manifest contains hashes of each package Manifest.
>> # 3. The package Manifest contains hashes of each package file.
>> #Must be aware of PMS, e.g. aux tag specifies a file in files/.
>> 
>> # 0. Check signature of repo Manifest.
>> # 1. Merge items in Manifest.files, each category Manifest, and each
>package
>> #Manifest into one big list. The path must be made absolute.
>> # 2. Distribute items to threads.
>> 
>> # To check operation compare directory tree to files appearing in all
>> # ManifestRecords.
>> 
>> class ManifestTree(object):
>> __slots__ = ['_directory', '_manifest_list', '_manifest_records',
>> '_manifest_results']
>> 
>> def __init__(self, directory: str):
>> self._directory = directory
>> # Tuples of (base_path, full_path).
>> self._manifest_list = []
>> self._manifest_records = []
>> self._manifest_results = []
>> 
>> def build_manifest_list(self):
>> for path, dirs, files in os.walk(self._directory):
>> #if 'glsa' in path or 'news' in path:
>> #if 'metadata' in path:
>> #continue # Skip the metadata directory for now.
>> # It contains a repository. Current algo barfs on
>Manifest
>> # containing only sig.
>> 
>> if 'Manifest.files.gz' in files:
>> self._manifest_list += [(path, path +
>'/Manifest.files.gz')]
>> if 'Manifest.gz' in files:
>> self._manifest_list += [(path, path +
>'/Manifest.gz')]
>> 
>> if path == self._directory:
>> continue # Skip the repo manifest. Order matters, fix
>eventually.
>> if 'Manifest' in files:
>> self._manifest_list += [(path, path + '/Manifest')]
>> 
>> def parse_manifests(self):
>> td = tempfile.TemporaryDirectory(dir='./')
>> for manifest in self._manifest_list:
>> def inner():
>> if manifest[1].endswith('.gz'):
>> name = 'Manifest.files' # Need to also handle
>Manifest.gz.
>> path = '{0}/{1}'.format(td.name, name)
>> subprocess.run(['sh', '-c', 'gunzip -c {0} > {1}'
>> 

Re: [gentoo-portage-dev] Speeding up Tree Verification

2020-06-30 Thread Michał Górny
Dnia June 30, 2020 2:13:43 AM UTC, Sid Spry  napisał(a):
>Hello,
>
>I have some runnable pseudocode outlining a faster tree verification
>algorithm.
>Before I create patches I'd like to see if there is any guidance on
>making the
>changes as unobtrusive as possible. If the radical change in algorithm
>is
>acceptable I can work on adding the changes.
>
>Instead of composing any kind of structured data out of the portage
>tree my
>algorithm just lists all files and then optionally batches them out to
>threads.
>There is a noticeable speedup by eliding the tree traversal operations
>which
>can be seen when running the algorithm with a single thread and
>comparing it to
>the current algorithm in gemato (which should still be discussed
>here?).

Without reading the code: does your algorithm correctly detect extraneous files?

>
>Some simple tests like counting all objects traversed and verified
>returns the
>same(ish). Once it is put into portage it could be tested in detail.
>
>There is also my partial attempt at removing the brittle interface to
>GnuPG
>(it's not as if the current code is badly designed, just that parsing
>the
>output of GnuPG directly is likely not the best idea).

The 'brittle interface' is well-defined machine-readable output.

>
>Needs gemato, dnspython, and requests. Slightly better than random code
>because
>I took inspiration from the existing gemato classes.

The code makes a lot of brittle assumptions about the structure. The GLEP was 
specifically designed to avoid that and let us adjust the structure in the 
future to meet our needs.

>
>```python (veriftree.py)
>#!/usr/bin/env python3
>import os, sys, zlib, hashlib, tempfile, shutil, timeit
>import subprocess
>from typing import List
>from pprint import pprint
>
>from gemato.manifest import (
>ManifestFile,
>ManifestFileEntry,
>)
>from wkd import (
>check_domain_signature,
>hash_localpart,
>build_web_key_uri,
>stream_to_file
>)
>from fetchmedia import (
>OpenPGPEnvironment,
>setup_verification_environment
>)
>
># 0. Top level directory (repository) contains Manifest, a PGP
>signature of
>#blake2b and sha512 hashes of Manifest.files.gz.
># 1. Manifest.files contains hashes of each category Manifest.gz.
># 2. The category Manifest contains hashes of each package Manifest.
># 3. The package Manifest contains hashes of each package file.
>#Must be aware of PMS, e.g. aux tag specifies a file in files/.
>
># 0. Check signature of repo Manifest.
># 1. Merge items in Manifest.files, each category Manifest, and each
>package
>#Manifest into one big list. The path must be made absolute.
># 2. Distribute items to threads.
>
># To check operation compare directory tree to files appearing in all
># ManifestRecords.
>
>class ManifestTree(object):
>__slots__ = ['_directory', '_manifest_list', '_manifest_records',
>'_manifest_results']
>
>def __init__(self, directory: str):
>self._directory = directory
># Tuples of (base_path, full_path).
>self._manifest_list = []
>self._manifest_records = []
>self._manifest_results = []
>
>def build_manifest_list(self):
>for path, dirs, files in os.walk(self._directory):
>#if 'glsa' in path or 'news' in path:
>#if 'metadata' in path:
>#continue # Skip the metadata directory for now.
> # It contains a repository. Current algo barfs on Manifest
># containing only sig.
>
>if 'Manifest.files.gz' in files:
>   self._manifest_list += [(path, path + '/Manifest.files.gz')]
>if 'Manifest.gz' in files:
>self._manifest_list += [(path, path + '/Manifest.gz')]
>
>if path == self._directory:
>  continue # Skip the repo manifest. Order matters, fix eventually.
>if 'Manifest' in files:
>self._manifest_list += [(path, path + '/Manifest')]
>
>def parse_manifests(self):
>td = tempfile.TemporaryDirectory(dir='./')
>for manifest in self._manifest_list:
>def inner():
>if manifest[1].endswith('.gz'):
> name = 'Manifest.files' # Need to also handle Manifest.gz.
>path = '{0}/{1}'.format(td.name, name)
>subprocess.run(['sh', '-c', 'gunzip -c {0} > {1}'
>.format(manifest[1], path)])
>for line in open(path):
>mr = ManifestRecord(line)
>mr.make_absolute(manifest[0])
>self._manifest_records += [mr]
>else:
>for line in open(manifest[1]):
>if line.startswith('-'):
>return # Skip the signed manifest.
>mr = ManifestRecord(line)
>mr.make_absolute(manifest[0])
>self._manifest_records += [mr]
>   

Re: [gentoo-portage-dev] Speeding up Tree Verification

2020-06-30 Thread Fabian Groffen
Hi,

On 29-06-2020 21:13:43 -0500, Sid Spry wrote:
> Hello,
> 
> I have some runnable pseudocode outlining a faster tree verification 
> algorithm.
> Before I create patches I'd like to see if there is any guidance on making the
> changes as unobtrusive as possible. If the radical change in algorithm is
> acceptable I can work on adding the changes.
> 
> Instead of composing any kind of structured data out of the portage tree my
> algorithm just lists all files and then optionally batches them out to 
> threads.
> There is a noticeable speedup by eliding the tree traversal operations which
> can be seen when running the algorithm with a single thread and comparing it 
> to
> the current algorithm in gemato (which should still be discussed here?).

I remember something that gemato used to use multiple threads, but
because it totally saturated disk-IO, it was brought back to a single
thread.  People were complaining about unusable systems.

In any case, can you share your performance results?  What speedup did
you see, on warm and hot FS caches?  Which type of disk do you use?

You could compare against qmanifest, which uses OpenMP-based
paralllelism while verifying the tree.  On SSDs this does help.

Thanks,
Fabian

> 
> Some simple tests like counting all objects traversed and verified returns the
> same(ish). Once it is put into portage it could be tested in detail.
> 
> There is also my partial attempt at removing the brittle interface to GnuPG
> (it's not as if the current code is badly designed, just that parsing the
> output of GnuPG directly is likely not the best idea).
> 
> Needs gemato, dnspython, and requests. Slightly better than random code 
> because
> I took inspiration from the existing gemato classes.
> 
> ```python (veriftree.py)
> #!/usr/bin/env python3
> import os, sys, zlib, hashlib, tempfile, shutil, timeit
> import subprocess
> from typing import List
> from pprint import pprint
> 
> from gemato.manifest import (
> ManifestFile,
> ManifestFileEntry,
> )
> from wkd import (
> check_domain_signature,
> hash_localpart,
> build_web_key_uri,
> stream_to_file
> )
> from fetchmedia import (
> OpenPGPEnvironment,
> setup_verification_environment
> )
> 
> # 0. Top level directory (repository) contains Manifest, a PGP signature of
> #blake2b and sha512 hashes of Manifest.files.gz.
> # 1. Manifest.files contains hashes of each category Manifest.gz.
> # 2. The category Manifest contains hashes of each package Manifest.
> # 3. The package Manifest contains hashes of each package file.
> #Must be aware of PMS, e.g. aux tag specifies a file in files/.
> 
> # 0. Check signature of repo Manifest.
> # 1. Merge items in Manifest.files, each category Manifest, and each package
> #Manifest into one big list. The path must be made absolute.
> # 2. Distribute items to threads.
> 
> # To check operation compare directory tree to files appearing in all
> # ManifestRecords.
> 
> class ManifestTree(object):
> __slots__ = ['_directory', '_manifest_list', '_manifest_records',
> '_manifest_results']
> 
> def __init__(self, directory: str):
> self._directory = directory
> # Tuples of (base_path, full_path).
> self._manifest_list = []
> self._manifest_records = []
> self._manifest_results = []
> 
> def build_manifest_list(self):
> for path, dirs, files in os.walk(self._directory):
> #if 'glsa' in path or 'news' in path:
> #if 'metadata' in path:
> #continue # Skip the metadata directory for now.
> # It contains a repository. Current algo barfs on Manifest
> # containing only sig.
> 
> if 'Manifest.files.gz' in files:
> self._manifest_list += [(path, path + '/Manifest.files.gz')]
> if 'Manifest.gz' in files:
> self._manifest_list += [(path, path + '/Manifest.gz')]
> 
> if path == self._directory:
> continue # Skip the repo manifest. Order matters, fix 
> eventually.
> if 'Manifest' in files:
> self._manifest_list += [(path, path + '/Manifest')]
> 
> def parse_manifests(self):
> td = tempfile.TemporaryDirectory(dir='./')
> for manifest in self._manifest_list:
> def inner():
> if manifest[1].endswith('.gz'):
> name = 'Manifest.files' # Need to also handle Manifest.gz.
> path = '{0}/{1}'.format(td.name, name)
> subprocess.run(['sh', '-c', 'gunzip -c {0} > {1}'
> .format(manifest[1], path)])
> for line in open(path):
> mr = ManifestRecord(line)
> mr.make_absolute(manifest[0])
> self._manifest_records += [mr]
> else:
> for line in open(manifest[1]):
>