Re: [gentoo-portage-dev] Performance tuning and parallelisation

2021-08-26 Thread Marco Sirabella
Hi Ed,

I've taken a stab at tracking down portage's bottlenecks (though I've stopped
short of solving them for the time being :/ )

> Can anyone give me a leg up on how I could benchmark this further and look
> for the hotspot? Perhaps someone understands the architecture of this part
> more intimately and could point at whether there are opportunities to do some
> of the processing en masse, rather than per file?

From my notes at the time, it looks like
[yappi](https://pypi.org/project/yappi/) worked a bit better for me than Python's
built-in cProfile because it properly dove into async calls. I used
[snakeviz](https://jiffyclub.github.io/snakeviz/) for visualizing the profile
results.
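
In case it's a useful starting point, here is a minimal sketch of wrapping
yappi around an emerge run and saving a pstats-format profile that snakeviz
can open. The emerge script path and the package arguments are illustrative
placeholders only, not something from my actual setup:

    #!/usr/bin/env python3
    # Minimal sketch: profile an emerge invocation with yappi and save a
    # pstats-format profile that snakeviz can open.  The emerge script path
    # and the package arguments below are placeholders for illustration.
    import runpy
    import sys

    import yappi

    EMERGE = "/usr/lib/python-exec/python3.9/emerge"  # adjust to your install

    yappi.set_clock_type("wall")   # wall clock also captures time spent in I/O
    yappi.start(builtins=False)
    try:
        sys.argv = [EMERGE, "-1k", "--nodeps", "openssl"]
        runpy.run_path(EMERGE, run_name="__main__")
    finally:
        yappi.stop()
        yappi.get_func_stats().save("emerge.prof", type="pstat")

The saved file can then be opened with "snakeviz emerge.prof".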

I was taking a look at depclean, and I similarly found that a lot of duplicate
processing was being done because encapsulated abstractions can't communicate
that the same work is being repeated, e.g. a massive JSON structure is
re-processed for every package removed. In the end, though, I opted to work on
the more understandable unicode conversions.

My stalled progress can be found here: 
[#700](https://github.com/gentoo/portage/pull/700). 
Lost the drive to continue for now unfortunately :<

Good luck! Looking forward to your optimizations

-- 
Marco Sirabella




[gentoo-portage-dev] Performance tuning and parallelisation

2021-08-26 Thread Ed W
Hi All

Consider this a tentative first email to test the water, but I have started to
look at the performance of the emerge utility, particularly its install phase,
and I could use some guidance on where to go next.

Firstly, to define the "problem": I have found Gentoo to be a great base for
building custom distributions, and I use it to build a small embedded distro
which runs on a couple of different architectures (essentially just a
"ROOT=/something emerge $some_packages"). However, I use some packaging around
binpackages to avoid unnecessary rebuilds, and this highlights that "building"
a complete install using only binary packages rarely gets over a load of 1. Can
we do better than this? It seems to be highly serialised in the install phase,
where files are copied to disk?

(Note I use parallel build and parallel-install flags, plus --jobs=N. If there
is code to compile then load will shoot up, but simply installing binpackages
struggles to get the load over about 0.7-1.1, so presumably it's
single-threaded in all parts?)


Now, this is particularly noticeable where I cheated to build my arm install
and just used qemu user-mode on an amd64 host (rather than cross-compiling).
Here the install/merge phase of the build clearly consumes much/most of the
install time.

e.g. a random example (under qemu user-mode):

# time ROOT=/tmp/timetest emerge -1k --nodeps openssl

>>> Emerging binary (1 of 1) dev-libs/openssl-1.1.1k-r1::gentoo for /tmp/timetest/
...
real    0m30.145s
user    0m29.066s
sys    0m1.685s


Running the same on the native host takes about 5-6 sec (and I find this ratio
fairly consistent for qemu user-mode: about 5-6x slower than native).

If I pick another package with fewer files, then this 5-6 secs drops,
suggesting (without offering proof) that the bulk of the time here is some
"per file" processing.

Note this machine is a 12-core AMD Ryzen 3900X with SSDs that bench at around
4 GB/s+. So really, 5-6 seconds to install a few files is relatively "slow". As
a random benchmark on this machine, I can back up a 4.5 GB chroot with tar+zstd
in about 4 seconds.


So the question is: I assume that further parallelisation of the install phase
will be difficult, therefore the low-hanging fruit here seems to be the
install/merge phase itself and the question of why there seems to be quite a
bit of CPU used "per file installed". Can anyone give me a leg up on how I
could benchmark this further and look for the hotspot? Perhaps someone
understands the architecture of this part more intimately and could point at
whether there are opportunities to do some of the processing en masse, rather
than per file?
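
(My naive starting point would be something like the sketch below: capture a
profile of the emerge run in pstats format, e.g. with cProfile or similar, and
dump the top entries by cumulative time. But I don't know whether that would
even see inside any processes portage spawns, hence the request for a leg up.
The file name below is just a placeholder.)

    # Naive sketch: list the top functions by cumulative time from a
    # previously captured pstats-format profile.  "emerge.prof" is a
    # placeholder for whatever file the profiler wrote.
    import pstats

    stats = pstats.Stats("emerge.prof")
    stats.strip_dirs().sort_stats("cumulative").print_stats(20)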

I'm not really a Python guru, but I'm interested to poke further to see where
the time is going.


Many thanks

Ed W