On Thu, Aug 26, 2021 at 4:03 AM Ed W <li...@wildgooses.com> wrote:
>
> Hi All
>
> Consider this a tentative first email to test the water, but I have started to look at the performance of the emerge utility, particularly its install phase, and I could use some guidance on where to go next.
To clarify: the 'install' phase installs the package into ${D}; the 'qmerge' phase is the one that merges it to the livefs.

> Firstly, to define the "problem": I have found Gentoo to be a great base for building custom distributions, and I use it to build a small embedded distro which runs on a couple of different architectures (essentially just a "ROOT=/something emerge $some_packages"). However, I use some packaging around binpackages to avoid unnecessary rebuilds, and this highlights that "building" a complete install using only binary packages rarely gets over a load of 1. Can we do better than this? It seems to be highly serialised on the install phase of copying the files to disk?

In terms of parallelism, it's not safe to run multiple phase functions simultaneously. This is a problem in theory and occasionally in practice (recently discussed in #gentoo-dev). The phase functions run arbitrary code that modifies the livefs (pre/post install and rm can touch $ROOT). As an example we observed recently: font ebuilds generate font-related metadata, and if two ebuilds try to generate that metadata at the same time, they can race and cause unexpected results. Sometimes this is caught in the ebuild (e.g. the author wrote something like "rebuild_indexes || die" and the indexer returned non-zero), but it can just as easily result in silent data corruption, particularly if the races go undetected.

> (Note I use parallel build and parallel-install flags, plus --jobs=N. If there is code to compile then the load will shoot up, but simply installing binpackages struggles to get the load over about 0.7-1.1, so presumably it is single-threaded in all parts?)
>
> Now, this is particularly noticeable where I cheated to build my arm install and just used qemu user-mode on an amd64 host (rather than cross-compiling). Here it's very noticeable that the install/merge phase of the build consumes much/most of the install time.
>
> e.g. a random example (under qemu user mode):

I think perhaps a simpler test is to use qmerge (from portage-utils)? If you can use emerge (e.g. in --pretend mode) to generate a package list to merge, you can simply merge it with qmerge; a rough sketch follows your timing example below. I suspect qmerge will be both (a) faster and (b) less safe than emerge, as emerge does a bunch of extra work that you may or may not care about. You could also consider running N qmerges in parallel (again, I am less sure how safe this is, as the writes by qmerge may be racy). Note again that this speed may not come for free, and you may end up with a corrupt image afterwards. I'm not sure whether folks are running qmerge in production like this (maybe others on the list have experience).

> # time ROOT=/tmp/timetest emerge -1k --nodeps openssl
>
> >>> Emerging binary (1 of 1) dev-libs/openssl-1.1.1k-r1::gentoo for /tmp/timetest/
> ...
> real    0m30.145s
> user    0m29.066s
> sys     0m1.685s
>
> Running the same on the native host takes about 5-6 sec (and I find this ratio fairly consistent for qemu user mode: about 5-6x slower than native).
>
> If I pick another package with fewer files, then I see this 5-6 secs drop, suggesting (without offering proof) that the bulk of the time here is some "per file" processing.
>
> Note this machine is a 12-core AMD Ryzen 3900X with SSDs that bench at around 4GB/s+, so 5-6 seconds to install a few files really is relatively "slow". As a random benchmark, on this machine I can back up a 4.5GB chroot with tar+zstd in about 4 seconds.
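To make the qmerge experiment concrete, here is a rough, untested sketch of what I mean (the sed parse of the --pretend output is crude, and the flag spellings are from the qmerge(1) in my copy of portage-utils -- check qmerge --help on your version before trusting any of this):

  # 1. Have emerge compute the package list without merging anything.
  #    With -pkq --nodeps the output is lines like:
  #      [binary  R    ] dev-libs/openssl-1.1.1k-r1
  emerge -pkq --nodeps ${some_packages} \
      | sed -n 's/^\[binary[^]]*\] \([^ ]*\).*/=\1/p' > /tmp/pkglist

  # 2. Merge the resulting atoms with qmerge instead of emerge.
  #    -K: install from binpkg, -O: skip dependencies, -y: don't prompt.
  ROOT=/tmp/timetest xargs qmerge -KOy < /tmp/pkglist

If that turns out to be much faster for your binpkg-only image builds, it at least suggests the overhead is in emerge's per-package bookkeeping rather than in the raw file copying.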
> So the question is: I assume that further parallelisation of the install phase will be difficult, therefore the low-hanging fruit here seems to be the install/merge phase and the question of why there seems to be quite a bit of CPU used "per file installed". Can anyone give me a leg up on how I could benchmark this further and look for the hotspot? Perhaps someone understands the architecture at this point more intimately and could point out whether there are opportunities to do some of the processing en masse, rather than per file?
>
> I'm not really a python guru, but I am interested to poke further to see where the time is going.
>
> Many thanks
>
> Ed W
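On the "where is the time going" question: since emerge is a Python script, one rough way in is to run the whole merge under cProfile and then sort by cumulative time. A sketch (untested as written here; the emerge path may differ on your install):

  # Profile a single binpkg merge; emerge is plain Python, so it can be
  # handed to the interpreter directly.
  ROOT=/tmp/timetest python3 -m cProfile -o /tmp/emerge.prof \
      "$(command -v emerge)" -1k --nodeps openssl

  # Show the 30 most expensive call paths by cumulative time.
  python3 -c "import pstats; pstats.Stats('/tmp/emerge.prof').sort_stats('cumulative').print_stats(30)"

One caveat: emerge forks helper processes for parts of the work, and cProfile only sees the parent, so if the profile fails to account for most of the wall time, something like strace -c, or a sampling profiler that can follow children (py-spy has a --subprocesses flag), would be the next thing to try.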