Re: [Rpm-maint] [rpm-software-management/rpm] Please support smaller build-ids (#950)
By the way, just following up with fresher data by checking out the https://debuginfod.fedoraproject.org/metrics. We're counting 15.7 million distinct buildids across 3+ fedora (32,33,34,rawhide) complete build histories across all arches, which is about 24 bits = 3 bytes. The traditional 20-byte buildid hash provides ample uniqueness probability, but so could something considerably shorter. -- You are receiving this because you commented. Reply to this email directly or view it on GitHub: https://github.com/rpm-software-management/rpm/issues/950#issuecomment-851711090___ Rpm-maint mailing list Rpm-maint@lists.rpm.org http://lists.rpm.org/mailman/listinfo/rpm-maint
Re: [Rpm-maint] [rpm-software-management/rpm] Please support smaller build-ids (#950)
I extrapolated incorrectly to 600 million because of an off-by-1024 error in my calculations, not because of other architectures / releases. Those add a factor of 20ish (for fedora). -- You are receiving this because you commented. Reply to this email directly or view it on GitHub: https://github.com/rpm-software-management/rpm/issues/950#issuecomment-559105713___ Rpm-maint mailing list Rpm-maint@lists.rpm.org http://lists.rpm.org/mailman/listinfo/rpm-maint
Re: [Rpm-maint] [rpm-software-management/rpm] Please support smaller build-ids (#950)
Hi Frank, Thanks so much for trying to come up with actual numbers! On Tue, 2019-11-26 at 13:07 -0800, Frank Ch. Eigler wrote: > Argh, very sorry @espindola but I was 3 orders of magnitude too high > with my extrapolated numbers. I hereby hand back my math card. The > collision probability is more in the 10e-6 range. This seems mood since the issue is already closed and lld will probably switch to a 128 or 160 bit build-id anyway. But I am still interested in the actual numbers and I am now also a little confused. When you say you were off by 3 orders of magnitude to you mean that for one architecture (x86_64) and one release (fedora 30) you estimate that there are ~600.000 build-ids (instead of ~600.000.000) because that is the number of executable artifacts? Or that you extrapolated it (which number?) wrongly when looking at the number of arches (i386 and x86_64 can be installed concurrently) and actual distros (3 normally) that overlap in maintenance? Or asked differently, which number of build-ids are you expecting to come to a collision probability of 10e-6 when the build-id is just 64bits? Thanks, Mark ___ Rpm-maint mailing list Rpm-maint@lists.rpm.org http://lists.rpm.org/mailman/listinfo/rpm-maint
Re: [Rpm-maint] [rpm-software-management/rpm] Please support smaller build-ids (#950)
Closed #950. -- You are receiving this because you commented. Reply to this email directly or view it on GitHub: https://github.com/rpm-software-management/rpm/issues/950#event-2835387089___ Rpm-maint mailing list Rpm-maint@lists.rpm.org http://lists.rpm.org/mailman/listinfo/rpm-maint
Re: [Rpm-maint] [rpm-software-management/rpm] Please support smaller build-ids (#950)
Non-unique build-ids become file conflicts in rpm, and end up preventing installations packages when the build-id links are in the main package, ie at least Fedora + RHEL and derivates. That's why strong hashes in build-id calculation are a must. -- You are receiving this because you commented. Reply to this email directly or view it on GitHub: https://github.com/rpm-software-management/rpm/issues/950#issuecomment-559058807___ Rpm-maint mailing list Rpm-maint@lists.rpm.org http://lists.rpm.org/mailman/listinfo/rpm-maint
Re: [Rpm-maint] [rpm-software-management/rpm] Please support smaller build-ids (#950)
Argh, very sorry @espindola but I was 3 orders of magnitude too high with my extrapolated numbers. I hereby hand back my math card. The collision probability is more in the 10e-6 range. -- You are receiving this because you commented. Reply to this email directly or view it on GitHub: https://github.com/rpm-software-management/rpm/issues/950#issuecomment-558817548___ Rpm-maint mailing list Rpm-maint@lists.rpm.org http://lists.rpm.org/mailman/listinfo/rpm-maint
Re: [Rpm-maint] [rpm-software-management/rpm] Please support smaller build-ids (#950)
lld bug: https://bugs.llvm.org/show_bug.cgi?id=44138 -- You are receiving this because you commented. Reply to this email directly or view it on GitHub: https://github.com/rpm-software-management/rpm/issues/950#issuecomment-558802296___ Rpm-maint mailing list Rpm-maint@lists.rpm.org http://lists.rpm.org/mailman/listinfo/rpm-maint
Re: [Rpm-maint] [rpm-software-management/rpm] Please support smaller build-ids (#950)
Fair enough. Fell free to close this. I will forward your analysis to the lld developers. -- You are receiving this because you commented. Reply to this email directly or view it on GitHub: https://github.com/rpm-software-management/rpm/issues/950#issuecomment-558800517___ Rpm-maint mailing list Rpm-maint@lists.rpm.org http://lists.rpm.org/mailman/listinfo/rpm-maint
Re: [Rpm-maint] [rpm-software-management/rpm] Please support smaller build-ids (#950)
To complete my partial distro-wide buildid analysis, for fedora30 x86-64 (from its first branch last summer till the present) all built packages add up to apprx. 780 GB of build artifacts. Based on an extrapolation from a single-day 'compose' set of one version of each package, this maps to *very roughly* 600 million distinct buildids over the course of a distro for that arch, and it still has another eight months of lifespan left. Looking at the birthday-attack probability table, the probability of collisions at that kind of scale (10**9) with a 64-bit hash is already on the several-percent level. -- You are receiving this because you commented. Reply to this email directly or view it on GitHub: https://github.com/rpm-software-management/rpm/issues/950#issuecomment-558794690___ Rpm-maint mailing list Rpm-maint@lists.rpm.org http://lists.rpm.org/mailman/listinfo/rpm-maint
Re: [Rpm-maint] [rpm-software-management/rpm] Please support smaller build-ids (#950)
To complete my partial distro-wide buildid analysis, for fedora30 x86-64 (from its first branch last summer till the present) all built packages add up to apprx. 780 GB of build artifacts. Based on an extrapolation from a single-day 'compose' set of one version of each package, this maps to *very roughly* 600 million distinct buildids over the course of a distro for that arch, and it still has another eight months of lifespan left. Looking at the birthday-attack probability table, the probability of collisions at that kind of scale (10**9) with a 64-bit hash is already on the several-percent level. -- You are receiving this because you commented. Reply to this email directly or view it on GitHub: https://github.com/rpm-software-management/rpm/issues/950#issuecomment-558794536___ Rpm-maint mailing list Rpm-maint@lists.rpm.org http://lists.rpm.org/mailman/listinfo/rpm-maint
Re: [Rpm-maint] [rpm-software-management/rpm] Please support smaller build-ids (#950)
Yes, I used ac37755c60ba19103f08f04d07ca8f1d640153d6. It is impressive how fast sha1 is now, but it is still much slower than xxhash. Linking a release version of scylla I got none 2.022252056 seconds fast 2.061659922 seconds md5 2.740843349 seconds sha1 2.318114449 seconds -- You are receiving this because you commented. Reply to this email directly or view it on GitHub: https://github.com/rpm-software-management/rpm/issues/950#issuecomment-558734316___ Rpm-maint mailing list Rpm-maint@lists.rpm.org http://lists.rpm.org/mailman/listinfo/rpm-maint
Re: [Rpm-maint] [rpm-software-management/rpm] Please support smaller build-ids (#950)
@espindola Did you use lld that is newer than 2019-11-11? The LLVM SHA1 was optimized recently https://github.com/llvm/llvm-project/commit/43ff63477256d584cf506dba0c222c28231b0ccc . It is not included in llvm 9.*, though. -- You are receiving this because you commented. Reply to this email directly or view it on GitHub: https://github.com/rpm-software-management/rpm/issues/950#issuecomment-558531072___ Rpm-maint mailing list Rpm-maint@lists.rpm.org http://lists.rpm.org/mailman/listinfo/rpm-maint
Re: [Rpm-maint] [rpm-software-management/rpm] Please support smaller build-ids (#950)
> Note that there is precedent for using only 64 bits. The dwarf standard > (http://www.dwarfstd.org/doc/DWARF5.pdf section 7.32 "Type Signature > Computation") uses the last 8 bytes of md5 for type signatures. I see your point, but it is likely that the number of binaries in an OS distribution is orders of magnitude greater than the number of distinct type declarations / signatures of its source code. And AIUI, the type signature code is meant for use primarily for indexing within or amongst a small set of related dwarf files already, i.e., another level of buildid type indexing would already have been applied to tell them apart. This is not to say that rpm per se should be enforcing distro-scale concerns - but if it doesn't, some other piece of distro releng tooling must. An 8-byte (16-hex-char) code is likely too small within a single distro/arch, never mind globally. (I'm collecting some buildid population figures for a slice of fedora via debuginfod as we speak.) -- You are receiving this because you commented. Reply to this email directly or view it on GitHub: https://github.com/rpm-software-management/rpm/issues/950#issuecomment-558253495___ Rpm-maint mailing list Rpm-maint@lists.rpm.org http://lists.rpm.org/mailman/listinfo/rpm-maint
Re: [Rpm-maint] [rpm-software-management/rpm] Please support smaller build-ids (#950)
On Sat, 2019-11-23 at 09:53 -0800, Rafael Ávila de Espíndola wrote: > > If we really need 128 bits, which hash would you suggest for a fast > buildid? Farmhash comes to mind. 128 bits is the minimum that GNU binutils ld and gold support, you can choose between uuid (128bits), md5 (128bits) and sha1 (160bits). Where sha1 is the default. If you want maximum compatibility you would use a hash that is 160bits/20bytes. I have seen code that simply hardcodes build-ids being 20 bytes (those should be fixed of course, but that might take time). But I don't know if 64bits/8bytes is enough, assuming the hash code is good the the birthday problem probabilty table shows the difference between using 64bit and 128bit hashes: https://en.wikipedia.org/wiki/Birthday_problem#Probability_table I'll ask some people who have tried to index all executables of a distribution what number of executables they have encountered. ___ Rpm-maint mailing list Rpm-maint@lists.rpm.org http://lists.rpm.org/mailman/listinfo/rpm-maint
Re: [Rpm-maint] [rpm-software-management/rpm] Please support smaller build-ids (#950)
This is the birthday problem, so a good 64 bit hash needs about 32 bit worth of hashes for a collision. Note that there is precedent for using only 64 bits. The dwarf standard (http://www.dwarfstd.org/doc/DWARF5.pdf section 7.32 "Type Signature Computation") uses the last 8 bytes of md5 for type signatures. As for the quality, I think xxhash is very good (http://cyan4973.github.io/xxHash/) , but I don't have any detailed knowledge of hashes. If we really need 128 bits, which hash would you suggest for a fast buildid? Farmhash comes to mind. -- You are receiving this because you commented. Reply to this email directly or view it on GitHub: https://github.com/rpm-software-management/rpm/issues/950#issuecomment-557818947___ Rpm-maint mailing list Rpm-maint@lists.rpm.org http://lists.rpm.org/mailman/listinfo/rpm-maint
[Rpm-maint] [rpm-software-management/rpm] Please support smaller build-ids (#950)
Currently rpm requires build ids to be between 16 and 64 bytes in size: https://github.com/rpm-software-management/rpm/blob/b7d427728b8ba8734ba47d51849a5736bdd727cd/build/files.c#L1883 Since the main intention of build-id is to avoid accidentally using the wrong binary, lld defaults to xxhash which is much faster than md5 or sha1 but only 8 bytes long. -- You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub: https://github.com/rpm-software-management/rpm/issues/950___ Rpm-maint mailing list Rpm-maint@lists.rpm.org http://lists.rpm.org/mailman/listinfo/rpm-maint