Re: [Rpm-maint] [rpm-software-management/rpm] Please support smaller build-ids (#950)

2021-05-31 Thread Frank Ch. Eigler
By the way, just following up with fresher data by checking out the 
https://debuginfod.fedoraproject.org/metrics.  We're counting 15.7 million 
distinct buildids across 3+ fedora (32,33,34,rawhide) complete build histories 
across all arches, which is about 24 bits = 3 bytes.  The traditional 20-byte 
buildid hash provides ample uniqueness probability, but so could something 
considerably shorter.

-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/rpm-software-management/rpm/issues/950#issuecomment-851711090___
Rpm-maint mailing list
Rpm-maint@lists.rpm.org
http://lists.rpm.org/mailman/listinfo/rpm-maint


Re: [Rpm-maint] [rpm-software-management/rpm] Please support smaller build-ids (#950)

2019-11-27 Thread Frank Ch. Eigler
I extrapolated incorrectly to 600 million because of an off-by-1024 error in my 
calculations, not because of other architectures / releases.  Those add a 
factor of 20ish (for fedora).

-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/rpm-software-management/rpm/issues/950#issuecomment-559105713___
Rpm-maint mailing list
Rpm-maint@lists.rpm.org
http://lists.rpm.org/mailman/listinfo/rpm-maint


Re: [Rpm-maint] [rpm-software-management/rpm] Please support smaller build-ids (#950)

2019-11-27 Thread Mark Wielaard
Hi Frank,

Thanks so much for trying to come up with actual numbers!

On Tue, 2019-11-26 at 13:07 -0800, Frank Ch. Eigler wrote:
> Argh, very sorry @espindola but I was 3 orders of magnitude too high
> with my extrapolated numbers.  I hereby hand back my math card.  The
> collision probability is more in the 10e-6 range.

This seems mood since the issue is already closed and lld will probably
switch to a 128 or 160 bit build-id anyway. But I am still interested
in the actual numbers and I am now also a little confused.

When you say you were off by 3 orders of magnitude to you mean that for
one architecture (x86_64) and one release (fedora 30) you estimate that
there are ~600.000 build-ids (instead of ~600.000.000) because that is
the number of executable artifacts? Or that you extrapolated it (which
number?) wrongly when looking at the number of arches (i386 and x86_64
can be installed concurrently) and actual distros (3 normally) that
overlap in maintenance?

Or asked differently, which number of build-ids are you expecting to
come to a collision probability of 10e-6 when the build-id is just
64bits?

Thanks,

Mark
___
Rpm-maint mailing list
Rpm-maint@lists.rpm.org
http://lists.rpm.org/mailman/listinfo/rpm-maint


Re: [Rpm-maint] [rpm-software-management/rpm] Please support smaller build-ids (#950)

2019-11-27 Thread Panu Matilainen
Closed #950.

-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/rpm-software-management/rpm/issues/950#event-2835387089___
Rpm-maint mailing list
Rpm-maint@lists.rpm.org
http://lists.rpm.org/mailman/listinfo/rpm-maint


Re: [Rpm-maint] [rpm-software-management/rpm] Please support smaller build-ids (#950)

2019-11-27 Thread Panu Matilainen
Non-unique build-ids become file conflicts in rpm, and end up preventing 
installations packages when the build-id links are in the main package, ie at 
least Fedora + RHEL and derivates. That's why strong hashes in build-id 
calculation are a must.

-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/rpm-software-management/rpm/issues/950#issuecomment-559058807___
Rpm-maint mailing list
Rpm-maint@lists.rpm.org
http://lists.rpm.org/mailman/listinfo/rpm-maint


Re: [Rpm-maint] [rpm-software-management/rpm] Please support smaller build-ids (#950)

2019-11-26 Thread Frank Ch. Eigler
Argh, very sorry @espindola but I was 3 orders of magnitude too high with my 
extrapolated numbers.  I hereby hand back my math card.  The collision 
probability is more in the 10e-6 range.

-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/rpm-software-management/rpm/issues/950#issuecomment-558817548___
Rpm-maint mailing list
Rpm-maint@lists.rpm.org
http://lists.rpm.org/mailman/listinfo/rpm-maint


Re: [Rpm-maint] [rpm-software-management/rpm] Please support smaller build-ids (#950)

2019-11-26 Thread Rafael Ávila de Espíndola
lld bug: https://bugs.llvm.org/show_bug.cgi?id=44138

-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/rpm-software-management/rpm/issues/950#issuecomment-558802296___
Rpm-maint mailing list
Rpm-maint@lists.rpm.org
http://lists.rpm.org/mailman/listinfo/rpm-maint


Re: [Rpm-maint] [rpm-software-management/rpm] Please support smaller build-ids (#950)

2019-11-26 Thread Rafael Ávila de Espíndola
Fair enough.  Fell free to close this. I will forward your analysis to the lld 
developers.

-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/rpm-software-management/rpm/issues/950#issuecomment-558800517___
Rpm-maint mailing list
Rpm-maint@lists.rpm.org
http://lists.rpm.org/mailman/listinfo/rpm-maint


Re: [Rpm-maint] [rpm-software-management/rpm] Please support smaller build-ids (#950)

2019-11-26 Thread Frank Ch. Eigler
To complete my partial distro-wide buildid analysis, for fedora30 x86-64
(from its first branch last summer till the present) all built packages add up 
to apprx. 780 GB of build artifacts.  Based on an extrapolation from a 
single-day 'compose' set of one version of each package, this maps to *very 
roughly* 600 million distinct buildids over the course of a distro for that 
arch, and it still has another eight months of lifespan left.

Looking at the birthday-attack probability table, the probability of collisions 
at that kind of scale (10**9) with a 64-bit hash is already on the 
several-percent level.

-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/rpm-software-management/rpm/issues/950#issuecomment-558794690___
Rpm-maint mailing list
Rpm-maint@lists.rpm.org
http://lists.rpm.org/mailman/listinfo/rpm-maint


Re: [Rpm-maint] [rpm-software-management/rpm] Please support smaller build-ids (#950)

2019-11-26 Thread Frank Ch. Eigler
To complete my partial distro-wide buildid analysis, for fedora30 x86-64
(from its first branch last summer till the present) all built packages add up 
to apprx. 780 GB of build artifacts.  Based on an extrapolation from a 
single-day 'compose' set of one version of each package, this maps to *very 
roughly* 600 million distinct buildids over the course of a distro for that 
arch, and it still has another eight months of lifespan left.

Looking at the birthday-attack probability table, the probability of collisions 
at that kind of scale (10**9) with a 64-bit hash is already on the 
several-percent level.

-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/rpm-software-management/rpm/issues/950#issuecomment-558794536___
Rpm-maint mailing list
Rpm-maint@lists.rpm.org
http://lists.rpm.org/mailman/listinfo/rpm-maint


Re: [Rpm-maint] [rpm-software-management/rpm] Please support smaller build-ids (#950)

2019-11-26 Thread Rafael Ávila de Espíndola
Yes, I used ac37755c60ba19103f08f04d07ca8f1d640153d6.
It is impressive how fast sha1 is now, but it is still much slower than xxhash. 
Linking a release version of scylla I got

none 2.022252056 seconds
fast 2.061659922 seconds
md5  2.740843349 seconds
sha1 2.318114449 seconds


-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/rpm-software-management/rpm/issues/950#issuecomment-558734316___
Rpm-maint mailing list
Rpm-maint@lists.rpm.org
http://lists.rpm.org/mailman/listinfo/rpm-maint


Re: [Rpm-maint] [rpm-software-management/rpm] Please support smaller build-ids (#950)

2019-11-26 Thread Fangrui Song
@espindola Did you use lld that is newer than 2019-11-11? The LLVM SHA1 was 
optimized recently 
https://github.com/llvm/llvm-project/commit/43ff63477256d584cf506dba0c222c28231b0ccc
 . It is not included in llvm 9.*, though.

-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/rpm-software-management/rpm/issues/950#issuecomment-558531072___
Rpm-maint mailing list
Rpm-maint@lists.rpm.org
http://lists.rpm.org/mailman/listinfo/rpm-maint


Re: [Rpm-maint] [rpm-software-management/rpm] Please support smaller build-ids (#950)

2019-11-25 Thread Frank Ch. Eigler
>  Note that there is precedent for using only 64 bits. The dwarf standard 
> (http://www.dwarfstd.org/doc/DWARF5.pdf section 7.32 "Type Signature 
> Computation") uses the last 8 bytes of md5 for type signatures.

I see your point, but it is likely that the number of binaries in an OS 
distribution is orders of magnitude greater than the number of distinct type 
declarations / signatures of its source code.  And AIUI, the type signature 
code is meant for use primarily for indexing within or amongst a small set of 
related dwarf files already, i.e., another level of buildid type indexing would 
already have been applied to tell them apart.

This is not to say that rpm per se should be enforcing distro-scale concerns - 
but if it doesn't, some other piece of distro releng tooling must.  An 8-byte 
(16-hex-char) code is likely too small within a single distro/arch, never mind 
globally.  (I'm collecting some buildid population figures for a slice of 
fedora via debuginfod as we speak.)

-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/rpm-software-management/rpm/issues/950#issuecomment-558253495___
Rpm-maint mailing list
Rpm-maint@lists.rpm.org
http://lists.rpm.org/mailman/listinfo/rpm-maint


Re: [Rpm-maint] [rpm-software-management/rpm] Please support smaller build-ids (#950)

2019-11-25 Thread Mark Wielaard
On Sat, 2019-11-23 at 09:53 -0800, Rafael Ávila de Espíndola wrote:
> 
> If we really need 128 bits, which hash would you suggest for a fast
> buildid? Farmhash comes to mind.

128 bits is the minimum that GNU binutils ld and gold support, you can
choose between uuid (128bits), md5 (128bits) and sha1 (160bits). Where
sha1 is the default. If you want maximum compatibility you would use a
hash that is 160bits/20bytes. I have seen code that simply hardcodes
build-ids being 20 bytes (those should be fixed of course, but that
might take time).

But I don't know if 64bits/8bytes is enough, assuming the hash code is
good the the birthday problem probabilty table shows the difference
between using 64bit and 128bit hashes:
https://en.wikipedia.org/wiki/Birthday_problem#Probability_table

I'll ask some people who have tried to index all executables of a
distribution what number of executables they have encountered.
___
Rpm-maint mailing list
Rpm-maint@lists.rpm.org
http://lists.rpm.org/mailman/listinfo/rpm-maint


Re: [Rpm-maint] [rpm-software-management/rpm] Please support smaller build-ids (#950)

2019-11-23 Thread Rafael Ávila de Espíndola
This is the birthday problem, so a good 64 bit hash needs about 32 bit worth of 
hashes for a collision. Note that there is precedent for using only 64 bits. 
The dwarf standard (http://www.dwarfstd.org/doc/DWARF5.pdf section 7.32  "Type 
Signature Computation") uses the last 8 bytes of md5 for type signatures.

As for the quality, I think xxhash is very good 
(http://cyan4973.github.io/xxHash/) , but I don't have any detailed knowledge 
of hashes.

If we really need 128 bits, which hash would you suggest for a fast buildid? 
Farmhash comes to mind.

-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/rpm-software-management/rpm/issues/950#issuecomment-557818947___
Rpm-maint mailing list
Rpm-maint@lists.rpm.org
http://lists.rpm.org/mailman/listinfo/rpm-maint


[Rpm-maint] [rpm-software-management/rpm] Please support smaller build-ids (#950)

2019-11-22 Thread Rafael Ávila de Espíndola
Currently rpm requires build ids to be between 16 and 64 bytes in size:

https://github.com/rpm-software-management/rpm/blob/b7d427728b8ba8734ba47d51849a5736bdd727cd/build/files.c#L1883

Since the main intention of build-id is to avoid accidentally using the wrong 
binary, lld defaults to xxhash which is much faster than md5 or sha1 but only 8 
bytes long.



-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/rpm-software-management/rpm/issues/950___
Rpm-maint mailing list
Rpm-maint@lists.rpm.org
http://lists.rpm.org/mailman/listinfo/rpm-maint