Re: [PATCH] scripts/setlocalversion: make git describe output more reliable
On Thu, Sep 17, 2020 at 4:31 AM Rasmus Villemoes wrote: > > On 16/09/2020 20.01, Masahiro Yamada wrote: > > On Thu, Sep 17, 2020 at 12:23 AM Rasmus Villemoes > > wrote: > >> > >> On 16/09/2020 16.28, Masahiro Yamada wrote: > >>> On Fri, Sep 11, 2020 at 5:28 PM Rasmus Villemoes > >>> wrote: > > > >>> Why do we care about the uniqueness of the abbreviated > >>> hash in the whole git history? > >> > >> Because when I ask a customer "what kernel are you running", and they > >> tell me "4.19.45-rt67-00567-123abc8", I prefer that 123abc8 uniquely > >> identifies the right commit, instead of me having to dig around each of > >> the commits starting with that prefix and see which one of them matches > >> "is exactly 567 commits from 4.19.45-rt67". 7 hexchars is nowhere near > >> enough for that today, which is why Linus himself did that "auto-tune > >> abbrev based on repo size" patch for git.git. > > > > I like: > > > >git rev-parse --verify HEAD 2>/dev/null | cut -c1-15 > > > > better than: > > > >git rev-parse --verify --short=15 HEAD 2>/dev/null > > > > The former produces the deterministic kernelrelease string. > > > > > > But, let's reconsider if we need as long as 15-digits. > > > > There are a couple of kinds of objects in git: commit, tree, blob. > > > > I think you came up with 15-digits to ensure the uniqueness > > in _all_ kinds of objects. > > > > But, when your customer says "4.19.45-rt67-00567-123abc8", > > 123abc8 is apparently a commit instead of a tree or a blob. > > > > > > > > In the context of the kernel version, we can exclude > > tree and blob objects. > > > > In other words, I think "grows ~6 objects per release" > > is somewhat over-estimating. > > > > We have ~15000 commits per release. So, the difference > > is just 4x factor, though... > > > > > > > > BTW, I did some experiments, and I noticed > > "git log" accepts a shorter hash > > than "git show" does presumably because > > "git log" only takes a commit. > > > > > > > > For example, "git show 06a0d" fails, but > > "git log 06a0d" works. > > > > > > masahiro@oscar:~/ref/linux$ git show 06a0d > > error: short SHA1 06a0d is ambiguous > > hint: The candidates are: > > hint: 06a0df4d1b8b commit 2020-09-08 - fbcon: remove now unusued > > 'softback_lines' cursor() argument > > hint: 06a0d81911b3 tree > > hint: 06a0dc5a84d2 tree > > hint: 06a0d1947c77 blob > > hint: 06a0df434249 blob > > fatal: ambiguous argument '06a0d': unknown revision or path not in the > > working tree. > > Use '--' to separate paths from revisions, like this: > > 'git [...] -- [...]' > > masahiro@oscar:~/ref/linux$ git log --oneline -1 06a0d > > 06a0df4d1b8b fbcon: remove now unusued 'softback_lines' cursor() argument > > > > > > > > > > What is interesting is, if you prepend -- > > to the abbreviated hash, "git show" accepts as short as a commit > > "git log" does. > > > > masahiro@oscar:~/ref/linux$ git show v5.9-rc5-2-g06a0d | head -1 > > commit 06a0df4d1b8b13b551668e47b11fd7629033b7df > > > > > > I guess git cleverly understands it refers to a commit object > > since "v5.9-rc5-2-" prefix only makes sense > > in the commit context. > > > > Well, yes, but unfortunately only so far as applying the same "the user > means a commit object" logic; git doesn't do anything to actually use > the prefix to disambiguate. In fact, looking at a semi-random commit in > 4.19-stable v4.19.114-7-g236c445eb352: > > $ git show 236c44 > error: short SHA1 236c44 is ambiguous > hint: The candidates are: > hint: 236c445eb352 commit 2020-03-13 - drm/bochs: downgrade > pci_request_region failure from error to warning > hint: 236c448cb6e7 commit 2011-09-08 - usbmon vs. tcpdump: fix dropped > packet count > hint: 236c445b1930 blob > > Now you're right that we get > > $ git show v4.19.114-7-g236c445 | head -n1 > commit 236c445eb3529aa7c976f8812513c3cb26d77e27 > > so it knows we're not looking at that blob. But "git show > v4.19.114-7-g236c44" still fails because there are two commits. Adding > to the fun is that "git show v4.19.114-7-g236c448" actually shows the > usbmon commit, which is old v3.2 stuff and certainly not a descendant of > v4.19.114. Oh, I did not notice that. Maybe git can be improved to check the validity of the 'git describe' form, but that is another story. > I didn't actually know that git would even accept those prefixed strings > as possible refspecs, and for a moment you had me hoping that git would > actually do the disambiguation using that prefix, which would certainly > make 7 hex chars enough; unfortunately that's not the case. > > > Keeping those above in mind, I believe 15-digits is too long. > > Well, as you indicated, commits are probably ~1/4 of all objects, i.e. > half a hexchar worth of bits. So the commit/object distinction shouldn't > matter very much for the choice of suitable fixed length. > > > So, I propose two options. > > > > > > [1] 7 digits > > > > The abbreviated hash part may not uniquel
Re: [PATCH] scripts/setlocalversion: make git describe output more reliable
On 16/09/2020 16.28, Masahiro Yamada wrote: > On Fri, Sep 11, 2020 at 5:28 PM Rasmus Villemoes > wrote: >> >> On 10/09/2020 21.05, Brian Norris wrote: >>> On Thu, Sep 10, 2020 at 7:35 AM Masahiro Yamada >>> wrote: On Thu, Sep 10, 2020 at 8:57 PM Rasmus Villemoes wrote: > So in order to avoid `uname -a` output relying on such random details > of the build environment which are rather hard to ensure are > consistent between developers and buildbots, use an explicit > --abbrev=15 option (and for consistency, also use rev-parse --short=15 > for the unlikely case of no signed tags being usable). >>> >>> For the patch: >>> >>> Reviewed-by: Brian Norris >>> I agree that any randomness should be avoided. My question is, do we need 15-digits? >>> ... So, I think the conflict happens only when we have two commits that start with the same 7-digits in the _same_ release. Is this correct? >> >> No. >> >>> For git-describe (the case where we have a tag to base off): >>> "use digits, or as many digits as needed to form a unique object name" >> >> Yes, the abbreviated hash that `git describe` produces is unique among >> all objects (and objects are more than just commits) in the current >> repo, so what matters for probability-of-collision is the total number >> of objects - linus.git itself probably grows ~6 objects per release. >> >> As for "do we need 15 digits", well, in theory the next time I merge the >> next rt-stable tag into our kernel I could end up with a commit that >> matches some existing object in the first 33 hex chars at which point I >> should have chosen 34 - but of course that's so unlikely that it's >> irrelevant. >> >> But the upshot of that is that there really is no objective answer to >> "how many digits do we need", so it becomes a tradeoff between "low >> enough probability that anyone anywhere in the next few years would have >> needed more than X when building their own kernel" and readability of >> `uname -r` etc. So I decided somewhat arbitrarily that each time one >> rolls a new release, there should be less than 1e-9 probability that >> HEAD collides with some other object when abbreviated to X hexchars. >> X=12 doesn't pass that criteria even when the repo has only 10M objects >> (and, current linus.git already has objects that need 12 to be unique, >> so such collisions do happen in practice, though of course very rarely). >> 13 and 14 are just weird numbers, so I ended with 15, which also allows >> many many more objects in the repo before the probability crosses that >> 1e-9 threshold. >> > > > This is because you use the output from git as-is. > > Why are you still trying to rely on such obscure behavior of git? > > > It is pretty easy to get the fixed number of digits reliably. > > For example, > git rev-parse --verify HEAD 2>/dev/null | cut -c1-7 > > > It always returns a 7-digits hash, > and two different people will get the same result for sure. > > Your solution, --short=15, means "at least 15", > which still contains ambiguity. > > > > As I already noted, the kernelrelease string is > constructed in this format: > > -- > > > For example, if I enable CONFIG_LOCALVERSION_AUTO=y > in today's Linus tree, I got this: > > 5.9.0-rc5-5-gfc4f28bb3daf > > > What if the number of digits were 7? > > 5.9.0-rc5-5-gfc4f28b > > I see no problem here. The problem is that currently, the build relies on things which cannot easily be controlled or reproduced between different developers: Apart from git version (which is reasonable to mandate is the same, just like one must use same compiler, binutils etc. to get binary reproducible output), it depends on whether ~/.gitconfig has a core.abbrev setting - and even worse, it depends on the _total number of objects in the source git repo_, i.e. something that has nothing to do with what is currently in the work tree at all. And that leads to real bugs when one builds external modules that end up in one directory in the rootfs, while `uname -r` tells modprobe to look in some other directory (differing in the length of the abbreviated hash). > Even if we have another object that starts with "fc4f28b", > the kernelrelease string is still unique thanks to the > - prefix. > > Why do we care about the uniqueness of the abbreviated > hash in the whole git history? Because when I ask a customer "what kernel are you running", and they tell me "4.19.45-rt67-00567-123abc8", I prefer that 123abc8 uniquely identifies the right commit, instead of me having to dig around each of the commits starting with that prefix and see which one of them matches "is exactly 567 commits from 4.19.45-rt67". 7 hexchars is nowhere near enough for that today, which is why Linus himself did that "auto-tune abbrev based on repo size" patch for git.git. But what I mostly care about is getting a consistent result. And yes, you're right that the porcelain command 'git describe' could end
Re: [PATCH] scripts/setlocalversion: make git describe output more reliable
On 16/09/2020 20.01, Masahiro Yamada wrote: > On Thu, Sep 17, 2020 at 12:23 AM Rasmus Villemoes > wrote: >> >> On 16/09/2020 16.28, Masahiro Yamada wrote: >>> On Fri, Sep 11, 2020 at 5:28 PM Rasmus Villemoes >>> wrote: >>> Why do we care about the uniqueness of the abbreviated >>> hash in the whole git history? >> >> Because when I ask a customer "what kernel are you running", and they >> tell me "4.19.45-rt67-00567-123abc8", I prefer that 123abc8 uniquely >> identifies the right commit, instead of me having to dig around each of >> the commits starting with that prefix and see which one of them matches >> "is exactly 567 commits from 4.19.45-rt67". 7 hexchars is nowhere near >> enough for that today, which is why Linus himself did that "auto-tune >> abbrev based on repo size" patch for git.git. > > I like: > >git rev-parse --verify HEAD 2>/dev/null | cut -c1-15 > > better than: > >git rev-parse --verify --short=15 HEAD 2>/dev/null > > The former produces the deterministic kernelrelease string. > > > But, let's reconsider if we need as long as 15-digits. > > There are a couple of kinds of objects in git: commit, tree, blob. > > I think you came up with 15-digits to ensure the uniqueness > in _all_ kinds of objects. > > But, when your customer says "4.19.45-rt67-00567-123abc8", > 123abc8 is apparently a commit instead of a tree or a blob. > > > > In the context of the kernel version, we can exclude > tree and blob objects. > > In other words, I think "grows ~6 objects per release" > is somewhat over-estimating. > > We have ~15000 commits per release. So, the difference > is just 4x factor, though... > > > > BTW, I did some experiments, and I noticed > "git log" accepts a shorter hash > than "git show" does presumably because > "git log" only takes a commit. > > > > For example, "git show 06a0d" fails, but > "git log 06a0d" works. > > > masahiro@oscar:~/ref/linux$ git show 06a0d > error: short SHA1 06a0d is ambiguous > hint: The candidates are: > hint: 06a0df4d1b8b commit 2020-09-08 - fbcon: remove now unusued > 'softback_lines' cursor() argument > hint: 06a0d81911b3 tree > hint: 06a0dc5a84d2 tree > hint: 06a0d1947c77 blob > hint: 06a0df434249 blob > fatal: ambiguous argument '06a0d': unknown revision or path not in the > working tree. > Use '--' to separate paths from revisions, like this: > 'git [...] -- [...]' > masahiro@oscar:~/ref/linux$ git log --oneline -1 06a0d > 06a0df4d1b8b fbcon: remove now unusued 'softback_lines' cursor() argument > > > > > What is interesting is, if you prepend -- > to the abbreviated hash, "git show" accepts as short as a commit > "git log" does. > > masahiro@oscar:~/ref/linux$ git show v5.9-rc5-2-g06a0d | head -1 > commit 06a0df4d1b8b13b551668e47b11fd7629033b7df > > > I guess git cleverly understands it refers to a commit object > since "v5.9-rc5-2-" prefix only makes sense > in the commit context. > Well, yes, but unfortunately only so far as applying the same "the user means a commit object" logic; git doesn't do anything to actually use the prefix to disambiguate. In fact, looking at a semi-random commit in 4.19-stable v4.19.114-7-g236c445eb352: $ git show 236c44 error: short SHA1 236c44 is ambiguous hint: The candidates are: hint: 236c445eb352 commit 2020-03-13 - drm/bochs: downgrade pci_request_region failure from error to warning hint: 236c448cb6e7 commit 2011-09-08 - usbmon vs. tcpdump: fix dropped packet count hint: 236c445b1930 blob Now you're right that we get $ git show v4.19.114-7-g236c445 | head -n1 commit 236c445eb3529aa7c976f8812513c3cb26d77e27 so it knows we're not looking at that blob. But "git show v4.19.114-7-g236c44" still fails because there are two commits. Adding to the fun is that "git show v4.19.114-7-g236c448" actually shows the usbmon commit, which is old v3.2 stuff and certainly not a descendant of v4.19.114. I didn't actually know that git would even accept those prefixed strings as possible refspecs, and for a moment you had me hoping that git would actually do the disambiguation using that prefix, which would certainly make 7 hex chars enough; unfortunately that's not the case. > Keeping those above in mind, I believe 15-digits is too long. Well, as you indicated, commits are probably ~1/4 of all objects, i.e. half a hexchar worth of bits. So the commit/object distinction shouldn't matter very much for the choice of suitable fixed length. > So, I propose two options. > > > [1] 7 digits > > The abbreviated hash part may not uniquely identify > the commit. In that case, you need some extra git > operations to find out which one is it. > > As for the kernel build, > --<7-digits> is enough > to provide the unique kernelrelease string. > > > > [2] 12 digits > > This matches to the Fixes: tag policy specified in > Documentation/process/submitting-patches.rst > > The abbreviated hash part is very likely unique > in the commit object space. > (Of cou
Re: [PATCH] scripts/setlocalversion: make git describe output more reliable
On Thu, Sep 17, 2020 at 12:23 AM Rasmus Villemoes wrote: > > On 16/09/2020 16.28, Masahiro Yamada wrote: > > On Fri, Sep 11, 2020 at 5:28 PM Rasmus Villemoes > > wrote: > >> > >> On 10/09/2020 21.05, Brian Norris wrote: > >>> On Thu, Sep 10, 2020 at 7:35 AM Masahiro Yamada > >>> wrote: > On Thu, Sep 10, 2020 at 8:57 PM Rasmus Villemoes > wrote: > > So in order to avoid `uname -a` output relying on such random details > > of the build environment which are rather hard to ensure are > > consistent between developers and buildbots, use an explicit > > --abbrev=15 option (and for consistency, also use rev-parse --short=15 > > for the unlikely case of no signed tags being usable). > >>> > >>> For the patch: > >>> > >>> Reviewed-by: Brian Norris > >>> > I agree that any randomness should be avoided. > > My question is, do we need 15-digits? > >>> ... > So, I think the conflict happens > only when we have two commits that start with the same 7-digits > in the _same_ release. Is this correct? > >> > >> No. > >> > >>> For git-describe (the case where we have a tag to base off): > >>> "use digits, or as many digits as needed to form a unique object name" > >> > >> Yes, the abbreviated hash that `git describe` produces is unique among > >> all objects (and objects are more than just commits) in the current > >> repo, so what matters for probability-of-collision is the total number > >> of objects - linus.git itself probably grows ~6 objects per release. > >> > >> As for "do we need 15 digits", well, in theory the next time I merge the > >> next rt-stable tag into our kernel I could end up with a commit that > >> matches some existing object in the first 33 hex chars at which point I > >> should have chosen 34 - but of course that's so unlikely that it's > >> irrelevant. > >> > >> But the upshot of that is that there really is no objective answer to > >> "how many digits do we need", so it becomes a tradeoff between "low > >> enough probability that anyone anywhere in the next few years would have > >> needed more than X when building their own kernel" and readability of > >> `uname -r` etc. So I decided somewhat arbitrarily that each time one > >> rolls a new release, there should be less than 1e-9 probability that > >> HEAD collides with some other object when abbreviated to X hexchars. > >> X=12 doesn't pass that criteria even when the repo has only 10M objects > >> (and, current linus.git already has objects that need 12 to be unique, > >> so such collisions do happen in practice, though of course very rarely). > >> 13 and 14 are just weird numbers, so I ended with 15, which also allows > >> many many more objects in the repo before the probability crosses that > >> 1e-9 threshold. > >> > > > > > > This is because you use the output from git as-is. > > > > Why are you still trying to rely on such obscure behavior of git? > > > > > > It is pretty easy to get the fixed number of digits reliably. > > > > For example, > > git rev-parse --verify HEAD 2>/dev/null | cut -c1-7 > > > > > > It always returns a 7-digits hash, > > and two different people will get the same result for sure. > > > > Your solution, --short=15, means "at least 15", > > which still contains ambiguity. > > > > > > > > As I already noted, the kernelrelease string is > > constructed in this format: > > > > -- > > > > > > For example, if I enable CONFIG_LOCALVERSION_AUTO=y > > in today's Linus tree, I got this: > > > > 5.9.0-rc5-5-gfc4f28bb3daf > > > > > > What if the number of digits were 7? > > > > 5.9.0-rc5-5-gfc4f28b > > > > I see no problem here. > > The problem is that currently, the build relies on things which cannot > easily be controlled or reproduced between different developers: Apart > from git version (which is reasonable to mandate is the same, just like > one must use same compiler, binutils etc. to get binary reproducible > output), it depends on whether ~/.gitconfig has a core.abbrev setting - > and even worse, it depends on the _total number of objects in the source > git repo_, i.e. something that has nothing to do with what is currently > in the work tree at all. > > And that leads to real bugs when one builds external modules that end up > in one directory in the rootfs, while `uname -r` tells modprobe to look > in some other directory (differing in the length of the abbreviated hash). > > > Even if we have another object that starts with "fc4f28b", > > the kernelrelease string is still unique thanks to the > > - prefix. > > > > Why do we care about the uniqueness of the abbreviated > > hash in the whole git history? > > Because when I ask a customer "what kernel are you running", and they > tell me "4.19.45-rt67-00567-123abc8", I prefer that 123abc8 uniquely > identifies the right commit, instead of me having to dig around each of > the commits starting with that prefix and see which one of them matches > "is exactly 567 commits from 4.19.45-rt67". 7
Re: [PATCH] scripts/setlocalversion: make git describe output more reliable
On Fri, Sep 11, 2020 at 5:28 PM Rasmus Villemoes wrote: > > On 10/09/2020 21.05, Brian Norris wrote: > > On Thu, Sep 10, 2020 at 7:35 AM Masahiro Yamada > > wrote: > >> On Thu, Sep 10, 2020 at 8:57 PM Rasmus Villemoes > >> wrote: > >>> So in order to avoid `uname -a` output relying on such random details > >>> of the build environment which are rather hard to ensure are > >>> consistent between developers and buildbots, use an explicit > >>> --abbrev=15 option (and for consistency, also use rev-parse --short=15 > >>> for the unlikely case of no signed tags being usable). > > > > For the patch: > > > > Reviewed-by: Brian Norris > > > >> I agree that any randomness should be avoided. > >> > >> My question is, do we need 15-digits? > > ... > >> So, I think the conflict happens > >> only when we have two commits that start with the same 7-digits > >> in the _same_ release. Is this correct? > > No. > > > For git-describe (the case where we have a tag to base off): > > "use digits, or as many digits as needed to form a unique object name" > > Yes, the abbreviated hash that `git describe` produces is unique among > all objects (and objects are more than just commits) in the current > repo, so what matters for probability-of-collision is the total number > of objects - linus.git itself probably grows ~6 objects per release. > > As for "do we need 15 digits", well, in theory the next time I merge the > next rt-stable tag into our kernel I could end up with a commit that > matches some existing object in the first 33 hex chars at which point I > should have chosen 34 - but of course that's so unlikely that it's > irrelevant. > > But the upshot of that is that there really is no objective answer to > "how many digits do we need", so it becomes a tradeoff between "low > enough probability that anyone anywhere in the next few years would have > needed more than X when building their own kernel" and readability of > `uname -r` etc. So I decided somewhat arbitrarily that each time one > rolls a new release, there should be less than 1e-9 probability that > HEAD collides with some other object when abbreviated to X hexchars. > X=12 doesn't pass that criteria even when the repo has only 10M objects > (and, current linus.git already has objects that need 12 to be unique, > so such collisions do happen in practice, though of course very rarely). > 13 and 14 are just weird numbers, so I ended with 15, which also allows > many many more objects in the repo before the probability crosses that > 1e-9 threshold. > > Rasmus > > Side note 1: Note that there really isn't any such thing as "last > tag/previous tag" in a DAG; I think what git does is a best-effort > search for the eligible tag that minimizes #({objects reachable from > commit-to-be-described} - {objects reachable from tag}), where eligible > tag means at least reachable from commit-to-be-described (so that set > difference is a proper one), but there can be additional constraints > (e.g. --match=...). > > Side note 2: Linus or Greg releases are actually not interesting here > (see the logic in setlocalversion that stops early if we're exactly at > an annotated tag) - the whole raison d'etre for setlocalversion is that > people do maintain and build non-mainline kernels, and it's extremely > useful for `uname -a` to accurately tell just which commit is running on > the target. This is because you use the output from git as-is. Why are you still trying to rely on such obscure behavior of git? It is pretty easy to get the fixed number of digits reliably. For example, git rev-parse --verify HEAD 2>/dev/null | cut -c1-7 It always returns a 7-digits hash, and two different people will get the same result for sure. Your solution, --short=15, means "at least 15", which still contains ambiguity. As I already noted, the kernelrelease string is constructed in this format: -- For example, if I enable CONFIG_LOCALVERSION_AUTO=y in today's Linus tree, I got this: 5.9.0-rc5-5-gfc4f28bb3daf What if the number of digits were 7? 5.9.0-rc5-5-gfc4f28b I see no problem here. Even if we have another object that starts with "fc4f28b", the kernelrelease string is still unique thanks to the - prefix. Why do we care about the uniqueness of the abbreviated hash in the whole git history? Note: We prepend $(KERNELVERSION) to the result of setlocalversion. [1] [1] https://github.com/torvalds/linux/blob/v5.9-rc4/Makefile#L1175 -- Best Regards Masahiro Yamada
Re: [PATCH] scripts/setlocalversion: make git describe output more reliable
On 10/09/2020 16.34, Masahiro Yamada wrote: > On Thu, Sep 10, 2020 at 8:57 PM Rasmus Villemoes > wrote: >> ... >> >> So in order to avoid `uname -a` output relying on such random details >> of the build environment which are rather hard to ensure are >> consistent between developers and buildbots, use an explicit >> --abbrev=15 option (and for consistency, also use rev-parse --short=15 >> for the unlikely case of no signed tags being usable). >> Hi Masahiro Can I get you to carry this through the kbuild tree? Unless of course your questions/concerns haven't been addressed. Thanks, Rasmus
Re: [PATCH] scripts/setlocalversion: make git describe output more reliable
On 10/09/2020 21.05, Brian Norris wrote: > On Thu, Sep 10, 2020 at 7:35 AM Masahiro Yamada wrote: >> On Thu, Sep 10, 2020 at 8:57 PM Rasmus Villemoes >> wrote: >>> So in order to avoid `uname -a` output relying on such random details >>> of the build environment which are rather hard to ensure are >>> consistent between developers and buildbots, use an explicit >>> --abbrev=15 option (and for consistency, also use rev-parse --short=15 >>> for the unlikely case of no signed tags being usable). > > For the patch: > > Reviewed-by: Brian Norris > >> I agree that any randomness should be avoided. >> >> My question is, do we need 15-digits? > ... >> So, I think the conflict happens >> only when we have two commits that start with the same 7-digits >> in the _same_ release. Is this correct? No. > For git-describe (the case where we have a tag to base off): > "use digits, or as many digits as needed to form a unique object name" Yes, the abbreviated hash that `git describe` produces is unique among all objects (and objects are more than just commits) in the current repo, so what matters for probability-of-collision is the total number of objects - linus.git itself probably grows ~6 objects per release. As for "do we need 15 digits", well, in theory the next time I merge the next rt-stable tag into our kernel I could end up with a commit that matches some existing object in the first 33 hex chars at which point I should have chosen 34 - but of course that's so unlikely that it's irrelevant. But the upshot of that is that there really is no objective answer to "how many digits do we need", so it becomes a tradeoff between "low enough probability that anyone anywhere in the next few years would have needed more than X when building their own kernel" and readability of `uname -r` etc. So I decided somewhat arbitrarily that each time one rolls a new release, there should be less than 1e-9 probability that HEAD collides with some other object when abbreviated to X hexchars. X=12 doesn't pass that criteria even when the repo has only 10M objects (and, current linus.git already has objects that need 12 to be unique, so such collisions do happen in practice, though of course very rarely). 13 and 14 are just weird numbers, so I ended with 15, which also allows many many more objects in the repo before the probability crosses that 1e-9 threshold. Rasmus Side note 1: Note that there really isn't any such thing as "last tag/previous tag" in a DAG; I think what git does is a best-effort search for the eligible tag that minimizes #({objects reachable from commit-to-be-described} - {objects reachable from tag}), where eligible tag means at least reachable from commit-to-be-described (so that set difference is a proper one), but there can be additional constraints (e.g. --match=...). Side note 2: Linus or Greg releases are actually not interesting here (see the logic in setlocalversion that stops early if we're exactly at an annotated tag) - the whole raison d'etre for setlocalversion is that people do maintain and build non-mainline kernels, and it's extremely useful for `uname -a` to accurately tell just which commit is running on the target.
Re: [PATCH] scripts/setlocalversion: make git describe output more reliable
On 23:34 Thu 10 Sep 2020, Masahiro Yamada wrote: On Thu, Sep 10, 2020 at 8:57 PM Rasmus Villemoes wrote: When building for an embedded target using Yocto, we're sometimes observing that the version string that gets built into vmlinux (and thus what uname -a reports) differs from the path under /lib/modules/ where modules get installed in the rootfs, but only in the length of the -gabc123def suffix. Hence modprobe always fails. The problem is that Yocto has the concept of "sstate" (shared state), which allows different developers/buildbots/etc. to share build artifacts, based on a hash of all the metadata that went into building that artifact - and that metadata includes all dependencies (e.g. the compiler used etc.). That normally works quite well; usually a clean build (without using any sstate cache) done by one developer ends up being binary identical to a build done on another host. However, one thing that can cause two developers to end up with different builds [and thus make one's vmlinux package incompatible with the other's kernel-dev package], which is not captured by the metadata hashing, is this `git describe`: The output of that can be affected by (1) git version: before 2.11 git defaulted to a minimum of 7, since 2.11 (git.git commit e6c587) the default is dynamic based on the number of objects in the repo (2) hence even if both run the same git version, the output can differ based on how many remotes are being tracked (or just lots of local development branches or plain old garbage) (3) and of course somebody could have a core.abbrev config setting in ~/.gitconfig So in order to avoid `uname -a` output relying on such random details of the build environment which are rather hard to ensure are consistent between developers and buildbots, use an explicit --abbrev=15 option (and for consistency, also use rev-parse --short=15 for the unlikely case of no signed tags being usable). Now, why is 60 bits enough for everyone? It's not mathematically guaranteed that git won't have to use 16 in some git repo, but it is beyond unlikely: Even in a repo with 100M objects, the probability that any given commit (i.e. the one being described) clashes with some other object in the first 15 hex chars is less than 1e-10, and currently a git repo tracking Linus', -stable and -rt only has around 10M objects. I agree that any randomness should be avoided. My question is, do we need 15-digits? The kernelrelease is formed by [kernel version] + [some digits of git hash]. For example, "git describe" shows as follows: v5.9.0-rc4-00034-g7fe10096c150 Linus gives a new tag every week (or every two week). So, I think the conflict happens only when we have two commits that start with the same 7-digits in the _same_ release. Is this correct? We have 14000 - 15000 commits in each release, not 100M. I kinda agree with this...we need to chopped down the excess bits from the information . Indeed , a 15 digits is too long to keep up. ~Bhaskar Signed-off-by: Rasmus Villemoes --- I could probably fix things by adding a 'git config --local core.abbrev 15' step to the Yocto build process after the repo to build from has been cloned but before building has started. But in the interest of binary reproducibility outside of just Yocto builds, I think it's better if this lives in the kernel. scripts/setlocalversion | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/scripts/setlocalversion b/scripts/setlocalversion index 20f2efd57b11..c5262f0d953d 100755 --- a/scripts/setlocalversion +++ b/scripts/setlocalversion @@ -45,7 +45,7 @@ scm_version() # Check for git and a git repo. if test -z "$(git rev-parse --show-cdup 2>/dev/null)" && - head=$(git rev-parse --verify --short HEAD 2>/dev/null); then + head=$(git rev-parse --verify --short=15 HEAD 2>/dev/null); then # If we are at a tagged commit (like "v2.6.30-rc6"), we ignore # it, because this version is defined in the top level Makefile. @@ -59,7 +59,7 @@ scm_version() fi # If we are past a tagged commit (like # "v2.6.30-rc5-302-g72357d5"), we pretty print it. - if atag="$(git describe 2>/dev/null)"; then + if atag="$(git describe --abbrev=15 2>/dev/null)"; then echo "$atag" | awk -F- '{printf("-%05d-%s", $(NF-1),$(NF))}' # If we don't have a tag at all we print -g{commitish}. -- 2.23.0 -- Best Regards Masahiro Yamada signature.asc Description: PGP signature
[PATCH] scripts/setlocalversion: make git describe output more reliable
When building for an embedded target using Yocto, we're sometimes observing that the version string that gets built into vmlinux (and thus what uname -a reports) differs from the path under /lib/modules/ where modules get installed in the rootfs, but only in the length of the -gabc123def suffix. Hence modprobe always fails. The problem is that Yocto has the concept of "sstate" (shared state), which allows different developers/buildbots/etc. to share build artifacts, based on a hash of all the metadata that went into building that artifact - and that metadata includes all dependencies (e.g. the compiler used etc.). That normally works quite well; usually a clean build (without using any sstate cache) done by one developer ends up being binary identical to a build done on another host. However, one thing that can cause two developers to end up with different builds [and thus make one's vmlinux package incompatible with the other's kernel-dev package], which is not captured by the metadata hashing, is this `git describe`: The output of that can be affected by (1) git version: before 2.11 git defaulted to a minimum of 7, since 2.11 (git.git commit e6c587) the default is dynamic based on the number of objects in the repo (2) hence even if both run the same git version, the output can differ based on how many remotes are being tracked (or just lots of local development branches or plain old garbage) (3) and of course somebody could have a core.abbrev config setting in ~/.gitconfig So in order to avoid `uname -a` output relying on such random details of the build environment which are rather hard to ensure are consistent between developers and buildbots, use an explicit --abbrev=15 option (and for consistency, also use rev-parse --short=15 for the unlikely case of no signed tags being usable). Now, why is 60 bits enough for everyone? It's not mathematically guaranteed that git won't have to use 16 in some git repo, but it is beyond unlikely: Even in a repo with 100M objects, the probability that any given commit (i.e. the one being described) clashes with some other object in the first 15 hex chars is less than 1e-10, and currently a git repo tracking Linus', -stable and -rt only has around 10M objects. Signed-off-by: Rasmus Villemoes --- I could probably fix things by adding a 'git config --local core.abbrev 15' step to the Yocto build process after the repo to build from has been cloned but before building has started. But in the interest of binary reproducibility outside of just Yocto builds, I think it's better if this lives in the kernel. scripts/setlocalversion | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/scripts/setlocalversion b/scripts/setlocalversion index 20f2efd57b11..c5262f0d953d 100755 --- a/scripts/setlocalversion +++ b/scripts/setlocalversion @@ -45,7 +45,7 @@ scm_version() # Check for git and a git repo. if test -z "$(git rev-parse --show-cdup 2>/dev/null)" && - head=$(git rev-parse --verify --short HEAD 2>/dev/null); then + head=$(git rev-parse --verify --short=15 HEAD 2>/dev/null); then # If we are at a tagged commit (like "v2.6.30-rc6"), we ignore # it, because this version is defined in the top level Makefile. @@ -59,7 +59,7 @@ scm_version() fi # If we are past a tagged commit (like # "v2.6.30-rc5-302-g72357d5"), we pretty print it. - if atag="$(git describe 2>/dev/null)"; then + if atag="$(git describe --abbrev=15 2>/dev/null)"; then echo "$atag" | awk -F- '{printf("-%05d-%s", $(NF-1),$(NF))}' # If we don't have a tag at all we print -g{commitish}. -- 2.23.0
Re: [PATCH] scripts/setlocalversion: make git describe output more reliable
On Thu, Sep 10, 2020 at 8:57 PM Rasmus Villemoes wrote: > > When building for an embedded target using Yocto, we're sometimes > observing that the version string that gets built into vmlinux (and > thus what uname -a reports) differs from the path under /lib/modules/ > where modules get installed in the rootfs, but only in the length of > the -gabc123def suffix. Hence modprobe always fails. > > The problem is that Yocto has the concept of "sstate" (shared state), > which allows different developers/buildbots/etc. to share build > artifacts, based on a hash of all the metadata that went into building > that artifact - and that metadata includes all dependencies (e.g. the > compiler used etc.). That normally works quite well; usually a clean > build (without using any sstate cache) done by one developer ends up > being binary identical to a build done on another host. However, one > thing that can cause two developers to end up with different builds > [and thus make one's vmlinux package incompatible with the other's > kernel-dev package], which is not captured by the metadata hashing, is > this `git describe`: The output of that can be affected by > > (1) git version: before 2.11 git defaulted to a minimum of 7, since > 2.11 (git.git commit e6c587) the default is dynamic based on the > number of objects in the repo > (2) hence even if both run the same git version, the output can differ > based on how many remotes are being tracked (or just lots of local > development branches or plain old garbage) > (3) and of course somebody could have a core.abbrev config setting in > ~/.gitconfig > > So in order to avoid `uname -a` output relying on such random details > of the build environment which are rather hard to ensure are > consistent between developers and buildbots, use an explicit > --abbrev=15 option (and for consistency, also use rev-parse --short=15 > for the unlikely case of no signed tags being usable). > > Now, why is 60 bits enough for everyone? It's not mathematically > guaranteed that git won't have to use 16 in some git repo, but it is > beyond unlikely: Even in a repo with 100M objects, the probability > that any given commit (i.e. the one being described) clashes with some > other object in the first 15 hex chars is less than 1e-10, and > currently a git repo tracking Linus', -stable and -rt only has around > 10M objects. I agree that any randomness should be avoided. My question is, do we need 15-digits? The kernelrelease is formed by [kernel version] + [some digits of git hash]. For example, "git describe" shows as follows: v5.9.0-rc4-00034-g7fe10096c150 Linus gives a new tag every week (or every two week). So, I think the conflict happens only when we have two commits that start with the same 7-digits in the _same_ release. Is this correct? We have 14000 - 15000 commits in each release, not 100M. > > Signed-off-by: Rasmus Villemoes > --- > I could probably fix things by adding a 'git config --local > core.abbrev 15' step to the Yocto build process after the repo to > build from has been cloned but before building has started. But in the > interest of binary reproducibility outside of just Yocto builds, I > think it's better if this lives in the kernel. > > scripts/setlocalversion | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/scripts/setlocalversion b/scripts/setlocalversion > index 20f2efd57b11..c5262f0d953d 100755 > --- a/scripts/setlocalversion > +++ b/scripts/setlocalversion > @@ -45,7 +45,7 @@ scm_version() > > # Check for git and a git repo. > if test -z "$(git rev-parse --show-cdup 2>/dev/null)" && > - head=$(git rev-parse --verify --short HEAD 2>/dev/null); then > + head=$(git rev-parse --verify --short=15 HEAD 2>/dev/null); then > > # If we are at a tagged commit (like "v2.6.30-rc6"), we ignore > # it, because this version is defined in the top level > Makefile. > @@ -59,7 +59,7 @@ scm_version() > fi > # If we are past a tagged commit (like > # "v2.6.30-rc5-302-g72357d5"), we pretty print it. > - if atag="$(git describe 2>/dev/null)"; then > + if atag="$(git describe --abbrev=15 2>/dev/null)"; > then > echo "$atag" | awk -F- '{printf("-%05d-%s", > $(NF-1),$(NF))}' > > # If we don't have a tag at all we print > -g{commitish}. > -- > 2.23.0 > -- Best Regards Masahiro Yamada
Re: [PATCH] scripts/setlocalversion: make git describe output more reliable
On Thu, Sep 10, 2020 at 7:35 AM Masahiro Yamada wrote: > On Thu, Sep 10, 2020 at 8:57 PM Rasmus Villemoes > wrote: > > So in order to avoid `uname -a` output relying on such random details > > of the build environment which are rather hard to ensure are > > consistent between developers and buildbots, use an explicit > > --abbrev=15 option (and for consistency, also use rev-parse --short=15 > > for the unlikely case of no signed tags being usable). For the patch: Reviewed-by: Brian Norris > I agree that any randomness should be avoided. > > My question is, do we need 15-digits? ... > So, I think the conflict happens > only when we have two commits that start with the same 7-digits > in the _same_ release. Is this correct? For the rev-parse usage ("unlikely case where we have no signed tag"), the total number of objects is definitely relevant, and the man-page says: "unique prefix with at least length characters" i.e., it might be longer, if needed for uniqueness. For git-describe (the case where we have a tag to base off): "use digits, or as many digits as needed to form a unique object name" I'm not quite sure whether the uniqueness is including anything about the tag-relative prefix, but if not, then we have the same problem. At least, that's my reading of the situation. Brian
Re: [PATCH] scripts/setlocalversion: make git describe output more reliable
On 9/10/20 4:26 AM, Rasmus Villemoes wrote: > When building for an embedded target using Yocto, we're sometimes > observing that the version string that gets built into vmlinux (and > thus what uname -a reports) differs from the path under /lib/modules/ > where modules get installed in the rootfs, but only in the length of > the -gabc123def suffix. Hence modprobe always fails. > > The problem is that Yocto has the concept of "sstate" (shared state), > which allows different developers/buildbots/etc. to share build > artifacts, based on a hash of all the metadata that went into building > that artifact - and that metadata includes all dependencies (e.g. the > compiler used etc.). That normally works quite well; usually a clean > build (without using any sstate cache) done by one developer ends up > being binary identical to a build done on another host. However, one > thing that can cause two developers to end up with different builds > [and thus make one's vmlinux package incompatible with the other's > kernel-dev package], which is not captured by the metadata hashing, is > this `git describe`: The output of that can be affected by > > (1) git version: before 2.11 git defaulted to a minimum of 7, since > 2.11 (git.git commit e6c587) the default is dynamic based on the > number of objects in the repo > (2) hence even if both run the same git version, the output can differ > based on how many remotes are being tracked (or just lots of local > development branches or plain old garbage) > (3) and of course somebody could have a core.abbrev config setting in > ~/.gitconfig > > So in order to avoid `uname -a` output relying on such random details > of the build environment which are rather hard to ensure are > consistent between developers and buildbots, use an explicit > --abbrev=15 option (and for consistency, also use rev-parse --short=15 > for the unlikely case of no signed tags being usable). > > Now, why is 60 bits enough for everyone? It's not mathematically > guaranteed that git won't have to use 16 in some git repo, but it is > beyond unlikely: Even in a repo with 100M objects, the probability > that any given commit (i.e. the one being described) clashes with some > other object in the first 15 hex chars is less than 1e-10, and > currently a git repo tracking Linus', -stable and -rt only has around > 10M objects. > > Signed-off-by: Rasmus Villemoes Makes sense to me. Acked-by: Guenter Roeck > --- > I could probably fix things by adding a 'git config --local > core.abbrev 15' step to the Yocto build process after the repo to > build from has been cloned but before building has started. But in the > interest of binary reproducibility outside of just Yocto builds, I > think it's better if this lives in the kernel. > > scripts/setlocalversion | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/scripts/setlocalversion b/scripts/setlocalversion > index 20f2efd57b11..c5262f0d953d 100755 > --- a/scripts/setlocalversion > +++ b/scripts/setlocalversion > @@ -45,7 +45,7 @@ scm_version() > > # Check for git and a git repo. > if test -z "$(git rev-parse --show-cdup 2>/dev/null)" && > -head=$(git rev-parse --verify --short HEAD 2>/dev/null); then > +head=$(git rev-parse --verify --short=15 HEAD 2>/dev/null); then > > # If we are at a tagged commit (like "v2.6.30-rc6"), we ignore > # it, because this version is defined in the top level Makefile. > @@ -59,7 +59,7 @@ scm_version() > fi > # If we are past a tagged commit (like > # "v2.6.30-rc5-302-g72357d5"), we pretty print it. > - if atag="$(git describe 2>/dev/null)"; then > + if atag="$(git describe --abbrev=15 2>/dev/null)"; then > echo "$atag" | awk -F- '{printf("-%05d-%s", > $(NF-1),$(NF))}' > > # If we don't have a tag at all we print -g{commitish}. >