Re: [PATCH] scripts/setlocalversion: make git describe output more reliable

2020-09-16 Thread Masahiro Yamada
On Thu, Sep 17, 2020 at 4:31 AM Rasmus Villemoes
 wrote:
>
> On 16/09/2020 20.01, Masahiro Yamada wrote:
> > On Thu, Sep 17, 2020 at 12:23 AM Rasmus Villemoes
> >  wrote:
> >>
> >> On 16/09/2020 16.28, Masahiro Yamada wrote:
> >>> On Fri, Sep 11, 2020 at 5:28 PM Rasmus Villemoes
> >>>  wrote:
> 
>
> >>> Why do we care about the uniqueness of the abbreviated
> >>> hash in the whole git history?
> >>
> >> Because when I ask a customer "what kernel are you running", and they
> >> tell me "4.19.45-rt67-00567-123abc8", I prefer that 123abc8 uniquely
> >> identifies the right commit, instead of me having to dig around each of
> >> the commits starting with that prefix and see which one of them matches
> >> "is exactly 567 commits from 4.19.45-rt67". 7 hexchars is nowhere near
> >> enough for that today, which is why Linus himself did that "auto-tune
> >> abbrev based on repo size" patch for git.git.
> >
> > I like:
> >
> >git rev-parse --verify HEAD 2>/dev/null | cut -c1-15
> >
> > better than:
> >
> >git rev-parse --verify --short=15 HEAD 2>/dev/null
> >
> > The former produces the deterministic kernelrelease string.
> >
> >
> > But, let's reconsider if we need as long as 15-digits.
> >
> > There are a couple of kinds of objects in git: commit, tree, blob.
> >
> > I think you came up with 15-digits to ensure the uniqueness
> > in _all_ kinds of objects.
> >
> > But, when your customer says "4.19.45-rt67-00567-123abc8",
> > 123abc8 is apparently a commit instead of a tree or a blob.
> >
> >
> >
> > In the context of the kernel version, we can exclude
> > tree and blob objects.
> >
> > In other words, I think "grows ~6 objects per release"
> > is somewhat over-estimating.
> >
> > We have ~15000 commits per release. So, the difference
> > is just 4x factor, though...
> >
> >
> >
> > BTW, I did some experiments, and I noticed
> > "git log" accepts a shorter hash
> > than "git show" does presumably because
> > "git log" only takes a commit.
> >
> >
> >
> > For example, "git show 06a0d" fails, but
> > "git log 06a0d" works.
> >
> >
> > masahiro@oscar:~/ref/linux$ git  show   06a0d
> > error: short SHA1 06a0d is ambiguous
> > hint: The candidates are:
> > hint:   06a0df4d1b8b commit 2020-09-08 - fbcon: remove now unusued
> > 'softback_lines' cursor() argument
> > hint:   06a0d81911b3 tree
> > hint:   06a0dc5a84d2 tree
> > hint:   06a0d1947c77 blob
> > hint:   06a0df434249 blob
> > fatal: ambiguous argument '06a0d': unknown revision or path not in the
> > working tree.
> > Use '--' to separate paths from revisions, like this:
> > 'git  [...] -- [...]'
> > masahiro@oscar:~/ref/linux$ git  log --oneline  -1   06a0d
> > 06a0df4d1b8b fbcon: remove now unusued 'softback_lines' cursor() argument
> >
> >
> >
> >
> > What is interesting is, if you prepend --
> > to the abbreviated hash, "git show" accepts as short as a commit
> > "git log" does.
> >
> > masahiro@oscar:~/ref/linux$ git  show   v5.9-rc5-2-g06a0d  | head -1
> > commit 06a0df4d1b8b13b551668e47b11fd7629033b7df
> >
> >
> > I guess git cleverly understands it refers to a commit object
> > since "v5.9-rc5-2-" prefix only makes sense
> > in the commit context.
> >
>
> Well, yes, but unfortunately only so far as applying the same "the user
> means a commit object" logic; git doesn't do anything to actually use
> the prefix to disambiguate. In fact, looking at a semi-random commit in
> 4.19-stable v4.19.114-7-g236c445eb352:
>
> $ git show 236c44
> error: short SHA1 236c44 is ambiguous
> hint: The candidates are:
> hint:   236c445eb352 commit 2020-03-13 - drm/bochs: downgrade
> pci_request_region failure from error to warning
> hint:   236c448cb6e7 commit 2011-09-08 - usbmon vs. tcpdump: fix dropped
> packet count
> hint:   236c445b1930 blob
>
> Now you're right that we get
>
> $ git show v4.19.114-7-g236c445 | head -n1
> commit 236c445eb3529aa7c976f8812513c3cb26d77e27
>
> so it knows we're not looking at that blob. But "git show
> v4.19.114-7-g236c44" still fails because there are two commits. Adding
> to the fun is that "git show v4.19.114-7-g236c448" actually shows the
> usbmon commit, which is old v3.2 stuff and certainly not a descendant of
> v4.19.114.


Oh, I did not notice that.
Maybe git can be improved to check the validity of
the 'git describe' form, but that is another story.



> I didn't actually know that git would even accept those prefixed strings
> as possible refspecs, and for a moment you had me hoping that git would
> actually do the disambiguation using that prefix, which would certainly
> make 7 hex chars enough; unfortunately that's not the case.
>
> > Keeping those above in mind, I believe 15-digits is too long.
>
> Well, as you indicated, commits are probably ~1/4 of all objects, i.e.
> half a hexchar worth of bits. So the commit/object distinction shouldn't
> matter very much for the choice of suitable fixed length.
>
> > So, I propose two options.
> >
> >
> > [1] 7 digits
> >
> > The abbreviated hash part may not 

Re: [PATCH] scripts/setlocalversion: make git describe output more reliable

2020-09-16 Thread Rasmus Villemoes
On 16/09/2020 16.28, Masahiro Yamada wrote:
> On Fri, Sep 11, 2020 at 5:28 PM Rasmus Villemoes
>  wrote:
>>
>> On 10/09/2020 21.05, Brian Norris wrote:
>>> On Thu, Sep 10, 2020 at 7:35 AM Masahiro Yamada  
>>> wrote:
 On Thu, Sep 10, 2020 at 8:57 PM Rasmus Villemoes
  wrote:
> So in order to avoid `uname -a` output relying on such random details
> of the build environment which are rather hard to ensure are
> consistent between developers and buildbots, use an explicit
> --abbrev=15 option (and for consistency, also use rev-parse --short=15
> for the unlikely case of no signed tags being usable).
>>>
>>> For the patch:
>>>
>>> Reviewed-by: Brian Norris 
>>>
 I agree that any randomness should be avoided.

 My question is, do we need 15-digits?
>>> ...
 So, I think the conflict happens
 only when we have two commits that start with the same 7-digits
 in the _same_ release. Is this correct?
>>
>> No.
>>
>>> For git-describe (the case where we have a tag to base off):
>>> "use  digits, or as many digits as needed to form a unique object name"
>>
>> Yes, the abbreviated hash that `git describe` produces is unique among
>> all objects (and objects are more than just commits) in the current
>> repo, so what matters for probability-of-collision is the total number
>> of objects - linus.git itself probably grows ~6 objects per release.
>>
>> As for "do we need 15 digits", well, in theory the next time I merge the
>> next rt-stable tag into our kernel I could end up with a commit that
>> matches some existing object in the first 33 hex chars at which point I
>> should have chosen 34 - but of course that's so unlikely that it's
>> irrelevant.
>>
>> But the upshot of that is that there really is no objective answer to
>> "how many digits do we need", so it becomes a tradeoff between "low
>> enough probability that anyone anywhere in the next few years would have
>> needed more than X when building their own kernel" and readability of
>> `uname -r` etc. So I decided somewhat arbitrarily that each time one
>> rolls a new release, there should be less than 1e-9 probability that
>> HEAD collides with some other object when abbreviated to X hexchars.
>> X=12 doesn't pass that criteria even when the repo has only 10M objects
>> (and, current linus.git already has objects that need 12 to be unique,
>> so such collisions do happen in practice, though of course very rarely).
>> 13 and 14 are just weird numbers, so I ended with 15, which also allows
>> many many more objects in the repo before the probability crosses that
>> 1e-9 threshold.
>>
> 
> 
> This is because you use the output from git as-is.
> 
> Why are you still trying to rely on such obscure behavior of git?
> 
> 
> It is pretty easy to get the fixed number of digits reliably.
> 
> For example,
> git rev-parse --verify HEAD 2>/dev/null | cut -c1-7
> 
> 
> It always returns a 7-digits hash,
> and two different people will get the same result for sure.
> 
> Your solution, --short=15, means "at least 15",
> which still contains ambiguity.
> 
> 
> 
> As I already noted, the kernelrelease string is
> constructed in this format:
> 
> --
> 
> 
> For example, if I enable CONFIG_LOCALVERSION_AUTO=y
> in today's Linus tree, I got this:
> 
> 5.9.0-rc5-5-gfc4f28bb3daf
> 
> 
> What if the number of digits were 7?
> 
> 5.9.0-rc5-5-gfc4f28b
> 
> I see no problem here.

The problem is that currently, the build relies on things which cannot
easily be controlled or reproduced between different developers: Apart
from git version (which is reasonable to mandate is the same, just like
one must use same compiler, binutils etc. to get binary reproducible
output), it depends on whether ~/.gitconfig has a core.abbrev setting -
and even worse, it depends on the _total number of objects in the source
git repo_, i.e. something that has nothing to do with what is currently
in the work tree at all.

And that leads to real bugs when one builds external modules that end up
in one directory in the rootfs, while `uname -r` tells modprobe to look
in some other directory (differing in the length of the abbreviated hash).

> Even if we have another object that starts with "fc4f28b",
> the kernelrelease string is still unique thanks to the
> - prefix.
> 
> Why do we care about the uniqueness of the abbreviated
> hash in the whole git history?

Because when I ask a customer "what kernel are you running", and they
tell me "4.19.45-rt67-00567-123abc8", I prefer that 123abc8 uniquely
identifies the right commit, instead of me having to dig around each of
the commits starting with that prefix and see which one of them matches
"is exactly 567 commits from 4.19.45-rt67". 7 hexchars is nowhere near
enough for that today, which is why Linus himself did that "auto-tune
abbrev based on repo size" patch for git.git.

But what I mostly care about is getting a consistent result. And yes,
you're right that the porcelain command 'git describe' could end 

Re: [PATCH] scripts/setlocalversion: make git describe output more reliable

2020-09-16 Thread Rasmus Villemoes
On 16/09/2020 20.01, Masahiro Yamada wrote:
> On Thu, Sep 17, 2020 at 12:23 AM Rasmus Villemoes
>  wrote:
>>
>> On 16/09/2020 16.28, Masahiro Yamada wrote:
>>> On Fri, Sep 11, 2020 at 5:28 PM Rasmus Villemoes
>>>  wrote:


>>> Why do we care about the uniqueness of the abbreviated
>>> hash in the whole git history?
>>
>> Because when I ask a customer "what kernel are you running", and they
>> tell me "4.19.45-rt67-00567-123abc8", I prefer that 123abc8 uniquely
>> identifies the right commit, instead of me having to dig around each of
>> the commits starting with that prefix and see which one of them matches
>> "is exactly 567 commits from 4.19.45-rt67". 7 hexchars is nowhere near
>> enough for that today, which is why Linus himself did that "auto-tune
>> abbrev based on repo size" patch for git.git.
> 
> I like:
> 
>git rev-parse --verify HEAD 2>/dev/null | cut -c1-15
> 
> better than:
> 
>git rev-parse --verify --short=15 HEAD 2>/dev/null
> 
> The former produces the deterministic kernelrelease string.
> 
> 
> But, let's reconsider if we need as long as 15-digits.
> 
> There are a couple of kinds of objects in git: commit, tree, blob.
> 
> I think you came up with 15-digits to ensure the uniqueness
> in _all_ kinds of objects.
> 
> But, when your customer says "4.19.45-rt67-00567-123abc8",
> 123abc8 is apparently a commit instead of a tree or a blob.
> 
> 
> 
> In the context of the kernel version, we can exclude
> tree and blob objects.
> 
> In other words, I think "grows ~6 objects per release"
> is somewhat over-estimating.
> 
> We have ~15000 commits per release. So, the difference
> is just 4x factor, though...
> 
> 
> 
> BTW, I did some experiments, and I noticed
> "git log" accepts a shorter hash
> than "git show" does presumably because
> "git log" only takes a commit.
> 
> 
> 
> For example, "git show 06a0d" fails, but
> "git log 06a0d" works.
> 
> 
> masahiro@oscar:~/ref/linux$ git  show   06a0d
> error: short SHA1 06a0d is ambiguous
> hint: The candidates are:
> hint:   06a0df4d1b8b commit 2020-09-08 - fbcon: remove now unusued
> 'softback_lines' cursor() argument
> hint:   06a0d81911b3 tree
> hint:   06a0dc5a84d2 tree
> hint:   06a0d1947c77 blob
> hint:   06a0df434249 blob
> fatal: ambiguous argument '06a0d': unknown revision or path not in the
> working tree.
> Use '--' to separate paths from revisions, like this:
> 'git  [...] -- [...]'
> masahiro@oscar:~/ref/linux$ git  log --oneline  -1   06a0d
> 06a0df4d1b8b fbcon: remove now unusued 'softback_lines' cursor() argument
> 
> 
> 
> 
> What is interesting is, if you prepend --
> to the abbreviated hash, "git show" accepts as short as a commit
> "git log" does.
> 
> masahiro@oscar:~/ref/linux$ git  show   v5.9-rc5-2-g06a0d  | head -1
> commit 06a0df4d1b8b13b551668e47b11fd7629033b7df
> 
> 
> I guess git cleverly understands it refers to a commit object
> since "v5.9-rc5-2-" prefix only makes sense
> in the commit context.
> 

Well, yes, but unfortunately only so far as applying the same "the user
means a commit object" logic; git doesn't do anything to actually use
the prefix to disambiguate. In fact, looking at a semi-random commit in
4.19-stable v4.19.114-7-g236c445eb352:

$ git show 236c44
error: short SHA1 236c44 is ambiguous
hint: The candidates are:
hint:   236c445eb352 commit 2020-03-13 - drm/bochs: downgrade
pci_request_region failure from error to warning
hint:   236c448cb6e7 commit 2011-09-08 - usbmon vs. tcpdump: fix dropped
packet count
hint:   236c445b1930 blob

Now you're right that we get

$ git show v4.19.114-7-g236c445 | head -n1
commit 236c445eb3529aa7c976f8812513c3cb26d77e27

so it knows we're not looking at that blob. But "git show
v4.19.114-7-g236c44" still fails because there are two commits. Adding
to the fun is that "git show v4.19.114-7-g236c448" actually shows the
usbmon commit, which is old v3.2 stuff and certainly not a descendant of
v4.19.114.

I didn't actually know that git would even accept those prefixed strings
as possible refspecs, and for a moment you had me hoping that git would
actually do the disambiguation using that prefix, which would certainly
make 7 hex chars enough; unfortunately that's not the case.

> Keeping those above in mind, I believe 15-digits is too long.

Well, as you indicated, commits are probably ~1/4 of all objects, i.e.
half a hexchar worth of bits. So the commit/object distinction shouldn't
matter very much for the choice of suitable fixed length.

> So, I propose two options.
> 
> 
> [1] 7 digits
> 
> The abbreviated hash part may not uniquely identify
> the commit. In that case, you need some extra git
> operations to find out which one is it.
> 
> As for the kernel build,
> --<7-digits> is enough
> to provide the unique kernelrelease string.
> 
> 
> 
> [2] 12 digits
> 
> This matches to the Fixes: tag policy specified in
> Documentation/process/submitting-patches.rst
> 
> The abbreviated hash part is very likely unique
> in the commit object space.
> (Of 

Re: [PATCH] scripts/setlocalversion: make git describe output more reliable

2020-09-16 Thread Masahiro Yamada
On Thu, Sep 17, 2020 at 12:23 AM Rasmus Villemoes
 wrote:
>
> On 16/09/2020 16.28, Masahiro Yamada wrote:
> > On Fri, Sep 11, 2020 at 5:28 PM Rasmus Villemoes
> >  wrote:
> >>
> >> On 10/09/2020 21.05, Brian Norris wrote:
> >>> On Thu, Sep 10, 2020 at 7:35 AM Masahiro Yamada  
> >>> wrote:
>  On Thu, Sep 10, 2020 at 8:57 PM Rasmus Villemoes
>   wrote:
> > So in order to avoid `uname -a` output relying on such random details
> > of the build environment which are rather hard to ensure are
> > consistent between developers and buildbots, use an explicit
> > --abbrev=15 option (and for consistency, also use rev-parse --short=15
> > for the unlikely case of no signed tags being usable).
> >>>
> >>> For the patch:
> >>>
> >>> Reviewed-by: Brian Norris 
> >>>
>  I agree that any randomness should be avoided.
> 
>  My question is, do we need 15-digits?
> >>> ...
>  So, I think the conflict happens
>  only when we have two commits that start with the same 7-digits
>  in the _same_ release. Is this correct?
> >>
> >> No.
> >>
> >>> For git-describe (the case where we have a tag to base off):
> >>> "use  digits, or as many digits as needed to form a unique object name"
> >>
> >> Yes, the abbreviated hash that `git describe` produces is unique among
> >> all objects (and objects are more than just commits) in the current
> >> repo, so what matters for probability-of-collision is the total number
> >> of objects - linus.git itself probably grows ~6 objects per release.
> >>
> >> As for "do we need 15 digits", well, in theory the next time I merge the
> >> next rt-stable tag into our kernel I could end up with a commit that
> >> matches some existing object in the first 33 hex chars at which point I
> >> should have chosen 34 - but of course that's so unlikely that it's
> >> irrelevant.
> >>
> >> But the upshot of that is that there really is no objective answer to
> >> "how many digits do we need", so it becomes a tradeoff between "low
> >> enough probability that anyone anywhere in the next few years would have
> >> needed more than X when building their own kernel" and readability of
> >> `uname -r` etc. So I decided somewhat arbitrarily that each time one
> >> rolls a new release, there should be less than 1e-9 probability that
> >> HEAD collides with some other object when abbreviated to X hexchars.
> >> X=12 doesn't pass that criteria even when the repo has only 10M objects
> >> (and, current linus.git already has objects that need 12 to be unique,
> >> so such collisions do happen in practice, though of course very rarely).
> >> 13 and 14 are just weird numbers, so I ended with 15, which also allows
> >> many many more objects in the repo before the probability crosses that
> >> 1e-9 threshold.
> >>
> >
> >
> > This is because you use the output from git as-is.
> >
> > Why are you still trying to rely on such obscure behavior of git?
> >
> >
> > It is pretty easy to get the fixed number of digits reliably.
> >
> > For example,
> > git rev-parse --verify HEAD 2>/dev/null | cut -c1-7
> >
> >
> > It always returns a 7-digits hash,
> > and two different people will get the same result for sure.
> >
> > Your solution, --short=15, means "at least 15",
> > which still contains ambiguity.
> >
> >
> >
> > As I already noted, the kernelrelease string is
> > constructed in this format:
> >
> > --
> >
> >
> > For example, if I enable CONFIG_LOCALVERSION_AUTO=y
> > in today's Linus tree, I got this:
> >
> > 5.9.0-rc5-5-gfc4f28bb3daf
> >
> >
> > What if the number of digits were 7?
> >
> > 5.9.0-rc5-5-gfc4f28b
> >
> > I see no problem here.
>
> The problem is that currently, the build relies on things which cannot
> easily be controlled or reproduced between different developers: Apart
> from git version (which is reasonable to mandate is the same, just like
> one must use same compiler, binutils etc. to get binary reproducible
> output), it depends on whether ~/.gitconfig has a core.abbrev setting -
> and even worse, it depends on the _total number of objects in the source
> git repo_, i.e. something that has nothing to do with what is currently
> in the work tree at all.
>
> And that leads to real bugs when one builds external modules that end up
> in one directory in the rootfs, while `uname -r` tells modprobe to look
> in some other directory (differing in the length of the abbreviated hash).
>
> > Even if we have another object that starts with "fc4f28b",
> > the kernelrelease string is still unique thanks to the
> > - prefix.
> >
> > Why do we care about the uniqueness of the abbreviated
> > hash in the whole git history?
>
> Because when I ask a customer "what kernel are you running", and they
> tell me "4.19.45-rt67-00567-123abc8", I prefer that 123abc8 uniquely
> identifies the right commit, instead of me having to dig around each of
> the commits starting with that prefix and see which one of them matches
> "is exactly 567 commits from 4.19.45-rt67". 7 

Re: [PATCH] scripts/setlocalversion: make git describe output more reliable

2020-09-16 Thread Masahiro Yamada
On Fri, Sep 11, 2020 at 5:28 PM Rasmus Villemoes
 wrote:
>
> On 10/09/2020 21.05, Brian Norris wrote:
> > On Thu, Sep 10, 2020 at 7:35 AM Masahiro Yamada  
> > wrote:
> >> On Thu, Sep 10, 2020 at 8:57 PM Rasmus Villemoes
> >>  wrote:
> >>> So in order to avoid `uname -a` output relying on such random details
> >>> of the build environment which are rather hard to ensure are
> >>> consistent between developers and buildbots, use an explicit
> >>> --abbrev=15 option (and for consistency, also use rev-parse --short=15
> >>> for the unlikely case of no signed tags being usable).
> >
> > For the patch:
> >
> > Reviewed-by: Brian Norris 
> >
> >> I agree that any randomness should be avoided.
> >>
> >> My question is, do we need 15-digits?
> > ...
> >> So, I think the conflict happens
> >> only when we have two commits that start with the same 7-digits
> >> in the _same_ release. Is this correct?
>
> No.
>
> > For git-describe (the case where we have a tag to base off):
> > "use  digits, or as many digits as needed to form a unique object name"
>
> Yes, the abbreviated hash that `git describe` produces is unique among
> all objects (and objects are more than just commits) in the current
> repo, so what matters for probability-of-collision is the total number
> of objects - linus.git itself probably grows ~6 objects per release.
>
> As for "do we need 15 digits", well, in theory the next time I merge the
> next rt-stable tag into our kernel I could end up with a commit that
> matches some existing object in the first 33 hex chars at which point I
> should have chosen 34 - but of course that's so unlikely that it's
> irrelevant.
>
> But the upshot of that is that there really is no objective answer to
> "how many digits do we need", so it becomes a tradeoff between "low
> enough probability that anyone anywhere in the next few years would have
> needed more than X when building their own kernel" and readability of
> `uname -r` etc. So I decided somewhat arbitrarily that each time one
> rolls a new release, there should be less than 1e-9 probability that
> HEAD collides with some other object when abbreviated to X hexchars.
> X=12 doesn't pass that criteria even when the repo has only 10M objects
> (and, current linus.git already has objects that need 12 to be unique,
> so such collisions do happen in practice, though of course very rarely).
> 13 and 14 are just weird numbers, so I ended with 15, which also allows
> many many more objects in the repo before the probability crosses that
> 1e-9 threshold.
>
> Rasmus
>
> Side note 1: Note that there really isn't any such thing as "last
> tag/previous tag" in a DAG; I think what git does is a best-effort
> search for the eligible tag that minimizes #({objects reachable from
> commit-to-be-described} - {objects reachable from tag}), where eligible
> tag means at least reachable from commit-to-be-described (so that set
> difference is a proper one), but there can be additional constraints
> (e.g. --match=...).
>
> Side note 2: Linus or Greg releases are actually not interesting here
> (see the logic in setlocalversion that stops early if we're exactly at
> an annotated tag) - the whole raison d'etre for setlocalversion is that
> people do maintain and build non-mainline kernels, and it's extremely
> useful for `uname -a` to accurately tell just which commit is running on
> the target.



This is because you use the output from git as-is.

Why are you still trying to rely on such obscure behavior of git?


It is pretty easy to get the fixed number of digits reliably.

For example,
git rev-parse --verify HEAD 2>/dev/null | cut -c1-7


It always returns a 7-digits hash,
and two different people will get the same result for sure.

Your solution, --short=15, means "at least 15",
which still contains ambiguity.



As I already noted, the kernelrelease string is
constructed in this format:

--


For example, if I enable CONFIG_LOCALVERSION_AUTO=y
in today's Linus tree, I got this:

5.9.0-rc5-5-gfc4f28bb3daf


What if the number of digits were 7?

5.9.0-rc5-5-gfc4f28b

I see no problem here.

Even if we have another object that starts with "fc4f28b",
the kernelrelease string is still unique thanks to the
- prefix.

Why do we care about the uniqueness of the abbreviated
hash in the whole git history?



Note:
We prepend $(KERNELVERSION) to the result
of setlocalversion. [1]

[1] https://github.com/torvalds/linux/blob/v5.9-rc4/Makefile#L1175


-- 
Best Regards
Masahiro Yamada


Re: [PATCH] scripts/setlocalversion: make git describe output more reliable

2020-09-16 Thread Rasmus Villemoes
On 10/09/2020 16.34, Masahiro Yamada wrote:
> On Thu, Sep 10, 2020 at 8:57 PM Rasmus Villemoes
>  wrote:
>>
...
>>
>> So in order to avoid `uname -a` output relying on such random details
>> of the build environment which are rather hard to ensure are
>> consistent between developers and buildbots, use an explicit
>> --abbrev=15 option (and for consistency, also use rev-parse --short=15
>> for the unlikely case of no signed tags being usable).
>>

Hi Masahiro

Can I get you to carry this through the kbuild tree? Unless of course
your questions/concerns haven't been addressed.

Thanks,
Rasmus


Re: [PATCH] scripts/setlocalversion: make git describe output more reliable

2020-09-11 Thread Rasmus Villemoes
On 10/09/2020 21.05, Brian Norris wrote:
> On Thu, Sep 10, 2020 at 7:35 AM Masahiro Yamada  wrote:
>> On Thu, Sep 10, 2020 at 8:57 PM Rasmus Villemoes
>>  wrote:
>>> So in order to avoid `uname -a` output relying on such random details
>>> of the build environment which are rather hard to ensure are
>>> consistent between developers and buildbots, use an explicit
>>> --abbrev=15 option (and for consistency, also use rev-parse --short=15
>>> for the unlikely case of no signed tags being usable).
> 
> For the patch:
> 
> Reviewed-by: Brian Norris 
> 
>> I agree that any randomness should be avoided.
>>
>> My question is, do we need 15-digits?
> ...
>> So, I think the conflict happens
>> only when we have two commits that start with the same 7-digits
>> in the _same_ release. Is this correct?

No.

> For git-describe (the case where we have a tag to base off):
> "use  digits, or as many digits as needed to form a unique object name"

Yes, the abbreviated hash that `git describe` produces is unique among
all objects (and objects are more than just commits) in the current
repo, so what matters for probability-of-collision is the total number
of objects - linus.git itself probably grows ~6 objects per release.

As for "do we need 15 digits", well, in theory the next time I merge the
next rt-stable tag into our kernel I could end up with a commit that
matches some existing object in the first 33 hex chars at which point I
should have chosen 34 - but of course that's so unlikely that it's
irrelevant.

But the upshot of that is that there really is no objective answer to
"how many digits do we need", so it becomes a tradeoff between "low
enough probability that anyone anywhere in the next few years would have
needed more than X when building their own kernel" and readability of
`uname -r` etc. So I decided somewhat arbitrarily that each time one
rolls a new release, there should be less than 1e-9 probability that
HEAD collides with some other object when abbreviated to X hexchars.
X=12 doesn't pass that criteria even when the repo has only 10M objects
(and, current linus.git already has objects that need 12 to be unique,
so such collisions do happen in practice, though of course very rarely).
13 and 14 are just weird numbers, so I ended with 15, which also allows
many many more objects in the repo before the probability crosses that
1e-9 threshold.

Rasmus

Side note 1: Note that there really isn't any such thing as "last
tag/previous tag" in a DAG; I think what git does is a best-effort
search for the eligible tag that minimizes #({objects reachable from
commit-to-be-described} - {objects reachable from tag}), where eligible
tag means at least reachable from commit-to-be-described (so that set
difference is a proper one), but there can be additional constraints
(e.g. --match=...).

Side note 2: Linus or Greg releases are actually not interesting here
(see the logic in setlocalversion that stops early if we're exactly at
an annotated tag) - the whole raison d'etre for setlocalversion is that
people do maintain and build non-mainline kernels, and it's extremely
useful for `uname -a` to accurately tell just which commit is running on
the target.


Re: [PATCH] scripts/setlocalversion: make git describe output more reliable

2020-09-10 Thread Bhaskar Chowdhury

On 23:34 Thu 10 Sep 2020, Masahiro Yamada wrote:

On Thu, Sep 10, 2020 at 8:57 PM Rasmus Villemoes
 wrote:


When building for an embedded target using Yocto, we're sometimes
observing that the version string that gets built into vmlinux (and
thus what uname -a reports) differs from the path under /lib/modules/
where modules get installed in the rootfs, but only in the length of
the -gabc123def suffix. Hence modprobe always fails.

The problem is that Yocto has the concept of "sstate" (shared state),
which allows different developers/buildbots/etc. to share build
artifacts, based on a hash of all the metadata that went into building
that artifact - and that metadata includes all dependencies (e.g. the
compiler used etc.). That normally works quite well; usually a clean
build (without using any sstate cache) done by one developer ends up
being binary identical to a build done on another host. However, one
thing that can cause two developers to end up with different builds
[and thus make one's vmlinux package incompatible with the other's
kernel-dev package], which is not captured by the metadata hashing, is
this `git describe`: The output of that can be affected by

(1) git version: before 2.11 git defaulted to a minimum of 7, since
2.11 (git.git commit e6c587) the default is dynamic based on the
number of objects in the repo
(2) hence even if both run the same git version, the output can differ
based on how many remotes are being tracked (or just lots of local
development branches or plain old garbage)
(3) and of course somebody could have a core.abbrev config setting in
~/.gitconfig

So in order to avoid `uname -a` output relying on such random details
of the build environment which are rather hard to ensure are
consistent between developers and buildbots, use an explicit
--abbrev=15 option (and for consistency, also use rev-parse --short=15
for the unlikely case of no signed tags being usable).

Now, why is 60 bits enough for everyone? It's not mathematically
guaranteed that git won't have to use 16 in some git repo, but it is
beyond unlikely: Even in a repo with 100M objects, the probability
that any given commit (i.e. the one being described) clashes with some
other object in the first 15 hex chars is less than 1e-10, and
currently a git repo tracking Linus', -stable and -rt only has around
10M objects.



I agree that any randomness should be avoided.

My question is, do we need 15-digits?


The kernelrelease is formed by
[kernel version] + [some digits of git hash].


For example, "git describe" shows as follows:

v5.9.0-rc4-00034-g7fe10096c150


Linus gives a new tag every week (or every two week).


So, I think the conflict happens
only when we have two commits that start with the same 7-digits
in the _same_ release. Is this correct?

We have 14000 - 15000 commits in each release,
not 100M.



 I kinda agree with this...we need to chopped down the excess bits from the
 information .  


 Indeed , a 15 digits is too long  to keep up.

 ~Bhaskar





Signed-off-by: Rasmus Villemoes 
---
I could probably fix things by adding a 'git config --local
core.abbrev 15' step to the Yocto build process after the repo to
build from has been cloned but before building has started. But in the
interest of binary reproducibility outside of just Yocto builds, I
think it's better if this lives in the kernel.

 scripts/setlocalversion | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/scripts/setlocalversion b/scripts/setlocalversion
index 20f2efd57b11..c5262f0d953d 100755
--- a/scripts/setlocalversion
+++ b/scripts/setlocalversion
@@ -45,7 +45,7 @@ scm_version()

# Check for git and a git repo.
if test -z "$(git rev-parse --show-cdup 2>/dev/null)" &&
-  head=$(git rev-parse --verify --short HEAD 2>/dev/null); then
+  head=$(git rev-parse --verify --short=15 HEAD 2>/dev/null); then

# If we are at a tagged commit (like "v2.6.30-rc6"), we ignore
# it, because this version is defined in the top level Makefile.
@@ -59,7 +59,7 @@ scm_version()
fi
# If we are past a tagged commit (like
# "v2.6.30-rc5-302-g72357d5"), we pretty print it.
-   if atag="$(git describe 2>/dev/null)"; then
+   if atag="$(git describe --abbrev=15 2>/dev/null)"; then
echo "$atag" | awk -F- '{printf("-%05d-%s", 
$(NF-1),$(NF))}'

# If we don't have a tag at all we print -g{commitish}.
--
2.23.0




--
Best Regards
Masahiro Yamada


signature.asc
Description: PGP signature


Re: [PATCH] scripts/setlocalversion: make git describe output more reliable

2020-09-10 Thread Masahiro Yamada
On Thu, Sep 10, 2020 at 8:57 PM Rasmus Villemoes
 wrote:
>
> When building for an embedded target using Yocto, we're sometimes
> observing that the version string that gets built into vmlinux (and
> thus what uname -a reports) differs from the path under /lib/modules/
> where modules get installed in the rootfs, but only in the length of
> the -gabc123def suffix. Hence modprobe always fails.
>
> The problem is that Yocto has the concept of "sstate" (shared state),
> which allows different developers/buildbots/etc. to share build
> artifacts, based on a hash of all the metadata that went into building
> that artifact - and that metadata includes all dependencies (e.g. the
> compiler used etc.). That normally works quite well; usually a clean
> build (without using any sstate cache) done by one developer ends up
> being binary identical to a build done on another host. However, one
> thing that can cause two developers to end up with different builds
> [and thus make one's vmlinux package incompatible with the other's
> kernel-dev package], which is not captured by the metadata hashing, is
> this `git describe`: The output of that can be affected by
>
> (1) git version: before 2.11 git defaulted to a minimum of 7, since
> 2.11 (git.git commit e6c587) the default is dynamic based on the
> number of objects in the repo
> (2) hence even if both run the same git version, the output can differ
> based on how many remotes are being tracked (or just lots of local
> development branches or plain old garbage)
> (3) and of course somebody could have a core.abbrev config setting in
> ~/.gitconfig
>
> So in order to avoid `uname -a` output relying on such random details
> of the build environment which are rather hard to ensure are
> consistent between developers and buildbots, use an explicit
> --abbrev=15 option (and for consistency, also use rev-parse --short=15
> for the unlikely case of no signed tags being usable).
>
> Now, why is 60 bits enough for everyone? It's not mathematically
> guaranteed that git won't have to use 16 in some git repo, but it is
> beyond unlikely: Even in a repo with 100M objects, the probability
> that any given commit (i.e. the one being described) clashes with some
> other object in the first 15 hex chars is less than 1e-10, and
> currently a git repo tracking Linus', -stable and -rt only has around
> 10M objects.


I agree that any randomness should be avoided.

My question is, do we need 15-digits?


The kernelrelease is formed by
[kernel version] + [some digits of git hash].


For example, "git describe" shows as follows:

v5.9.0-rc4-00034-g7fe10096c150


Linus gives a new tag every week (or every two week).


So, I think the conflict happens
only when we have two commits that start with the same 7-digits
in the _same_ release. Is this correct?

We have 14000 - 15000 commits in each release,
not 100M.





>
> Signed-off-by: Rasmus Villemoes 
> ---
> I could probably fix things by adding a 'git config --local
> core.abbrev 15' step to the Yocto build process after the repo to
> build from has been cloned but before building has started. But in the
> interest of binary reproducibility outside of just Yocto builds, I
> think it's better if this lives in the kernel.
>
>  scripts/setlocalversion | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/scripts/setlocalversion b/scripts/setlocalversion
> index 20f2efd57b11..c5262f0d953d 100755
> --- a/scripts/setlocalversion
> +++ b/scripts/setlocalversion
> @@ -45,7 +45,7 @@ scm_version()
>
> # Check for git and a git repo.
> if test -z "$(git rev-parse --show-cdup 2>/dev/null)" &&
> -  head=$(git rev-parse --verify --short HEAD 2>/dev/null); then
> +  head=$(git rev-parse --verify --short=15 HEAD 2>/dev/null); then
>
> # If we are at a tagged commit (like "v2.6.30-rc6"), we ignore
> # it, because this version is defined in the top level 
> Makefile.
> @@ -59,7 +59,7 @@ scm_version()
> fi
> # If we are past a tagged commit (like
> # "v2.6.30-rc5-302-g72357d5"), we pretty print it.
> -   if atag="$(git describe 2>/dev/null)"; then
> +   if atag="$(git describe --abbrev=15 2>/dev/null)"; 
> then
> echo "$atag" | awk -F- '{printf("-%05d-%s", 
> $(NF-1),$(NF))}'
>
> # If we don't have a tag at all we print 
> -g{commitish}.
> --
> 2.23.0
>


-- 
Best Regards
Masahiro Yamada


Re: [PATCH] scripts/setlocalversion: make git describe output more reliable

2020-09-10 Thread Brian Norris
On Thu, Sep 10, 2020 at 7:35 AM Masahiro Yamada  wrote:
> On Thu, Sep 10, 2020 at 8:57 PM Rasmus Villemoes
>  wrote:
> > So in order to avoid `uname -a` output relying on such random details
> > of the build environment which are rather hard to ensure are
> > consistent between developers and buildbots, use an explicit
> > --abbrev=15 option (and for consistency, also use rev-parse --short=15
> > for the unlikely case of no signed tags being usable).

For the patch:

Reviewed-by: Brian Norris 

> I agree that any randomness should be avoided.
>
> My question is, do we need 15-digits?
...
> So, I think the conflict happens
> only when we have two commits that start with the same 7-digits
> in the _same_ release. Is this correct?

For the rev-parse usage ("unlikely case where we have no signed tag"),
the total number of objects is definitely relevant, and the man-page
says:
"unique prefix with at least length characters"
i.e., it might be longer, if needed for uniqueness.

For git-describe (the case where we have a tag to base off):
"use  digits, or as many digits as needed to form a unique object name"
I'm not quite sure whether the uniqueness is including anything about
the tag-relative prefix, but if not, then we have the same problem.

At least, that's my reading of the situation.

Brian


Re: [PATCH] scripts/setlocalversion: make git describe output more reliable

2020-09-10 Thread Guenter Roeck
On 9/10/20 4:26 AM, Rasmus Villemoes wrote:
> When building for an embedded target using Yocto, we're sometimes
> observing that the version string that gets built into vmlinux (and
> thus what uname -a reports) differs from the path under /lib/modules/
> where modules get installed in the rootfs, but only in the length of
> the -gabc123def suffix. Hence modprobe always fails.
> 
> The problem is that Yocto has the concept of "sstate" (shared state),
> which allows different developers/buildbots/etc. to share build
> artifacts, based on a hash of all the metadata that went into building
> that artifact - and that metadata includes all dependencies (e.g. the
> compiler used etc.). That normally works quite well; usually a clean
> build (without using any sstate cache) done by one developer ends up
> being binary identical to a build done on another host. However, one
> thing that can cause two developers to end up with different builds
> [and thus make one's vmlinux package incompatible with the other's
> kernel-dev package], which is not captured by the metadata hashing, is
> this `git describe`: The output of that can be affected by
> 
> (1) git version: before 2.11 git defaulted to a minimum of 7, since
> 2.11 (git.git commit e6c587) the default is dynamic based on the
> number of objects in the repo
> (2) hence even if both run the same git version, the output can differ
> based on how many remotes are being tracked (or just lots of local
> development branches or plain old garbage)
> (3) and of course somebody could have a core.abbrev config setting in
> ~/.gitconfig
> 
> So in order to avoid `uname -a` output relying on such random details
> of the build environment which are rather hard to ensure are
> consistent between developers and buildbots, use an explicit
> --abbrev=15 option (and for consistency, also use rev-parse --short=15
> for the unlikely case of no signed tags being usable).
> 
> Now, why is 60 bits enough for everyone? It's not mathematically
> guaranteed that git won't have to use 16 in some git repo, but it is
> beyond unlikely: Even in a repo with 100M objects, the probability
> that any given commit (i.e. the one being described) clashes with some
> other object in the first 15 hex chars is less than 1e-10, and
> currently a git repo tracking Linus', -stable and -rt only has around
> 10M objects.
> 
> Signed-off-by: Rasmus Villemoes 

Makes sense to me.

Acked-by: Guenter Roeck 

> ---
> I could probably fix things by adding a 'git config --local
> core.abbrev 15' step to the Yocto build process after the repo to
> build from has been cloned but before building has started. But in the
> interest of binary reproducibility outside of just Yocto builds, I
> think it's better if this lives in the kernel.
> 
>  scripts/setlocalversion | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/scripts/setlocalversion b/scripts/setlocalversion
> index 20f2efd57b11..c5262f0d953d 100755
> --- a/scripts/setlocalversion
> +++ b/scripts/setlocalversion
> @@ -45,7 +45,7 @@ scm_version()
>  
>   # Check for git and a git repo.
>   if test -z "$(git rev-parse --show-cdup 2>/dev/null)" &&
> -head=$(git rev-parse --verify --short HEAD 2>/dev/null); then
> +head=$(git rev-parse --verify --short=15 HEAD 2>/dev/null); then
>  
>   # If we are at a tagged commit (like "v2.6.30-rc6"), we ignore
>   # it, because this version is defined in the top level Makefile.
> @@ -59,7 +59,7 @@ scm_version()
>   fi
>   # If we are past a tagged commit (like
>   # "v2.6.30-rc5-302-g72357d5"), we pretty print it.
> - if atag="$(git describe 2>/dev/null)"; then
> + if atag="$(git describe --abbrev=15 2>/dev/null)"; then
>   echo "$atag" | awk -F- '{printf("-%05d-%s", 
> $(NF-1),$(NF))}'
>  
>   # If we don't have a tag at all we print -g{commitish}.
>