Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
Linus Torvalds writes: > On Fri, Sep 30, 2016 at 10:54 AM, Linus Torvalds > wrote: >> >>> So IMHO, the best combination is the init_default_abbrev() you posted in >>> [1], but initialized at the top of find_unique_abbrev(). And cached >>> there, obviously, in a similar way. >> >> That's certainly possible, but I'm really not happy with how the >> counting function looks. And nobody actually stood up to say "yeah, >> that gets alternate loose objects right" or "if you have tons of those >> alternate loose objects you have other issues anyway". I think >> somebody would have to "own" that counting function, the advantage of >> just putting it into disambiguate_state is that we just get the >> counting for free.. > > Side note: maybe we can mix the two approaches, and keep the counting > in the disambiguation state, and just make the counting function do > > init_object_disambiguation(); > find_short_object_filename(&ds); > find_short_packed_object(&ds); > finish_object_disambiguation(&ds, sha1); > > and then just use "ds.nrobjects". So the counting would still be done > by the disambiguation code, it just woudln't be in get_short_sha1(). > > So here's another version that takes that approach. And if somebody > (hint hint) wants to do the counting differently, they can perhaps > send an incremental patch to do that. > > (This patch also contains the few setup issues Junio found with the > new "default_abbrev is negative" model) Sorry, but I do not quite see the point in the difference between this one and your original that had a hook in get_short_sha1(), as it seemed to me that Peff's objection was about the counting done in find_short_object_filename() and find_short_packed_object(), which is (understandably) still here.
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
On Fri, Sep 30, 2016 at 3:01 AM, Linus Torvalds wrote: > On Thu, Sep 29, 2016 at 5:56 PM, Mike Hommey wrote: >> >> OTOH, how often does one refer to trees or blobs with abbreviated sha1s? >> Most of the time, you'd use abbreviated sha1s for commits. And the number >> of commits in git and the kernel repositories are much lower than the >> number of overall objects. > > See that whole other discussion about this. I agree. If we only ever > worried about just commits, the abbreviation length wouldn't need to > be grown nearly as aggressively. The current default would still be > wrong for the kernel, but it wouldn't be as noticeably wrong, and > updating it to 8 or 9 would be fine. > > That said, people argued against that too. We *do* end up having > abbreviated SHA1's for blobs in the diff index. When I said that _I_ > neer use it, somebody piped up to say that they do. > > So I'd rather just keep the existing semantics (a hash is a hash is a > hash), and just abbreviate at a sufficient point that we don't have to > worry too much about disambiguating further by object type. I work on a repo that's around the size of linux.git in every way (commits, objects etc.), and growing twice as fast. So I also see 8 or 9 digit abbreviations on a daily basis, even with the current defaults core.abbrev, but I still think growing it so aggressively is the wrong thing to do. The fact that we have a core.abbrev option at all and nobody's talking about getting rid of it entirely means we all acknowledge the UX convenience of short SHA1s. I don't think it's a good idea for such UX options to have defaults that really only make sense for repositories at the very far end of the bell curve, which is the case with linux.git and the repo I work on. Either way you're going to waste somebody's time. I think it's a better trade-off that some kernel dev occasionally has to look at Peff's new disambiguation output, than have the wast hordes of everyday Git users have less screen real estate, need to recite longer sha1s over the phone during outages (people do that), and any number of other every day use cases. I think if anything we should be talking about making the default shorter & then have some clever auto-scaling by repository size as has been discussed in this thread to deal with the repositories at the far end of the bell curve.
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
Linus Torvalds writes: > Considering that TRANSPORT_SUMMARY and TRANSPORT_SUMMARY_WIDTH are > both used in exactly one place each, I'd suggest getting rid of that > crazy macro, and just expanding it in those places to avoid these > kinds of crazy "hiding variables inside complex defines thning". > > And maybe just deciding to hardcode TRANSPORT_SUMMARY_WIDTH to 17 > (which was it's original default value and presumably is what the test > is effectively hardcoded for too), and avoiding that complexity > entirely. For all fairness, when the WIDTH thing was introduced, there were two places that needed reference it at f1863d0d16 ("refactor duplicated code in builtin-send-pack.c and transport.c", 2010-02-16). But that is no longer the case, and it makes sense to hardcode it as 17 (or something derived from a symbolic constant that gives the new "default to default"). What TRANSPORT_SUMMARY() does is even more crazy and it really shouldn't be exposed as a public interface. Let's move it to its single calling place.
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
On Fri, Sep 30, 2016 at 11:40 AM, Junio C Hamano wrote: > > There is another instance buried deep in an obscure macro. A > minimum fix may look like this, but I really hope somebody else > finds a better approach. Heh. Yeah, that's just ugly. I assume this is why the odd git fetch pretty-printing test was off by one column.. Considering that TRANSPORT_SUMMARY and TRANSPORT_SUMMARY_WIDTH are both used in exactly one place each, I'd suggest getting rid of that crazy macro, and just expanding it in those places to avoid these kinds of crazy "hiding variables inside complex defines thning". And maybe just deciding to hardcode TRANSPORT_SUMMARY_WIDTH to 17 (which was it's original default value and presumably is what the test is effectively hardcoded for too), and avoiding that complexity entirely. Linus
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
Junio C Hamano writes: > From: Junio C Hamano > Date: Thu, 29 Sep 2016 21:19:20 -0700 > Subject: [PATCH] abbrev: adjust to the new world order > > The default_abbrev used to be a concrete value usable as the default > abbreviation length. The code that sets custom abbreviation length, > in response to command line argument, often did something like: > > if (skip_prefix(arg, "--abbrev=", &arg)) > abbrev = atoi(arg); > else if (!strcmp("--abbrev", &arg)) > abbrev = DEFAULT_ABBREV; > /* make the value sane */ > if (abbrev < 0 || 40 < abbrev) > abbrev = ... some sane value ... > > The new world order however is that the default_abbrev is a negative > value that signals find_unique_abbrev() that it needs to dynamically > find out a good default value. We shouldn't coerce a negative value > into a random positive value like the above sample code. > > Signed-off-by: Junio C Hamano There is another instance buried deep in an obscure macro. A minimum fix may look like this, but I really hope somebody else finds a better approach. Peff alluded to "when it is still -1 substituting it with a reasonable value like 7" in a separate thread, and we probably would want a way to allow accessing that "reasonable value like 7" without triggering auto sizing logic too early. With this and the patch in the message I am responding to, your patch from the last night seems to pass all the tests for me. transport.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/transport.h b/transport.h index 6fe3485325..8a96e22bb0 100644 --- a/transport.h +++ b/transport.h @@ -142,7 +142,7 @@ struct transport { #define TRANSPORT_PUSH_ATOMIC 8192 #define TRANSPORT_PUSH_OPTIONS 16384 -#define TRANSPORT_SUMMARY_WIDTH (2 * DEFAULT_ABBREV + 3) +#define TRANSPORT_SUMMARY_WIDTH (2 * (DEFAULT_ABBREV < 0 ? 7 : DEFAULT_ABBREV) + 3) #define TRANSPORT_SUMMARY(x) (int)(TRANSPORT_SUMMARY_WIDTH + strlen(x) - gettext_width(x)), (x) /* Returns a transport suitable for the url */
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
On Fri, Sep 30, 2016 at 10:54 AM, Linus Torvalds wrote: > >> So IMHO, the best combination is the init_default_abbrev() you posted in >> [1], but initialized at the top of find_unique_abbrev(). And cached >> there, obviously, in a similar way. > > That's certainly possible, but I'm really not happy with how the > counting function looks. And nobody actually stood up to say "yeah, > that gets alternate loose objects right" or "if you have tons of those > alternate loose objects you have other issues anyway". I think > somebody would have to "own" that counting function, the advantage of > just putting it into disambiguate_state is that we just get the > counting for free.. Side note: maybe we can mix the two approaches, and keep the counting in the disambiguation state, and just make the counting function do init_object_disambiguation(); find_short_object_filename(&ds); find_short_packed_object(&ds); finish_object_disambiguation(&ds, sha1); and then just use "ds.nrobjects". So the counting would still be done by the disambiguation code, it just woudln't be in get_short_sha1(). So here's another version that takes that approach. And if somebody (hint hint) wants to do the counting differently, they can perhaps send an incremental patch to do that. (This patch also contains the few setup issues Junio found with the new "default_abbrev is negative" model) Linus builtin/rev-parse.c | 5 +++-- diff.c | 2 +- environment.c | 2 +- sha1_name.c | 39 ++- 4 files changed, 43 insertions(+), 5 deletions(-) diff --git a/builtin/rev-parse.c b/builtin/rev-parse.c index 4da1f1da2..cfb0f1510 100644 --- a/builtin/rev-parse.c +++ b/builtin/rev-parse.c @@ -671,8 +671,9 @@ int cmd_rev_parse(int argc, const char **argv, const char *prefix) filter &= ~(DO_FLAGS|DO_NOREV); verify = 1; abbrev = DEFAULT_ABBREV; - if (arg[7] == '=') - abbrev = strtoul(arg + 8, NULL, 10); + if (!arg[7]) + continue; + abbrev = strtoul(arg + 8, NULL, 10); if (abbrev < MINIMUM_ABBREV) abbrev = MINIMUM_ABBREV; else if (40 <= abbrev) diff --git a/diff.c b/diff.c index 59920747d..c6d445915 100644 --- a/diff.c +++ b/diff.c @@ -3421,7 +3421,7 @@ void diff_setup_done(struct diff_options *options) */ read_cache(); } - if (options->abbrev <= 0 || 40 < options->abbrev) + if (options->abbrev > 40) options->abbrev = 40; /* full */ /* diff --git a/environment.c b/environment.c index c1442df9a..fd6681e46 100644 --- a/environment.c +++ b/environment.c @@ -16,7 +16,7 @@ int trust_executable_bit = 1; int trust_ctime = 1; int check_stat = 1; int has_symlinks = 1; -int minimum_abbrev = 4, default_abbrev = 7; +int minimum_abbrev = 4, default_abbrev = -1; int ignore_case; int assume_unchanged; int prefer_symlink_refs; diff --git a/sha1_name.c b/sha1_name.c index 3b647fd7c..684b36dba 100644 --- a/sha1_name.c +++ b/sha1_name.c @@ -15,6 +15,7 @@ typedef int (*disambiguate_hint_fn)(const unsigned char *, void *); struct disambiguate_state { int len; /* length of prefix in hex chars */ + unsigned int nrobjects; char hex_pfx[GIT_SHA1_HEXSZ + 1]; unsigned char bin_pfx[GIT_SHA1_RAWSZ]; @@ -118,6 +119,12 @@ static void find_short_object_filename(struct disambiguate_state *ds) if (strlen(de->d_name) != 38) continue; + + // We only look at the one subdirectory, and we assume + // each subdirectory is roughly similar, so each object + // we find probably has 255 other objects in the other + // fan-out directories + ds->nrobjects += 256; if (memcmp(de->d_name, ds->hex_pfx + 2, ds->len - 2)) continue; memcpy(hex + 2, de->d_name, 38); @@ -151,6 +158,7 @@ static void unique_in_pack(struct packed_git *p, open_pack_index(p); num = p->num_objects; + ds->nrobjects += num; last = num; while (first < last) { uint32_t mid = (first + last) / 2; @@ -455,17 +463,46 @@ int for_each_abbrev(const char *prefix, each_abbrev_fn fn, void *cb_data) return ret; } +static int get_automatic_abbrev(const char *hex) +{ + static int len; + struct disambiguate_state ds; + + if (init_object_disambiguation(hex, 7, &ds) < 0) + return 7; + + find_short_object_filename(&d
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
On Fri, Sep 30, 2016 at 10:54:16AM -0700, Linus Torvalds wrote: > On Fri, Sep 30, 2016 at 1:06 AM, Jeff King wrote: > > > > I agree that this deals with the performance concerns by caching the > > default_abbrev_len and starting there. I still think it's unnecessarily > > invasive to touch get_short_sha1() at all, which is otherwise only a > > reading function. > > So the reason that d oesn't work is that the "disambiguate_state" data > where we keep the number of objects is only visible within > get_short_sha1(). > > So outside that function, you don't have any sane way to figure out > how many objects. So then you have to do the extra counting function.. Right. I think you should do the extra counting function. It's a few more lines, but the design is way less tangled. > > So IMHO, the best combination is the init_default_abbrev() you posted in > > [1], but initialized at the top of find_unique_abbrev(). And cached > > there, obviously, in a similar way. > > That's certainly possible, but I'm really not happy with how the > counting function looks. And nobody actually stood up to say "yeah, > that gets alternate loose objects right" or "if you have tons of those > alternate loose objects you have other issues anyway". I think > somebody would have to "own" that counting function, the advantage of > just putting it into disambiguate_state is that we just get the > counting for free.. I don't think you _need_ get the alternate loose objects right. In fact, I don't think you need to care about loose objects at all. For the scales we're talking about, they're a rounding error. I would have done it like this: diff --git a/sha1_file.c b/sha1_file.c index 65deaf9..1845502 100644 --- a/sha1_file.c +++ b/sha1_file.c @@ -1382,6 +1382,32 @@ static void prepare_packed_git_one(char *objdir, int local) strbuf_release(&path); } +static int approximate_object_count_valid; + +/* + * Give a fast, rough count of the number of objects in the repository. This + * ignores loose objects completely. If you have a lot of them, then either + * you should repack because your performance will be awful, or they are + * all unreachable objects about to be pruned, in which case they're not really + * interesting as a measure of repo size in the first place. + */ +unsigned long approximate_object_count(void) +{ + static unsigned long count; + if (!approximate_object_count_valid) { + struct packed_git *p; + + prepare_packed_git(); + count = 0; + for (p = packed_git; p; p = p->next) { + if (open_pack_index(p)) + continue; + count += p->num_objects; + } + } + return count; +} + static void *get_next_packed_git(const void *p) { return ((const struct packed_git *)p)->next; @@ -1456,6 +1482,7 @@ void prepare_packed_git(void) void reprepare_packed_git(void) { + approximate_object_count_valid = 0; prepare_packed_git_run_once = 0; prepare_packed_git(); }
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
Jeff King writes: > I agree that this deals with the performance concerns by caching the > default_abbrev_len and starting there. I still think it's unnecessarily > invasive to touch get_short_sha1() at all, which is otherwise only a > reading function. > > So IMHO, the best combination is the init_default_abbrev() you posted in > [1], but initialized at the top of find_unique_abbrev(). And cached > there, obviously, in a similar way. Hmm. I am undecided; both approaches look OK to me.
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
On Fri, Sep 30, 2016 at 1:06 AM, Jeff King wrote: > > I agree that this deals with the performance concerns by caching the > default_abbrev_len and starting there. I still think it's unnecessarily > invasive to touch get_short_sha1() at all, which is otherwise only a > reading function. So the reason that d oesn't work is that the "disambiguate_state" data where we keep the number of objects is only visible within get_short_sha1(). So outside that function, you don't have any sane way to figure out how many objects. So then you have to do the extra counting function.. > So IMHO, the best combination is the init_default_abbrev() you posted in > [1], but initialized at the top of find_unique_abbrev(). And cached > there, obviously, in a similar way. That's certainly possible, but I'm really not happy with how the counting function looks. And nobody actually stood up to say "yeah, that gets alternate loose objects right" or "if you have tons of those alternate loose objects you have other issues anyway". I think somebody would have to "own" that counting function, the advantage of just putting it into disambiguate_state is that we just get the counting for free.. Linus
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
On Thu, Sep 29, 2016 at 06:18:03PM -0700, Linus Torvalds wrote: > On Thu, Sep 29, 2016 at 5:57 PM, Linus Torvalds > wrote: > > > > Actually, all the other cases seem to be "parse a SHA1 with a known > > length", so they really don't have a negative length. So this seems > > ok, and is easier to verify than the "what all contexts might use > > DEFAULT_ABBREV" thing. There's only a few callers, and it's a static > > function so it's easy to check it locally in sha1_name.c. > > Here's my original patch with just a tiny change that instead of > starting the automatic guessing at 7 each time, it starts at > "default_automatic_abbrev", which is initialized to 7. > > The difference is that if we decide that "oh, that was too small, need > to repeat", we also update that "default_automatic_abbrev" value, so > that we won't start at the number that we now know was too small. > > So it still loops over the abbrev values, but now it only loops a > couple of times. > > I actually verified the performance impact by doing > > time git rev-list --abbrev-commit HEAD > /dev/null > > on the kernel git tree, and it does actually matter. With my original > patch, we wasted a noticeable amount of time on just the extra > looping, with this it's down to the same performance as just doing it > once at init time (it's about 12s vs 9s on my laptop). I agree that this deals with the performance concerns by caching the default_abbrev_len and starting there. I still think it's unnecessarily invasive to touch get_short_sha1() at all, which is otherwise only a reading function. So IMHO, the best combination is the init_default_abbrev() you posted in [1], but initialized at the top of find_unique_abbrev(). And cached there, obviously, in a similar way. -Peff [1] http://public-inbox.org/git/CA+55aFyVEQ+8TBBUm5KG9APtd9wy8cp_mRO=3nj12dxznla...@mail.gmail.com/
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
On Thu, Sep 29, 2016 at 04:13:38PM -0700, Junio C Hamano wrote: > There are very early ones in the program startup sequence in the > following functions, but I do not think of a reason why our new and > early call to prepare_packed_git() might be problematic, given that > all of them require us to have an access to the repository (i.e. > this change cannot introduce a regression where a command used to > work outside a repository but barf when prepare_packed_git() is > called early): > > - builtin/describe.c > - builtin/rev-list.c > - builtin/rev-parse.c > > I thought that the one in diff.c might be problematic when the "git > diff" command is run outside a repository with the "--no-index" > option, but it appears that init_default_abbrev() seems to be OK > when run outside a repository. Actually, "diff --no-index" is currently buggy in this regard. In the followup series to jk/setup-sequence-update (which I mentioned but haven't posted yet), I teach get_object_dir() not to blindly default to ".git", and found that "diff --no-index" is perfectly happy to look in ".git/objects" for find_unique_abbrev(), even when we know there's no repository (or it has an unknown vintage). I fixed it there by just using the default abbrev value for out-of-repo diffs, and skip calling find_unique_abbrev() at all. That would here, too. But if we add object-store initialization at other times, it's a potential conflict. IMHO this should stay inside find_unique_abbrev(), where we know we already must look at the object store. -Peff
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
Junio C Hamano writes: > Junio C Hamano writes: > >> There still are breakages seen in t5510 and t5526 that are about the >> verbose output of "git fetch". I'll stop digging at this point >> tonight, and welcome others who look into it ;-) > > OK, just before I leave the keyboard for the night... > > -- >8 -- > From: Junio C Hamano > Date: Thu, 29 Sep 2016 21:19:20 -0700 > Subject: [PATCH] abbrev: adjust to the new world order To those who are following from sidelines, this builds on Linus's third iteration patch (which is based on his first patch), applied on Peff's "give disambiguation help when giving an ambiguity error" series. I didn't merge the work-in-progress going back and forth between Linus and I tonight to any of the integration branches, but it is available as lt/abbrev-auto-2 branch of the "broken down" repository, i.e. git://github.com/gitster/git.git lt/abbrev-auto-2
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
On Thu, Sep 29, 2016 at 9:18 PM, Linus Torvalds wrote: > > There are probably other things like that. t5510-fetch.sh fails oddly, looks like the output is off by one character. not ok 77 - fetch aligned output It has a magic "cut -c 22-" that expects the output at a specific place, and now it's at column 21 instead of column 22. Strange test, but it still seems to be aligned, just in a different column. But clearly something changed. Linus
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
Junio C Hamano writes: > There still are breakages seen in t5510 and t5526 that are about the > verbose output of "git fetch". I'll stop digging at this point > tonight, and welcome others who look into it ;-) OK, just before I leave the keyboard for the night... -- >8 -- From: Junio C Hamano Date: Thu, 29 Sep 2016 21:19:20 -0700 Subject: [PATCH] abbrev: adjust to the new world order The default_abbrev used to be a concrete value usable as the default abbreviation length. The code that sets custom abbreviation length, in response to command line argument, often did something like: if (skip_prefix(arg, "--abbrev=", &arg)) abbrev = atoi(arg); else if (!strcmp("--abbrev", &arg)) abbrev = DEFAULT_ABBREV; /* make the value sane */ if (abbrev < 0 || 40 < abbrev) abbrev = ... some sane value ... The new world order however is that the default_abbrev is a negative value that signals find_unique_abbrev() that it needs to dynamically find out a good default value. We shouldn't coerce a negative value into a random positive value like the above sample code. Signed-off-by: Junio C Hamano --- builtin/rev-parse.c | 5 +++-- diff.c | 2 +- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/builtin/rev-parse.c b/builtin/rev-parse.c index 76cf05e2ad..17cbfabdde 100644 --- a/builtin/rev-parse.c +++ b/builtin/rev-parse.c @@ -643,8 +643,9 @@ int cmd_rev_parse(int argc, const char **argv, const char *prefix) filter &= ~(DO_FLAGS|DO_NOREV); verify = 1; abbrev = DEFAULT_ABBREV; - if (arg[7] == '=') - abbrev = strtoul(arg + 8, NULL, 10); + if (!arg[7]) + continue; + abbrev = strtoul(arg + 8, NULL, 10); if (abbrev < MINIMUM_ABBREV) abbrev = MINIMUM_ABBREV; else if (40 <= abbrev) diff --git a/diff.c b/diff.c index c6da383c56..cefc13eb8e 100644 --- a/diff.c +++ b/diff.c @@ -3399,7 +3399,7 @@ void diff_setup_done(struct diff_options *options) */ read_cache(); } - if (options->abbrev <= 0 || 40 < options->abbrev) + if (40 < options->abbrev) options->abbrev = 40; /* full */ /* -- 2.10.0-612-g22341905f2
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
On Thu, Sep 29, 2016 at 9:10 PM, Junio C Hamano wrote: > > A quick and dirty fix for it may look like this. Crossed emails. Indeed, I just solved the builtin/rev-parse.c thing slightly differently. And you found another failure in the diff code similarly not liking the negative DEFAULT_ABBREV. There are probably other things like that. Linus
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
On Thu, Sep 29, 2016 at 8:54 PM, Junio C Hamano wrote: > > * The script uses "git rev-parse --short HEAD"; I suspect that it >says "ah, default_abbrev is -1 and minimum_abbrev is 4, so let's >try abbreviating to 4 hexdigits". Ahh, right you are. The logic there is abbrev = DEFAULT_ABBREV; if (arg[7] == '=') abbrev = strtoul(arg + 8, NULL, 10); if (abbrev < MINIMUM_ABBREV) abbrev = MINIMUM_ABBREV; which now does something different than what it used to do because DEFAULT_ABBREV is -1. Putting the "sanity-check the abbrev range" tests inside the "if()" statement that does strtoul() should fix it. Let me test... [ short time passes ] Yup. Incremental patch for that single issue attached. I made it do an early "continue" instead of adding another level on indentation. Linus builtin/rev-parse.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/builtin/rev-parse.c b/builtin/rev-parse.c index 4da1f1da2..cfb0f1510 100644 --- a/builtin/rev-parse.c +++ b/builtin/rev-parse.c @@ -671,8 +671,9 @@ int cmd_rev_parse(int argc, const char **argv, const char *prefix) filter &= ~(DO_FLAGS|DO_NOREV); verify = 1; abbrev = DEFAULT_ABBREV; - if (arg[7] == '=') - abbrev = strtoul(arg + 8, NULL, 10); + if (!arg[7]) + continue; + abbrev = strtoul(arg + 8, NULL, 10); if (abbrev < MINIMUM_ABBREV) abbrev = MINIMUM_ABBREV; else if (40 <= abbrev)
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
Junio C Hamano writes: > Linus Torvalds writes: > >> So this patch may actually be "production ready" apart from the fact >> that some tests still fail (at least t2027-worktree-list.sh) because >> of different short SHA1 cases. > > t2027 has at least two problems. > > * "git worktree" does not read the core.abbrev configuration, >without a recent fix in jc/worktree-config, i.e. d49028e6 >("worktree: honor configuration variables", 2016-09-26). > > * The script uses "git rev-parse --short HEAD"; I suspect that it >says "ah, default_abbrev is -1 and minimum_abbrev is 4, so let's >try abbreviating to 4 hexdigits". > > The first failure in t3203 seems to come from the same issue in > "rev-parse --short". A quick and dirty fix for it may look like this. We leave the variable abbrev to DEFAULT_ABBREV and let find_unique_abbrev() react to "eh, -1? I need to do the auto-scaling". "git diff-tree --abbrev" seems to have a similar problem, and the fix is the same. There still are breakages seen in t5510 and t5526 that are about the verbose output of "git fetch". I'll stop digging at this point tonight, and welcome others who look into it ;-) builtin/rev-parse.c | 14 -- diff.c | 2 +- 2 files changed, 9 insertions(+), 7 deletions(-) diff --git a/builtin/rev-parse.c b/builtin/rev-parse.c index 76cf05e2ad..f8c8c6c22e 100644 --- a/builtin/rev-parse.c +++ b/builtin/rev-parse.c @@ -642,13 +642,15 @@ int cmd_rev_parse(int argc, const char **argv, const char *prefix) starts_with(arg, "--short=")) { filter &= ~(DO_FLAGS|DO_NOREV); verify = 1; - abbrev = DEFAULT_ABBREV; - if (arg[7] == '=') + if (arg[7] != '=') { + abbrev = DEFAULT_ABBREV; + } else { abbrev = strtoul(arg + 8, NULL, 10); - if (abbrev < MINIMUM_ABBREV) - abbrev = MINIMUM_ABBREV; - else if (40 <= abbrev) - abbrev = 40; + if (abbrev < MINIMUM_ABBREV) + abbrev = MINIMUM_ABBREV; + else if (40 <= abbrev) + abbrev = 40; + } continue; } if (!strcmp(arg, "--sq")) { diff --git a/diff.c b/diff.c index c6da383c56..cefc13eb8e 100644 --- a/diff.c +++ b/diff.c @@ -3399,7 +3399,7 @@ void diff_setup_done(struct diff_options *options) */ read_cache(); } - if (options->abbrev <= 0 || 40 < options->abbrev) + if (40 < options->abbrev) options->abbrev = 40; /* full */ /*
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
Linus Torvalds writes: > So this patch may actually be "production ready" apart from the fact > that some tests still fail (at least t2027-worktree-list.sh) because > of different short SHA1 cases. t2027 has at least two problems. * "git worktree" does not read the core.abbrev configuration, without a recent fix in jc/worktree-config, i.e. d49028e6 ("worktree: honor configuration variables", 2016-09-26). * The script uses "git rev-parse --short HEAD"; I suspect that it says "ah, default_abbrev is -1 and minimum_abbrev is 4, so let's try abbreviating to 4 hexdigits". The first failure in t3203 seems to come from the same issue in "rev-parse --short".
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
On Thu, Sep 29, 2016 at 12:06:23PM -0700, Linus Torvalds wrote: > On Thu, Sep 29, 2016 at 11:55 AM, Linus Torvalds > wrote: > > > > For the kernel, just the *math* right now actually gives 12 > > characters. For current git it actually seems to say that 8 is the > > correct number. For small projects, you'll still see 7. > > Sorry, the git number is 9, not 8. The reason is that git has roughly > 212k objects, and 9 hex digits gets expected collisions at about 256k > objects. > > So the logic means that we'll see 7 hex digits for projects with less > than 16k objects, 8 hex digits if there are less than 64k objects, and > 9 hex digits for projects like git that currently have fewer than 256k > objects. > > But git itself might not be *that* far from going to 10 hex digits > with my patch. > > The kernel uses 12 he digits because the collision math says that's > the right thing for a project with between 4M and 16M objects (with > the kernel being at 5M). OTOH, how often does one refer to trees or blobs with abbreviated sha1s? Most of the time, you'd use abbreviated sha1s for commits. And the number of commits in git and the kernel repositories are much lower than the number of overall objects. rev-list --all --count on the git repo gives me 46790. On the kernel, it gives 618078. Now, the interesting thing is looking at the *actual* collisions in those spaces. At 9 digits, there's only one commit collision in the kernel repo: 45f014c5264f5e68ef0e51b36f4ef5ede3d18397 45f014c52eef022873b19d6a20eb0ec9668f2b09 And two commit collisions at 8 digits in the git repo: 1536dd9c1df0b7167b139f080cc4774ef63f 1536dd9c61b5582cf07057cb715dd6dc6620 2e6e3e82ee36b3e1bec1db8db24817270080424e 2e6e3e829f3759823d70e7af511bc04cd05ad0af At 7 digits, there are 5 actual commit collisions in the git repo and 718 in the kernel repo only one of those collisions involve more than 2 commits. Mike
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
On Thu, Sep 29, 2016 at 5:57 PM, Linus Torvalds wrote: > > Actually, all the other cases seem to be "parse a SHA1 with a known > length", so they really don't have a negative length. So this seems > ok, and is easier to verify than the "what all contexts might use > DEFAULT_ABBREV" thing. There's only a few callers, and it's a static > function so it's easy to check it locally in sha1_name.c. Here's my original patch with just a tiny change that instead of starting the automatic guessing at 7 each time, it starts at "default_automatic_abbrev", which is initialized to 7. The difference is that if we decide that "oh, that was too small, need to repeat", we also update that "default_automatic_abbrev" value, so that we won't start at the number that we now know was too small. So it still loops over the abbrev values, but now it only loops a couple of times. I actually verified the performance impact by doing time git rev-list --abbrev-commit HEAD > /dev/null on the kernel git tree, and it does actually matter. With my original patch, we wasted a noticeable amount of time on just the extra looping, with this it's down to the same performance as just doing it once at init time (it's about 12s vs 9s on my laptop). So this patch may actually be "production ready" apart from the fact that some tests still fail (at least t2027-worktree-list.sh) because of different short SHA1 cases. Linus cache.h | 1 + environment.c | 2 +- sha1_name.c | 26 +- 3 files changed, 27 insertions(+), 2 deletions(-) diff --git a/cache.h b/cache.h index 6e33f2f28..d2da6d186 100644 --- a/cache.h +++ b/cache.h @@ -1207,6 +1207,7 @@ struct object_context { #define GET_SHA1_TREEISH 020 #define GET_SHA1_BLOB 040 #define GET_SHA1_FOLLOW_SYMLINKS 0100 +#define GET_SHA1_AUTOMATIC 0200 #define GET_SHA1_ONLY_TO_DIE04000 #define GET_SHA1_DISAMBIGUATORS \ diff --git a/environment.c b/environment.c index c1442df9a..fd6681e46 100644 --- a/environment.c +++ b/environment.c @@ -16,7 +16,7 @@ int trust_executable_bit = 1; int trust_ctime = 1; int check_stat = 1; int has_symlinks = 1; -int minimum_abbrev = 4, default_abbrev = 7; +int minimum_abbrev = 4, default_abbrev = -1; int ignore_case; int assume_unchanged; int prefer_symlink_refs; diff --git a/sha1_name.c b/sha1_name.c index 3b647fd7c..1003c96ea 100644 --- a/sha1_name.c +++ b/sha1_name.c @@ -15,6 +15,7 @@ typedef int (*disambiguate_hint_fn)(const unsigned char *, void *); struct disambiguate_state { int len; /* length of prefix in hex chars */ + unsigned int nrobjects; char hex_pfx[GIT_SHA1_HEXSZ + 1]; unsigned char bin_pfx[GIT_SHA1_RAWSZ]; @@ -118,6 +119,12 @@ static void find_short_object_filename(struct disambiguate_state *ds) if (strlen(de->d_name) != 38) continue; + + // We only look at the one subdirectory, and we assume + // each subdirectory is roughly similar, so each object + // we find probably has 255 other objects in the other + // fan-out directories + ds->nrobjects += 256; if (memcmp(de->d_name, ds->hex_pfx + 2, ds->len - 2)) continue; memcpy(hex + 2, de->d_name, 38); @@ -151,6 +158,7 @@ static void unique_in_pack(struct packed_git *p, open_pack_index(p); num = p->num_objects; + ds->nrobjects += num; last = num; while (first < last) { uint32_t mid = (first + last) / 2; @@ -380,6 +388,9 @@ static int show_ambiguous_object(const unsigned char *sha1, void *data) return 0; } +// Why seven? That's our historical default before the automatic abbreviation +static int default_automatic_abbrev = 7; + static int get_short_sha1(const char *name, int len, unsigned char *sha1, unsigned flags) { @@ -426,6 +437,14 @@ static int get_short_sha1(const char *name, int len, unsigned char *sha1, for_each_abbrev(ds.hex_pfx, show_ambiguous_object, &ds); } + if (len < 16 && !status && (flags & GET_SHA1_AUTOMATIC)) { + unsigned int expect_collision = 1 << (len * 2); + if (ds.nrobjects > expect_collision) { + default_automatic_abbrev = len+1; + return SHORT_NAME_AMBIGUOUS; + } + } + return status; } @@ -458,14 +477,19 @@ int for_each_abbrev(const char *prefix, each_abbrev_fn fn, void *cb_data) int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len) { int status, exists; + int flags = GET_SHA1_QUIETLY; + if (len < 0) { + flags |= GET_SHA1_AUTOMATIC; + len = default_automatic_abbrev; + } sha1_to_hex_r(hex, sha1
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
On Thu, Sep 29, 2016 at 5:56 PM, Mike Hommey wrote: > > OTOH, how often does one refer to trees or blobs with abbreviated sha1s? > Most of the time, you'd use abbreviated sha1s for commits. And the number > of commits in git and the kernel repositories are much lower than the > number of overall objects. See that whole other discussion about this. I agree. If we only ever worried about just commits, the abbreviation length wouldn't need to be grown nearly as aggressively. The current default would still be wrong for the kernel, but it wouldn't be as noticeably wrong, and updating it to 8 or 9 would be fine. That said, people argued against that too. We *do* end up having abbreviated SHA1's for blobs in the diff index. When I said that _I_ neer use it, somebody piped up to say that they do. So I'd rather just keep the existing semantics (a hash is a hash is a hash), and just abbreviate at a sufficient point that we don't have to worry too much about disambiguating further by object type. Linus
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
On Thu, Sep 29, 2016 at 5:28 PM, Linus Torvalds wrote: > > To be fair, my original patch had a different worry that I didn't > bother with: what if one of the _other_ callers of "get_short_sha1()" > passed in -1 to it. I only handled the -1 case in th eone path care > about in that first RFC for testing. So I'm *not* suggesting you > should apply my first version,, It has issues too. Actually, all the other cases seem to be "parse a SHA1 with a known length", so they really don't have a negative length. So this seems ok, and is easier to verify than the "what all contexts might use DEFAULT_ABBREV" thing. There's only a few callers, and it's a static function so it's easy to check it locally in sha1_name.c. Linus
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
On Thu, Sep 29, 2016 at 5:20 PM, Linus Torvalds wrote: > > As you say, my original patch had neither of those issues. To be fair, my original patch had a different worry that I didn't bother with: what if one of the _other_ callers of "get_short_sha1()" passed in -1 to it. I only handled the -1 case in th eone path care about in that first RFC for testing. So I'm *not* suggesting you should apply my first version,, It has issues too. Let me see if I can massage my first hacky RFC test-patch into something more reliable. Linus
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
On Thu, Sep 29, 2016 at 4:13 PM, Junio C Hamano wrote: > > One thing that worries me is if we are ready to start accessing the > object store in all codepaths when we ask for DEFAULT_ABBREV. Yes. That was my main worry too. I also looked at just doing an explicit if (abbrev_commit && default_abbrev < 0) default_abbrev = get_default_abbrev(); and in many ways that would be nicer exactly because the point where this happens is then explicit, instead of being hidden behind that macro that may end up being done in random places. But it wasn't entirely obvious which all paths would need that initialization either, so on the whole it was very much a "six of one, half a dozen of the other" thing. As you say, my original patch had neither of those issues. It just stupidly re-did the loop over and over, and maybe the right thing to do is to have that original code, but just short-circuit the "over and over" behavior by just resetting default_abbrev to the value we do find. Linus
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
Junio C Hamano writes: > Linus Torvalds writes: > >> The advantage of the >> previous patch was that it got the object counting right almost >> automatically, this actually has its own new object counting code and >> maybe I screwed it up. I guess another advantage of your original approach was that it delayed the counting to the very last minute, so the things that worried me in my previous response were automatically made non-issues.
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
Linus Torvalds writes: > Somebody should really double-check my heuristics, to see that I did > the pack counting etc right. It doesn't do alternate loose file > counting at all, and maybe it could matter. The advantage of the > previous patch was that it got the object counting right almost > automatically, this actually has its own new object counting code and > maybe I screwed it up. One thing that worries me is if we are ready to start accessing the object store in all codepaths when we ask for DEFAULT_ABBREV. The worries are twofold: (1) Do we do the right thing if object store is not available to us? Some commands can be run outside repository, and if our call to prepare_packed_git() or loose object iteration barfed in some way, that would introduce a regression. (2) Is calling prepare_packed_git() too early interfere with how the commands expect its own prepare_packed_git() work? That is, if a command has this sequence, "ask DEFAULT_ABBREV, arrange things, and then call prepare_packed_git()", and the existing "arrange things" step had something that causes a new pack to become eligible to be read by prepare_packed_git(), like adding to the list of alternate object stores, its own prepare_packed_git() will now become a no-op. I browsed through "tig grep DEFAULT_ABBREV \*.c" and it seems that in majority of the hits, we not just are ready to start accessing, but already have an object or two, which must have come from an already open object store, so they are OK. Especially the ones that use it as the last argument to find_unique_abbrev() are OK as we are about to open the object store to do the computation. There are very early ones in the program startup sequence in the following functions, but I do not think of a reason why our new and early call to prepare_packed_git() might be problematic, given that all of them require us to have an access to the repository (i.e. this change cannot introduce a regression where a command used to work outside a repository but barf when prepare_packed_git() is called early): - builtin/describe.c - builtin/rev-list.c - builtin/rev-parse.c I thought that the one in diff.c might be problematic when the "git diff" command is run outside a repository with the "--no-index" option, but it appears that init_default_abbrev() seems to be OK when run outside a repository. There is one in parse-options-cb.c that is used to parse the --abbrev command line option. This might cause a cosmetic problem but when the user is asking for an abbreviation, it is expected that we will have an access to the object store anyway, so it may be OK. I am sorry that none of the above is about your math ;-) I suck at math so I won't comment.
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
On Thu, Sep 29, 2016 at 12:45 PM, Junio C Hamano wrote: > > I think that is a reasonable way to go. > > #define DEFAULT_ABBREV get_default_abbrev() > > would help. So something like this that replaces the previous patch? Somebody should really double-check my heuristics, to see that I did the pack counting etc right. It doesn't do alternate loose file counting at all, and maybe it could matter. The advantage of the previous patch was that it got the object counting right almost automatically, this actually has its own new object counting code and maybe I screwed it up. Linus cache.h | 3 ++- environment.c | 2 +- sha1_file.c | 43 +++ 3 files changed, 46 insertions(+), 2 deletions(-) diff --git a/cache.h b/cache.h index 6e33f2f28..a022e1bd2 100644 --- a/cache.h +++ b/cache.h @@ -1186,8 +1186,9 @@ static inline int hex2chr(const char *s) } /* Convert to/from hex/sha1 representation */ +extern int get_default_abbrev(void); #define MINIMUM_ABBREV minimum_abbrev -#define DEFAULT_ABBREV default_abbrev +#define DEFAULT_ABBREV get_default_abbrev() struct object_context { unsigned char tree[20]; diff --git a/environment.c b/environment.c index c1442df9a..fd6681e46 100644 --- a/environment.c +++ b/environment.c @@ -16,7 +16,7 @@ int trust_executable_bit = 1; int trust_ctime = 1; int check_stat = 1; int has_symlinks = 1; -int minimum_abbrev = 4, default_abbrev = 7; +int minimum_abbrev = 4, default_abbrev = -1; int ignore_case; int assume_unchanged; int prefer_symlink_refs; diff --git a/sha1_file.c b/sha1_file.c index ca149a607..28ba04b65 100644 --- a/sha1_file.c +++ b/sha1_file.c @@ -3720,3 +3720,46 @@ int for_each_packed_object(each_packed_object_fn cb, void *data, unsigned flags) } return r ? r : pack_errors; } + +static int init_default_abbrev(void) +{ + unsigned long count = 0; + struct packed_git *p; + struct strbuf buf = STRBUF_INIT; + DIR *dir; + char *name; + int ret; + + prepare_packed_git(); + for (p = packed_git; p; p = p->next) { + if (open_pack_index(p)) + continue; + count += p->num_objects; + } + + strbuf_addstr(&buf, get_object_directory()); + strbuf_addstr(&buf, "/42/"); + name = strbuf_detach(&buf, NULL); + dir = opendir(name); + free(name); + if (dir) { + struct dirent *de; + while ((de = readdir(dir)) != NULL) { + count += 256; + } + closedir(dir); + } + for (ret = 7; ret < 15; ret++) { + unsigned long expect_collision = 1ul << (ret * 2); + if (count < expect_collision) + break; + } + return ret; +} + +int get_default_abbrev(void) +{ + if (default_abbrev < 0) + default_abbrev = init_default_abbrev(); + return default_abbrev; +}
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
Linus Torvalds writes: > But you could easily also just instead have it do something like > > if (default_abbrev < 0) > default_abbrev = initialize_abbrev(); > > at startup time if "abbrev_commit" is set, and just do it once and for > all rather rthan the odd loping behavior. I think that is a reasonable way to go. #define DEFAULT_ABBREV get_default_abbrev() would help.
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
Linus Torvalds writes: > On Thu, Sep 29, 2016 at 11:55 AM, Linus Torvalds > wrote: >> >> For the kernel, just the *math* right now actually gives 12 >> characters. For current git it actually seems to say that 8 is the >> correct number. For small projects, you'll still see 7. > > Sorry, the git number is 9, not 8. The reason is that git has roughly > 212k objects, and 9 hex digits gets expected collisions at about 256k > objects. > > So the logic means that we'll see 7 hex digits for projects with less > than 16k objects, 8 hex digits if there are less than 64k objects, and > 9 hex digits for projects like git that currently have fewer than 256k > objects. Whew. I was wondering where my brain went wrong, as I knew we have 200k objects and 8 hexdigits means 1<<16 = 64k which is way too short.
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
On Thu, Sep 29, 2016 at 12:16 PM, Jeff King wrote: > > Hmm. So at length 7, we expect collisions at 2^14, which is 16384. That > seems really low. I mean, by the birthday paradox that's where expect > a 50% chance of a collision. But that's a single collision. We > definitely don't expect them to be common at that size. > > So I suspect this could be a bit looser. So I have to admit that I was surprised by how quickly it actually decided that 7 isn't enough. In fact, the reason I initially said that git used 8 digits was that I didn't count very closely, and just verified that it was more than the default 7. But quite frankly, I think the math is correct, and part of that is that the logic is all about not just the current state, but the "reasonably near future". So it is indeed fairly aggressive, and the moment you have more objects than the "we'd expect to probably see _one_ collision" it grows the size. But looking at the kernel situation, that really is what we'd want, because the whole problem with the existing code is that it only takes the *current* situation into account. That's what we want to get away from. We want git to pick a number that is sane from a standpoint of "this project is still growing". And git _already_ has commits that are ambiguous in 8 hex digits and need 9. Yes, it's rare today, but the reason I'm telling kernel developers to use 12 is because while a size-11 collision is very rare today, it does actually happen, and we want o pick a value where it is rare enough that even in the near future it's not going to be a big deal. Don't get me wrong: collisions aren't fatal. So it's not like we have to absolutely avoid them, and I really like your patch series exactly because it makes collisions even less of a deal (particularly since I expect people will not upgrade immediately, so we'll continue to see even new 7-hex-digit short forms even in the kernel). So it's a balance of making the hex string long enough that it's simply not a big worry. So I'm sure it *could* be looser, but I actually also really suspect that git truly *should* use a 9-digit abbreviation rather than 8 (and 7 is definitely starting to be borderline, I think). > As far as the implementation, I was surprised to see it touch > get_short_sha1() at all. That's, after all, for lookups, and we would > never want to require more characters on the reading side. Heh. The implementation is crap. It was literally a "how can I make the smallest possible patch" implementation. I was finishing it off while at a talk by Nicolas Pitre at Linaro Connect where I am right now. So I agree - it does extra work just because that's where it all slotted in with minimal effort. At a minimum, once it finds a good new default, it should just memoize that. So a minimal fix to the "it's stupldly recalculating things over rand over again" would be to just set "default_abbrev" to the value it finds acceptable after the first time it finds something, so that it doesn't end up looping _again_ in the future. But you could easily also just instead have it do something like if (default_abbrev < 0) default_abbrev = initialize_abbrev(); at startup time if "abbrev_commit" is set, and just do it once and for all rather rthan the odd loping behavior. I really just wanted to see how well the concept worked, and I was happy to see that it gave what I thought were the "correct" numbers. And the loop was salready there ... Linus
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
On Thu, Sep 29, 2016 at 11:55:46AM -0700, Linus Torvalds wrote: > I think the patch can speak for itself, but the basic core is this > section in get_short_sha1(): > > + if (len < 16 && !status && (flags & GET_SHA1_AUTOMATIC)) { > + unsigned int expect_collision = 1 << (len * 2); > + if (ds.nrobjects > expect_collision) > + return SHORT_NAME_AMBIGUOUS; > + } Hmm. So at length 7, we expect collisions at 2^14, which is 16384. That seems really low. I mean, by the birthday paradox that's where expect a 50% chance of a collision. But that's a single collision. We definitely don't expect them to be common at that size. So I suspect this could be a bit looser. The real number we care about is probably something like "there is probability 'p' of a collision when we add a new object", but I'm not sure what that 'p' would be. Or perhaps "we accept collisions in 'n' percent of objects". But again, I don't know that 'n'. I dunno. I suppose being overly conservative with this number leaves room for growth. Repositories generally get bigger, not smaller. :) > What do you think? It's actually a fairly simple patch and I really do > think it makes sense and it seems to just DTRT automatically. I like the general idea. As far as the implementation, I was surprised to see it touch get_short_sha1() at all. That's, after all, for lookups, and we would never want to require more characters on the reading side. I see you worked around it with a flag so that this behavior only kicks in when called via find_unique_abbrev(). But if you look at the caller: > @@ -458,14 +472,19 @@ int for_each_abbrev(const char *prefix, each_abbrev_fn > fn, void *cb_data) > int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len) > { > int status, exists; > + int flags = GET_SHA1_QUIETLY; > > + if (len < 0) { > + flags |= GET_SHA1_AUTOMATIC; > + len = 7; > + } > sha1_to_hex_r(hex, sha1); > if (len == 40 || !len) > return 40; > exists = has_sha1_file(sha1); > while (len < 40) { > unsigned char sha1_ret[20]; > - status = get_short_sha1(hex, len, sha1_ret, GET_SHA1_QUIETLY); > + status = get_short_sha1(hex, len, sha1_ret, flags); > if (exists > ? !status > : status == SHORT_NAME_NOT_FOUND) { You can see that we're going to do more work than we would otherwise need to. Because we start at 7, and ask get_short_sha1() "is this unique enough?", and looping. But if we _know_ we won't accept any answer shorter than some N based on the number of objects in the repository, then we should start at that N. IOW, something like: if (len < 0) len = ceil(log_base_2(repository_object_count())); here, and then you don't have to touch get_short_sha1() at all. I suspect you pushed it down into get_short_sha1() because it kind-of does the repository_object_count() step for "free" as it's looking at the object anyway. But that step is really not very expensive. And I'd even say you could just ignore loose objects entirely, and treat them like a rounding error (the way that duplicate objects in packs are treated). That leaves you with just an O(# of packs) loop over a linked list. You could even just keep a global object count up to date in add_packed_git(), and then it's O(1). -Peff
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
On Thu, Sep 29, 2016 at 11:55 AM, Linus Torvalds wrote: > > For the kernel, just the *math* right now actually gives 12 > characters. For current git it actually seems to say that 8 is the > correct number. For small projects, you'll still see 7. Sorry, the git number is 9, not 8. The reason is that git has roughly 212k objects, and 9 hex digits gets expected collisions at about 256k objects. So the logic means that we'll see 7 hex digits for projects with less than 16k objects, 8 hex digits if there are less than 64k objects, and 9 hex digits for projects like git that currently have fewer than 256k objects. But git itself might not be *that* far from going to 10 hex digits with my patch. The kernel uses 12 he digits because the collision math says that's the right thing for a project with between 4M and 16M objects (with the kernel being at 5M). So on the whole the patch really does seem to just do the right thing automatically. Linus
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
On Thu, Sep 29, 2016 at 11:37 AM, Linus Torvalds wrote: > > I'm playing with an early patch to make the default more dynamic. > Let's see how well it works in practice, but it looks fairly > promising. Let me test a bit more and send out an RFC patch.. Ok, this is *very* rough, and it doesn't actuall pass all the tests, and I didn't even try to look at why. But it passes the trivial smell-test, and in particular it actually makes mathematical sense... I think the patch can speak for itself, but the basic core is this section in get_short_sha1(): + if (len < 16 && !status && (flags & GET_SHA1_AUTOMATIC)) { + unsigned int expect_collision = 1 << (len * 2); + if (ds.nrobjects > expect_collision) + return SHORT_NAME_AMBIGUOUS; + } basically, what it says is that we will consider a sha1 ambiguous even if it was *technically* unique (that's the '!status' part of the test) if: - the length was 15 or less *and* - the number of objects we have is larger than the expected point where statistically we should start to expect to get one collision. That "expect_collision" math is actually very simple: each hex character adds four bits of range, but since we expect collisions at the square root of the maximum number of objects, we shift by just two bits per hex digits instead. The rest of the patch is a trivial change to just initialize the default short size to -1, and consider that to mean "enable the automatic size checking with a minimum of 7". And the trivial code to estimate the number of objects (which ignores duplicates between packs etc _entirely_). For the kernel, just the *math* right now actually gives 12 characters. For current git it actually seems to say that 8 is the correct number. For small projects, you'll still see 7. ANYWAY. This patch is on top of Jeff's patches in 'pu' (I think those are great regardless of this patch!), and as mentioned, it fails some tests. I suspect that the failures might be due to the abbrev_default being -1, and some other code finds that surprising now. But as mentioned, I didn't really even look at it. What do you think? It's actually a fairly simple patch and I really do think it makes sense and it seems to just DTRT automatically. Linus cache.h | 1 + environment.c | 2 +- sha1_name.c | 21 - 3 files changed, 22 insertions(+), 2 deletions(-) diff --git a/cache.h b/cache.h index 6e33f2f..d2da6d1 100644 --- a/cache.h +++ b/cache.h @@ -1207,6 +1207,7 @@ struct object_context { #define GET_SHA1_TREEISH 020 #define GET_SHA1_BLOB 040 #define GET_SHA1_FOLLOW_SYMLINKS 0100 +#define GET_SHA1_AUTOMATIC 0200 #define GET_SHA1_ONLY_TO_DIE04000 #define GET_SHA1_DISAMBIGUATORS \ diff --git a/environment.c b/environment.c index c1442df..fd6681e 100644 --- a/environment.c +++ b/environment.c @@ -16,7 +16,7 @@ int trust_executable_bit = 1; int trust_ctime = 1; int check_stat = 1; int has_symlinks = 1; -int minimum_abbrev = 4, default_abbrev = 7; +int minimum_abbrev = 4, default_abbrev = -1; int ignore_case; int assume_unchanged; int prefer_symlink_refs; diff --git a/sha1_name.c b/sha1_name.c index 3b647fd..8791ff3 100644 --- a/sha1_name.c +++ b/sha1_name.c @@ -15,6 +15,7 @@ typedef int (*disambiguate_hint_fn)(const unsigned char *, void *); struct disambiguate_state { int len; /* length of prefix in hex chars */ + unsigned int nrobjects; char hex_pfx[GIT_SHA1_HEXSZ + 1]; unsigned char bin_pfx[GIT_SHA1_RAWSZ]; @@ -118,6 +119,12 @@ static void find_short_object_filename(struct disambiguate_state *ds) if (strlen(de->d_name) != 38) continue; + + // We only look at the one subdirectory, and we assume + // each subdirectory is roughly similar, so each object + // we find probably has 255 other objects in the other + // fan-out directories + ds->nrobjects += 256; if (memcmp(de->d_name, ds->hex_pfx + 2, ds->len - 2)) continue; memcpy(hex + 2, de->d_name, 38); @@ -151,6 +158,7 @@ static void unique_in_pack(struct packed_git *p, open_pack_index(p); num = p->num_objects; + ds->nrobjects += num; last = num; while (first < last) { uint32_t mid = (first + last) / 2; @@ -426,6 +434,12 @@ static int get_short_sha1(const char *name, int len, unsigned char *sha1, for_each_abbrev(ds.hex_pfx, show_ambiguous_object, &ds); } + if (len < 16 && !status && (flags & GET_SHA1_AUTOMATIC)) { + unsigned int expect_collision = 1 << (len * 2); + if (ds.nrobjects > expect_collision) + return SHORT_NAME_AMBIGUOUS; + } +
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
On Thu, Sep 29, 2016 at 11:05 AM, Junio C Hamano wrote: > > Yes, "git log --oneline" looks somewhat different and strange for > me, too ;-) I'm playing with an early patch to make the default more dynamic. Let's see how well it works in practice, but it looks fairly promising. Let me test a bit more and send out an RFC patch.. Linus
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
Johannes Sixt writes: > Am 29.09.2016 um 01:30 schrieb Junio C Hamano: >> As Peff said, responding in a thread started by Linus's suggestion >> to raise the default abbreviation to 12 hexdigits: > > This is waayy too large for a new default. The vast majority of > repositories is smallish. For those, the long sequences of hex digits > are an uglification that is almost unbearable. > > I know that kernel developers are important, but their importance has > long been outnumbered by the anonymous and silent masses of users. > > Personally, I use 8 digits just because it is a "rounder" number than > 7, but in all of my repositories 7 would still work just as well. Yes, "git log --oneline" looks somewhat different and strange for me, too ;-) I am sure I'll get used to it if I keep using it, but I suspect that I'd be irritated as I find myself typing 'q' more and more often to "less -S" that is automatically invoked when I do "git log --oneline master.." to see what commits are on my current topic branch.
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
Quoting Jeff King : On Thu, Sep 29, 2016 at 04:44:00AM +0200, SZEDER Gábor wrote: > So 12 seems reasonable, and the only downside for it (or for "13", for > that matter) is a few extra bytes. I dunno, maybe people will really > hate that, but I have a feeling these are mostly cut-and-pasted anyway. I for one raise my hand in protest... "few extra bytes" is not the only downside, and it's not at all about how many characters are copy-and-pasted. In my opinion it's much more important that this change wastes 5 columns worth of valuable screen real estate e.g. for 'git blame' or 'git log --oneline' in projects that don't need it and certainly won't ever need it. True. The core of the issue is that we really only care about this minimum length when _storing_ an abbreviation, but we don't know when the user is just looking at it in the moment, and when they are going to stick it in a commit message, email, or bug tracker. In an ideal world, anybody who was about to store it would run "git describe" or something to come up with some canonical reference format. And we could just bump the default minimum there. Personally, I almost exclusively cite commits as the output of: git log -1 --pretty='tformat:%h (%s, %ad)' --date=short Interesting, I have a pretty format alias that looks almost like this, except that I carry a patch locally allowing me to say %as for short date format :) What I sometimes wished for is a pretty format specifier for 'git describe --contains', which would make it convenient to cite commits like this: v0.99~954 (Initial revision of "git", the information manager from hell, 2005-04-07). It's better than the abbreviated object name, because it will stay unique, assuming that the chosen tag is never deleted, and it carries extra information for humans (the first release containing the referenced commit), while the abbreviated object name is completely meaningless. The obvious drawback that makes it a non-solution for the problem at hand is that this format can only refer to commits that are reachable from a tag and can't be used for commits that are descendants of the most recent tag, e.g. when fixing a bug introduced after the last release. Oh, and the user has to fetch the tag first to be able to make sense of such a reference. and I'd be fine to stick "--abbrev=12" in there for future-proofing. But I don't know what the kernel or other projects do. I'd also be curious to know if the patch I sent in [1] to more aggressively prefer commits would make this less of an issue, and people wouldn't care as much about using longer hashes in the first place. So one option is to merge that (and possibly even make it the default) and see if people still care in 6 months. -Peff [1] http://public-inbox.org/git/20160927123801.3bpdg3hap3kzz...@sigill.intra.peff.net/
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
Jeff King writes: > On Thu, Sep 29, 2016 at 04:44:00AM +0200, SZEDER Gábor wrote: > >> > So 12 seems reasonable, and the only downside for it (or for "13", for >> > that matter) is a few extra bytes. I dunno, maybe people will really >> > hate that, but I have a feeling these are mostly cut-and-pasted anyway. >> >> I for one raise my hand in protest... >> >> "few extra bytes" is not the only downside, and it's not at all about >> how many characters are copy-and-pasted. In my opinion it's much more >> important that this change wastes 5 columns worth of valuable screen >> real estate e.g. for 'git blame' or 'git log --oneline' in projects >> that don't need it and certainly won't ever need it. > > True. The core of the issue is that we really only care about this > minimum length when _storing_ an abbreviation, but we don't know when > the user is just looking at it in the moment, and when they are going to > stick it in a commit message, email, or bug tracker. Perhaps a compromise would be to adapt the length to the size of the project _and_ keep a huge margin. So, essentially, we'd have small projects stick to the 7 characters, and very quickly bump to 12. So, for a fast-growing project, there would be a short window at the beginning of the project where people could cut-and-past short hashes. OTOH, small projects could keep these few columns of screen real-estate. That said, I can certainly live without these 5 columns, don't take my message as an objection to setting to 12 right away. -- Matthieu Moy http://www-verimag.imag.fr/~moy/
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
On Wed, Sep 28, 2016 at 04:30:47PM -0700, Junio C Hamano wrote: > As Peff said, responding in a thread started by Linus's suggestion > to raise the default abbreviation to 12 hexdigits: > > I actually think "12" might be sane for a long time. That's 48 bits of > sha1, so we'd expect a 50% change of a _single_ collision at 2^24, or 16 > million. The biggest repository I know about (in number of objects) is > the one holding all of the objects for all of the forks of > torvalds/linux on GitHub. It's at about 15 million objects. > > Which _seems_ close, but remember that's the size where we expect to see > a single collision. They don't become common until much later (I didn't > compute an exact number, but Linus's 16x sounds about right). I know > that the growth of the kernel isn't really linear, but I think the need > to bump to "13" might not just be decades, but possibly a century or > more. > > So 12 seems reasonable, and the only downside for it (or for "13", for > that matter) is a few extra bytes. I dunno, maybe people will really > hate that, but I have a feeling these are mostly cut-and-pasted anyway. I am not sure my quote is a good rationale for this bump. It was meant to be a rationale that "12" is big enough, but the "I dunno" at the end kind of glosses over the downsides. -Peff
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
On Thu, Sep 29, 2016 at 07:27:01AM +0200, Lukas Fleischer wrote: > > Sure, users working on smaller repos are free to reset core.abbrev to > > its original value. I don't have any numbers, of course, but I > > suspect that there are many more smaller repos out there that this > > change will affect disadvantageously, than there are large repos for > > which it's beneficial. > > I know this suggestion comes a bit late but would it make sense to let > the repository owner overwrite the core.abbrev setting? > > One possible way to implement this would be adding .gitconfig support to > repositories with a very limited set of whitelisted variables allowed in > there (could be core.abbrev only to begin with). Or some entirely > separate mechanism like .gitignore. The suggestion for versioned repository-level config comes up from time to time; you can find other instances in the list archive. Usually the biggest issue is that usually nobody comes up with a good example of something that the project would actually want to set. Setting "core.abbrev" at least seems plausible. Though... > With such a mechanism, we could keep the default of 7 which works fine > for most projects. Linus could bump the default to 12 for linux.git. If > some users are not happy with that, they can still overwrite it in their > local Git config. Anybody starting a project could change the initial > value to a suitable value in one of the first commits -- provided they > already have an idea how much the project will grow. That way, hashes > will be "long enough" even for early commits, before any heuristics > could guess that the project would become large. I wonder if in practice we would do just as well to size default_abbrev dynamically based on the number of objects. That doesn't help projects which are just starting, but will eventually grow gigantic. But I doubt that most projects would have the foresight to preemptively set core.abbrev. And that would at least reduce the impact as the project _does_ get big. -Peff
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
On Thu, Sep 29, 2016 at 04:44:00AM +0200, SZEDER Gábor wrote: > > So 12 seems reasonable, and the only downside for it (or for "13", for > > that matter) is a few extra bytes. I dunno, maybe people will really > > hate that, but I have a feeling these are mostly cut-and-pasted anyway. > > I for one raise my hand in protest... > > "few extra bytes" is not the only downside, and it's not at all about > how many characters are copy-and-pasted. In my opinion it's much more > important that this change wastes 5 columns worth of valuable screen > real estate e.g. for 'git blame' or 'git log --oneline' in projects > that don't need it and certainly won't ever need it. True. The core of the issue is that we really only care about this minimum length when _storing_ an abbreviation, but we don't know when the user is just looking at it in the moment, and when they are going to stick it in a commit message, email, or bug tracker. In an ideal world, anybody who was about to store it would run "git describe" or something to come up with some canonical reference format. And we could just bump the default minimum there. Personally, I almost exclusively cite commits as the output of: git log -1 --pretty='tformat:%h (%s, %ad)' --date=short and I'd be fine to stick "--abbrev=12" in there for future-proofing. But I don't know what the kernel or other projects do. I'd also be curious to know if the patch I sent in [1] to more aggressively prefer commits would make this less of an issue, and people wouldn't care as much about using longer hashes in the first place. So one option is to merge that (and possibly even make it the default) and see if people still care in 6 months. -Peff [1] http://public-inbox.org/git/20160927123801.3bpdg3hap3kzz...@sigill.intra.peff.net/
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
Am 29.09.2016 um 01:30 schrieb Junio C Hamano: As Peff said, responding in a thread started by Linus's suggestion to raise the default abbreviation to 12 hexdigits: This is waayy too large for a new default. The vast majority of repositories is smallish. For those, the long sequences of hex digits are an uglification that is almost unbearable. I know that kernel developers are important, but their importance has long been outnumbered by the anonymous and silent masses of users. Personally, I use 8 digits just because it is a "rounder" number than 7, but in all of my repositories 7 would still work just as well. -- Hannes
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
On Thu, 29 Sep 2016 at 04:44:00, SZEDER Gábor wrote: > I for one raise my hand in protest... > > "few extra bytes" is not the only downside, and it's not at all about > how many characters are copy-and-pasted. In my opinion it's much more > important that this change wastes 5 columns worth of valuable screen > real estate e.g. for 'git blame' or 'git log --oneline' in projects > that don't need it and certainly won't ever need it. > > Sure, users working on smaller repos are free to reset core.abbrev to > its original value. I don't have any numbers, of course, but I > suspect that there are many more smaller repos out there that this > change will affect disadvantageously, than there are large repos for > which it's beneficial. I know this suggestion comes a bit late but would it make sense to let the repository owner overwrite the core.abbrev setting? One possible way to implement this would be adding .gitconfig support to repositories with a very limited set of whitelisted variables allowed in there (could be core.abbrev only to begin with). Or some entirely separate mechanism like .gitignore. With such a mechanism, we could keep the default of 7 which works fine for most projects. Linus could bump the default to 12 for linux.git. If some users are not happy with that, they can still overwrite it in their local Git config. Anybody starting a project could change the initial value to a suitable value in one of the first commits -- provided they already have an idea how much the project will grow. That way, hashes will be "long enough" even for early commits, before any heuristics could guess that the project would become large. Opinions? Regards, Lukas
Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
> As Peff said, responding in a thread started by Linus's suggestion > to raise the default abbreviation to 12 hexdigits: > > I actually think "12" might be sane for a long time. That's 48 bits of > sha1, so we'd expect a 50% change of a _single_ collision at 2^24, or 16 s/change/chance/ I know it's quoted, but still. > million. The biggest repository I know about (in number of objects) is > the one holding all of the objects for all of the forks of > torvalds/linux on GitHub. It's at about 15 million objects. > > Which _seems_ close, but remember that's the size where we expect to see > a single collision. They don't become common until much later (I didn't > compute an exact number, but Linus's 16x sounds about right). I know > that the growth of the kernel isn't really linear, but I think the need > to bump to "13" might not just be decades, but possibly a century or > more. > > So 12 seems reasonable, and the only downside for it (or for "13", for > that matter) is a few extra bytes. I dunno, maybe people will really > hate that, but I have a feeling these are mostly cut-and-pasted anyway. I for one raise my hand in protest... "few extra bytes" is not the only downside, and it's not at all about how many characters are copy-and-pasted. In my opinion it's much more important that this change wastes 5 columns worth of valuable screen real estate e.g. for 'git blame' or 'git log --oneline' in projects that don't need it and certainly won't ever need it. Sure, users working on smaller repos are free to reset core.abbrev to its original value. I don't have any numbers, of course, but I suspect that there are many more smaller repos out there that this change will affect disadvantageously, than there are large repos for which it's beneficial. > And this does exactly that. > > Keep the tests working by explicitly asking for the old 7 hexdigits > setting in the fake system-wide configuration file used for tests. > > Signed-off-by: Junio C Hamano
[PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits
As Peff said, responding in a thread started by Linus's suggestion to raise the default abbreviation to 12 hexdigits: I actually think "12" might be sane for a long time. That's 48 bits of sha1, so we'd expect a 50% change of a _single_ collision at 2^24, or 16 million. The biggest repository I know about (in number of objects) is the one holding all of the objects for all of the forks of torvalds/linux on GitHub. It's at about 15 million objects. Which _seems_ close, but remember that's the size where we expect to see a single collision. They don't become common until much later (I didn't compute an exact number, but Linus's 16x sounds about right). I know that the growth of the kernel isn't really linear, but I think the need to bump to "13" might not just be decades, but possibly a century or more. So 12 seems reasonable, and the only downside for it (or for "13", for that matter) is a few extra bytes. I dunno, maybe people will really hate that, but I have a feeling these are mostly cut-and-pasted anyway. And this does exactly that. Keep the tests working by explicitly asking for the old 7 hexdigits setting in the fake system-wide configuration file used for tests. Signed-off-by: Junio C Hamano --- environment.c| 2 +- t/gitconfig-for-test | 3 +++ 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/environment.c b/environment.c index ca72464..25daddb 100644 --- a/environment.c +++ b/environment.c @@ -16,7 +16,7 @@ int trust_executable_bit = 1; int trust_ctime = 1; int check_stat = 1; int has_symlinks = 1; -int minimum_abbrev = 4, default_abbrev = 7; +int minimum_abbrev = 4, default_abbrev = 12; int ignore_case; int assume_unchanged; int prefer_symlink_refs; diff --git a/t/gitconfig-for-test b/t/gitconfig-for-test index 4598885..8c28442 100644 --- a/t/gitconfig-for-test +++ b/t/gitconfig-for-test @@ -4,3 +4,6 @@ ;; [user] ;; name = A U Thor ;; email = aut...@example.com + +[core] + abbrev = 7 -- 2.10.0-584-gc9e068c