subject:"\[PATCH 4\/4\] core.abbrev\: raise the default abbreviation to 12 hexdigits"

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-30 Thread Junio C Hamano

Linus Torvalds  writes:

> On Fri, Sep 30, 2016 at 10:54 AM, Linus Torvalds
>  wrote:
>>
>>> So IMHO, the best combination is the init_default_abbrev() you posted in
>>> [1], but initialized at the top of find_unique_abbrev(). And cached
>>> there, obviously, in a similar way.
>>
>> That's certainly possible, but I'm really not happy with how the
>> counting function looks.  And nobody actually stood up to say "yeah,
>> that gets alternate loose objects right" or "if you have tons of those
>> alternate loose objects you have other issues anyway". I think
>> somebody would have to "own" that counting function, the advantage of
>> just putting it into disambiguate_state is that we just get the
>> counting for free..
>
> Side note: maybe we can mix the two approaches, and keep the counting
> in the disambiguation state, and just make the counting function do
>
> init_object_disambiguation();
> find_short_object_filename(&ds);
> find_short_packed_object(&ds);
> finish_object_disambiguation(&ds, sha1);
>
> and then just use "ds.nrobjects". So the counting would still be done
> by the disambiguation code, it just woudln't be in get_short_sha1().
>
> So here's another version that takes that approach. And if somebody
> (hint hint) wants to do the counting differently, they can perhaps
> send an incremental patch to do that.
>
> (This patch also contains the few setup issues Junio found with the
> new "default_abbrev is negative" model)

Sorry, but I do not quite see the point in the difference between
this one and your original that had a hook in get_short_sha1(), as
it seemed to me that Peff's objection was about the counting done in
find_short_object_filename() and find_short_packed_object(), which
is (understandably) still here.

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-30 Thread Ævar Arnfjörð Bjarmason

On Fri, Sep 30, 2016 at 3:01 AM, Linus Torvalds
 wrote:
> On Thu, Sep 29, 2016 at 5:56 PM, Mike Hommey  wrote:
>>
>> OTOH, how often does one refer to trees or blobs with abbreviated sha1s?
>> Most of the time, you'd use abbreviated sha1s for commits. And the number
>> of commits in git and the kernel repositories are much lower than the
>> number of overall objects.
>
> See that whole other discussion about this. I agree. If we only ever
> worried about just commits, the abbreviation length wouldn't need to
> be grown nearly as aggressively. The current default would still be
> wrong for the kernel, but it wouldn't be as noticeably wrong, and
> updating it to 8 or 9 would be fine.
>
> That said, people argued against that too. We *do* end up having
> abbreviated SHA1's for blobs in the diff index. When I said that _I_
> neer use it, somebody piped up to say that they do.
>
> So I'd rather just keep the existing semantics (a hash is a hash is a
> hash), and just abbreviate at a sufficient point that we don't have to
> worry too much about disambiguating further by object type.

I work on a repo that's around the size of linux.git in every way
(commits, objects etc.), and growing twice as fast.

So I also see 8 or 9 digit abbreviations on a daily basis, even with
the current defaults core.abbrev, but I still think growing it so
aggressively is the wrong thing to do.

The fact that we have a core.abbrev option at all and nobody's talking
about getting rid of it entirely means we all acknowledge the UX
convenience of short SHA1s.

I don't think it's a good idea for such UX options to have defaults
that really only make sense for repositories at the very far end of
the bell curve, which is the case with linux.git and the repo I work
on.

Either way you're going to waste somebody's time. I think it's a
better trade-off that some kernel dev occasionally has to look at
Peff's new disambiguation output, than have the wast hordes of
everyday Git users have less screen real estate, need to recite longer
sha1s over the phone during outages (people do that), and any number
of other every day use cases.

I think if anything we should be talking about making the default
shorter & then have some clever auto-scaling by repository size as has
been discussed in this thread to deal with the repositories at the far
end of the bell curve.

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-30 Thread Junio C Hamano

Linus Torvalds  writes:

> Considering that TRANSPORT_SUMMARY and TRANSPORT_SUMMARY_WIDTH are
> both used in exactly one place each, I'd suggest getting rid of that
> crazy macro, and just expanding it in those places to avoid these
> kinds of crazy "hiding variables inside complex defines thning".
>
> And maybe just deciding to hardcode TRANSPORT_SUMMARY_WIDTH to 17
> (which was it's original default value and presumably is what the test
> is effectively hardcoded for too), and avoiding that complexity
> entirely.

For all fairness, when the WIDTH thing was introduced, there were
two places that needed reference it at f1863d0d16 ("refactor
duplicated code in builtin-send-pack.c and transport.c",
2010-02-16).  But that is no longer the case, and it makes sense to
hardcode it as 17 (or something derived from a symbolic constant
that gives the new "default to default").

What TRANSPORT_SUMMARY() does is even more crazy and it really
shouldn't be exposed as a public interface.  Let's move it to its
single calling place.

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-30 Thread Linus Torvalds

On Fri, Sep 30, 2016 at 11:40 AM, Junio C Hamano  wrote:
>
> There is another instance buried deep in an obscure macro.  A
> minimum fix may look like this, but I really hope somebody else
> finds a better approach.

Heh. Yeah, that's just ugly. I assume this is why the odd git fetch
pretty-printing test was off by one column..

Considering that TRANSPORT_SUMMARY and TRANSPORT_SUMMARY_WIDTH are
both used in exactly one place each, I'd suggest getting rid of that
crazy macro, and just expanding it in those places to avoid these
kinds of crazy "hiding variables inside complex defines thning".

And maybe just deciding to hardcode TRANSPORT_SUMMARY_WIDTH to 17
(which was it's original default value and presumably is what the test
is effectively hardcoded for too), and avoiding that complexity
entirely.

Linus

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-30 Thread Junio C Hamano

Junio C Hamano  writes:

> From: Junio C Hamano 
> Date: Thu, 29 Sep 2016 21:19:20 -0700
> Subject: [PATCH] abbrev: adjust to the new world order
>
> The default_abbrev used to be a concrete value usable as the default
> abbreviation length.  The code that sets custom abbreviation length,
> in response to command line argument, often did something like:
>
>   if (skip_prefix(arg, "--abbrev=", &arg))
>   abbrev = atoi(arg);
>   else if (!strcmp("--abbrev", &arg))
>   abbrev = DEFAULT_ABBREV;
>   /* make the value sane */
>   if (abbrev < 0 || 40 < abbrev)
>   abbrev = ... some sane value ...
>
> The new world order however is that the default_abbrev is a negative
> value that signals find_unique_abbrev() that it needs to dynamically
> find out a good default value.  We shouldn't coerce a negative value
> into a random positive value like the above sample code.
>
> Signed-off-by: Junio C Hamano 

There is another instance buried deep in an obscure macro.  A
minimum fix may look like this, but I really hope somebody else
finds a better approach.  Peff alluded to "when it is still -1
substituting it with a reasonable value like 7" in a separate
thread, and we probably would want a way to allow accessing that
"reasonable value like 7" without triggering auto sizing logic
too early.

With this and the patch in the message I am responding to, your
patch from the last night seems to pass all the tests for me.

 transport.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/transport.h b/transport.h
index 6fe3485325..8a96e22bb0 100644
--- a/transport.h
+++ b/transport.h
@@ -142,7 +142,7 @@ struct transport {
 #define TRANSPORT_PUSH_ATOMIC 8192
 #define TRANSPORT_PUSH_OPTIONS 16384
 
-#define TRANSPORT_SUMMARY_WIDTH (2 * DEFAULT_ABBREV + 3)
+#define TRANSPORT_SUMMARY_WIDTH (2 * (DEFAULT_ABBREV < 0 ? 7 : DEFAULT_ABBREV) 
+ 3)
 #define TRANSPORT_SUMMARY(x) (int)(TRANSPORT_SUMMARY_WIDTH + strlen(x) - 
gettext_width(x)), (x)
 
 /* Returns a transport suitable for the url */

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-30 Thread Linus Torvalds

On Fri, Sep 30, 2016 at 10:54 AM, Linus Torvalds
 wrote:
>
>> So IMHO, the best combination is the init_default_abbrev() you posted in
>> [1], but initialized at the top of find_unique_abbrev(). And cached
>> there, obviously, in a similar way.
>
> That's certainly possible, but I'm really not happy with how the
> counting function looks.  And nobody actually stood up to say "yeah,
> that gets alternate loose objects right" or "if you have tons of those
> alternate loose objects you have other issues anyway". I think
> somebody would have to "own" that counting function, the advantage of
> just putting it into disambiguate_state is that we just get the
> counting for free..

Side note: maybe we can mix the two approaches, and keep the counting
in the disambiguation state, and just make the counting function do

init_object_disambiguation();
find_short_object_filename(&ds);
find_short_packed_object(&ds);
finish_object_disambiguation(&ds, sha1);

and then just use "ds.nrobjects". So the counting would still be done
by the disambiguation code, it just woudln't be in get_short_sha1().

So here's another version that takes that approach. And if somebody
(hint hint) wants to do the counting differently, they can perhaps
send an incremental patch to do that.

(This patch also contains the few setup issues Junio found with the
new "default_abbrev is negative" model)

  Linus
 builtin/rev-parse.c |  5 +++--
 diff.c  |  2 +-
 environment.c   |  2 +-
 sha1_name.c | 39 ++-
 4 files changed, 43 insertions(+), 5 deletions(-)

diff --git a/builtin/rev-parse.c b/builtin/rev-parse.c
index 4da1f1da2..cfb0f1510 100644
--- a/builtin/rev-parse.c
+++ b/builtin/rev-parse.c
@@ -671,8 +671,9 @@ int cmd_rev_parse(int argc, const char **argv, const char 
*prefix)
filter &= ~(DO_FLAGS|DO_NOREV);
verify = 1;
abbrev = DEFAULT_ABBREV;
-   if (arg[7] == '=')
-   abbrev = strtoul(arg + 8, NULL, 10);
+   if (!arg[7])
+   continue;
+   abbrev = strtoul(arg + 8, NULL, 10);
if (abbrev < MINIMUM_ABBREV)
abbrev = MINIMUM_ABBREV;
else if (40 <= abbrev)
diff --git a/diff.c b/diff.c
index 59920747d..c6d445915 100644
--- a/diff.c
+++ b/diff.c
@@ -3421,7 +3421,7 @@ void diff_setup_done(struct diff_options *options)
 */
read_cache();
}
-   if (options->abbrev <= 0 || 40 < options->abbrev)
+   if (options->abbrev > 40)
options->abbrev = 40; /* full */
 
/*
diff --git a/environment.c b/environment.c
index c1442df9a..fd6681e46 100644
--- a/environment.c
+++ b/environment.c
@@ -16,7 +16,7 @@ int trust_executable_bit = 1;
 int trust_ctime = 1;
 int check_stat = 1;
 int has_symlinks = 1;
-int minimum_abbrev = 4, default_abbrev = 7;
+int minimum_abbrev = 4, default_abbrev = -1;
 int ignore_case;
 int assume_unchanged;
 int prefer_symlink_refs;
diff --git a/sha1_name.c b/sha1_name.c
index 3b647fd7c..684b36dba 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -15,6 +15,7 @@ typedef int (*disambiguate_hint_fn)(const unsigned char *, 
void *);
 
 struct disambiguate_state {
int len; /* length of prefix in hex chars */
+   unsigned int nrobjects;
char hex_pfx[GIT_SHA1_HEXSZ + 1];
unsigned char bin_pfx[GIT_SHA1_RAWSZ];
 
@@ -118,6 +119,12 @@ static void find_short_object_filename(struct 
disambiguate_state *ds)
 
if (strlen(de->d_name) != 38)
continue;
+
+   // We only look at the one subdirectory, and we assume
+   // each subdirectory is roughly similar, so each object
+   // we find probably has 255 other objects in the other
+   // fan-out directories
+   ds->nrobjects += 256;
if (memcmp(de->d_name, ds->hex_pfx + 2, ds->len - 2))
continue;
memcpy(hex + 2, de->d_name, 38);
@@ -151,6 +158,7 @@ static void unique_in_pack(struct packed_git *p,
 
open_pack_index(p);
num = p->num_objects;
+   ds->nrobjects += num;
last = num;
while (first < last) {
uint32_t mid = (first + last) / 2;
@@ -455,17 +463,46 @@ int for_each_abbrev(const char *prefix, each_abbrev_fn 
fn, void *cb_data)
return ret;
 }
 
+static int get_automatic_abbrev(const char *hex)
+{
+   static int len;
+   struct disambiguate_state ds;
+
+   if (init_object_disambiguation(hex, 7, &ds) < 0)
+   return 7;
+
+   find_short_object_filename(&d

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-30 Thread Jeff King

On Fri, Sep 30, 2016 at 10:54:16AM -0700, Linus Torvalds wrote:

> On Fri, Sep 30, 2016 at 1:06 AM, Jeff King  wrote:
> >
> > I agree that this deals with the performance concerns by caching the
> > default_abbrev_len and starting there. I still think it's unnecessarily
> > invasive to touch get_short_sha1() at all, which is otherwise only a
> > reading function.
> 
> So the reason that d oesn't work is that the "disambiguate_state" data
> where we keep the number of objects is only visible within
> get_short_sha1().
> 
> So outside that function, you don't have any sane way to figure out
> how many objects. So then you have to do the extra counting function..

Right. I think you should do the extra counting function. It's a few
more lines, but the design is way less tangled.

> > So IMHO, the best combination is the init_default_abbrev() you posted in
> > [1], but initialized at the top of find_unique_abbrev(). And cached
> > there, obviously, in a similar way.
> 
> That's certainly possible, but I'm really not happy with how the
> counting function looks.  And nobody actually stood up to say "yeah,
> that gets alternate loose objects right" or "if you have tons of those
> alternate loose objects you have other issues anyway". I think
> somebody would have to "own" that counting function, the advantage of
> just putting it into disambiguate_state is that we just get the
> counting for free..

I don't think you _need_ get the alternate loose objects right. In fact,
I don't think you need to care about loose objects at all. For the
scales we're talking about, they're a rounding error. I would have done
it like this:

diff --git a/sha1_file.c b/sha1_file.c
index 65deaf9..1845502 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1382,6 +1382,32 @@ static void prepare_packed_git_one(char *objdir, int 
local)
strbuf_release(&path);
 }
 
+static int approximate_object_count_valid;
+
+/*
+ * Give a fast, rough count of the number of objects in the repository. This
+ * ignores loose objects completely. If you have a lot of them, then either
+ * you should repack because your performance will be awful, or they are
+ * all unreachable objects about to be pruned, in which case they're not really
+ * interesting as a measure of repo size in the first place.
+ */
+unsigned long approximate_object_count(void)
+{
+   static unsigned long count;
+   if (!approximate_object_count_valid) {
+   struct packed_git *p;
+
+   prepare_packed_git();
+   count = 0;
+   for (p = packed_git; p; p = p->next) {
+   if (open_pack_index(p))
+   continue;
+   count += p->num_objects;
+   }
+   }
+   return count;
+}
+
 static void *get_next_packed_git(const void *p)
 {
return ((const struct packed_git *)p)->next;
@@ -1456,6 +1482,7 @@ void prepare_packed_git(void)
 
 void reprepare_packed_git(void)
 {
+   approximate_object_count_valid = 0;
prepare_packed_git_run_once = 0;
prepare_packed_git();
 }

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-30 Thread Junio C Hamano

Jeff King  writes:

> I agree that this deals with the performance concerns by caching the
> default_abbrev_len and starting there. I still think it's unnecessarily
> invasive to touch get_short_sha1() at all, which is otherwise only a
> reading function.
>
> So IMHO, the best combination is the init_default_abbrev() you posted in
> [1], but initialized at the top of find_unique_abbrev(). And cached
> there, obviously, in a similar way.

Hmm. I am undecided; both approaches look OK to me.

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-30 Thread Linus Torvalds

On Fri, Sep 30, 2016 at 1:06 AM, Jeff King  wrote:
>
> I agree that this deals with the performance concerns by caching the
> default_abbrev_len and starting there. I still think it's unnecessarily
> invasive to touch get_short_sha1() at all, which is otherwise only a
> reading function.

So the reason that d oesn't work is that the "disambiguate_state" data
where we keep the number of objects is only visible within
get_short_sha1().

So outside that function, you don't have any sane way to figure out
how many objects. So then you have to do the extra counting function..

> So IMHO, the best combination is the init_default_abbrev() you posted in
> [1], but initialized at the top of find_unique_abbrev(). And cached
> there, obviously, in a similar way.

That's certainly possible, but I'm really not happy with how the
counting function looks.  And nobody actually stood up to say "yeah,
that gets alternate loose objects right" or "if you have tons of those
alternate loose objects you have other issues anyway". I think
somebody would have to "own" that counting function, the advantage of
just putting it into disambiguate_state is that we just get the
counting for free..

 Linus

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-30 Thread Jeff King

On Thu, Sep 29, 2016 at 06:18:03PM -0700, Linus Torvalds wrote:

> On Thu, Sep 29, 2016 at 5:57 PM, Linus Torvalds
>  wrote:
> >
> > Actually, all the other cases seem to be "parse a SHA1 with a known
> > length", so they really don't have a negative length.  So this seems
> > ok, and is easier to verify than the "what all contexts might use
> > DEFAULT_ABBREV" thing. There's only a few callers, and it's a static
> > function so it's easy to check it locally in sha1_name.c.
> 
> Here's my original patch with just a tiny change that instead of
> starting the automatic guessing at 7 each time, it starts at
> "default_automatic_abbrev", which is initialized to 7.
> 
> The difference is that if we decide that "oh, that was too small, need
> to repeat", we also update that "default_automatic_abbrev" value, so
> that we won't start at the number that we now know was too small.
> 
> So it still loops over the abbrev values, but now it only loops a
> couple of times.
> 
> I actually verified the performance impact by doing
> 
>   time git rev-list --abbrev-commit HEAD > /dev/null
> 
> on the kernel git tree, and it does actually matter. With my original
> patch, we wasted a noticeable amount of time on just the extra
> looping, with this it's down to the same performance as just doing it
> once at init time (it's about 12s vs 9s on my laptop).

I agree that this deals with the performance concerns by caching the
default_abbrev_len and starting there. I still think it's unnecessarily
invasive to touch get_short_sha1() at all, which is otherwise only a
reading function.

So IMHO, the best combination is the init_default_abbrev() you posted in
[1], but initialized at the top of find_unique_abbrev(). And cached
there, obviously, in a similar way.

-Peff

[1] 
http://public-inbox.org/git/CA+55aFyVEQ+8TBBUm5KG9APtd9wy8cp_mRO=3nj12dxznla...@mail.gmail.com/

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-30 Thread Jeff King

On Thu, Sep 29, 2016 at 04:13:38PM -0700, Junio C Hamano wrote:

> There are very early ones in the program startup sequence in the
> following functions, but I do not think of a reason why our new and
> early call to prepare_packed_git() might be problematic, given that
> all of them require us to have an access to the repository (i.e.
> this change cannot introduce a regression where a command used to
> work outside a repository but barf when prepare_packed_git() is
> called early):
> 
>  - builtin/describe.c
>  - builtin/rev-list.c
>  - builtin/rev-parse.c
> 
> I thought that the one in diff.c might be problematic when the "git
> diff" command is run outside a repository with the "--no-index"
> option, but it appears that init_default_abbrev() seems to be OK
> when run outside a repository.

Actually, "diff --no-index" is currently buggy in this regard. In the
followup series to jk/setup-sequence-update (which I mentioned but
haven't posted yet), I teach get_object_dir() not to blindly default to
".git", and found that "diff --no-index" is perfectly happy to look in
".git/objects" for find_unique_abbrev(), even when we know there's no
repository (or it has an unknown vintage).

I fixed it there by just using the default abbrev value for out-of-repo
diffs, and skip calling find_unique_abbrev() at all. That would here,
too.

But if we add object-store initialization at other times, it's a
potential conflict. IMHO this should stay inside find_unique_abbrev(),
where we know we already must look at the object store.

-Peff

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-29 Thread Junio C Hamano

Junio C Hamano  writes:

> Junio C Hamano  writes:
>
>> There still are breakages seen in t5510 and t5526 that are about the
>> verbose output of "git fetch".  I'll stop digging at this point
>> tonight, and welcome others who look into it ;-)
>
> OK, just before I leave the keyboard for the night...
>
> -- >8 --
> From: Junio C Hamano 
> Date: Thu, 29 Sep 2016 21:19:20 -0700
> Subject: [PATCH] abbrev: adjust to the new world order

To those who are following from sidelines, this builds on Linus's
third iteration patch (which is based on his first patch), applied
on Peff's "give disambiguation help when giving an ambiguity error"
series.  I didn't merge the work-in-progress going back and forth
between Linus and I tonight to any of the integration branches, but
it is available as lt/abbrev-auto-2 branch of the "broken down"
repository, i.e.

git://github.com/gitster/git.git lt/abbrev-auto-2

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-29 Thread Linus Torvalds

On Thu, Sep 29, 2016 at 9:18 PM, Linus Torvalds
 wrote:
>
> There are probably other things like that.

t5510-fetch.sh fails oddly, looks like the output is off by one character.

   not ok 77 - fetch aligned output

It has a magic "cut -c 22-" that expects the output at a specific
place, and now it's at column 21 instead of column 22. Strange test,
but it still seems to be aligned, just in a different column.

But clearly something changed.

 Linus

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-29 Thread Junio C Hamano

Junio C Hamano  writes:

> There still are breakages seen in t5510 and t5526 that are about the
> verbose output of "git fetch".  I'll stop digging at this point
> tonight, and welcome others who look into it ;-)

OK, just before I leave the keyboard for the night...

-- >8 --
From: Junio C Hamano 
Date: Thu, 29 Sep 2016 21:19:20 -0700
Subject: [PATCH] abbrev: adjust to the new world order

The default_abbrev used to be a concrete value usable as the default
abbreviation length.  The code that sets custom abbreviation length,
in response to command line argument, often did something like:

if (skip_prefix(arg, "--abbrev=", &arg))
abbrev = atoi(arg);
else if (!strcmp("--abbrev", &arg))
abbrev = DEFAULT_ABBREV;
/* make the value sane */
if (abbrev < 0 || 40 < abbrev)
abbrev = ... some sane value ...

The new world order however is that the default_abbrev is a negative
value that signals find_unique_abbrev() that it needs to dynamically
find out a good default value.  We shouldn't coerce a negative value
into a random positive value like the above sample code.

Signed-off-by: Junio C Hamano 
---
 builtin/rev-parse.c | 5 +++--
 diff.c  | 2 +-
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/builtin/rev-parse.c b/builtin/rev-parse.c
index 76cf05e2ad..17cbfabdde 100644
--- a/builtin/rev-parse.c
+++ b/builtin/rev-parse.c
@@ -643,8 +643,9 @@ int cmd_rev_parse(int argc, const char **argv, const char 
*prefix)
filter &= ~(DO_FLAGS|DO_NOREV);
verify = 1;
abbrev = DEFAULT_ABBREV;
-   if (arg[7] == '=')
-   abbrev = strtoul(arg + 8, NULL, 10);
+   if (!arg[7])
+   continue;
+   abbrev = strtoul(arg + 8, NULL, 10);
if (abbrev < MINIMUM_ABBREV)
abbrev = MINIMUM_ABBREV;
else if (40 <= abbrev)
diff --git a/diff.c b/diff.c
index c6da383c56..cefc13eb8e 100644
--- a/diff.c
+++ b/diff.c
@@ -3399,7 +3399,7 @@ void diff_setup_done(struct diff_options *options)
 */
read_cache();
}
-   if (options->abbrev <= 0 || 40 < options->abbrev)
+   if (40 < options->abbrev)
options->abbrev = 40; /* full */
 
/*
-- 
2.10.0-612-g22341905f2

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-29 Thread Linus Torvalds

On Thu, Sep 29, 2016 at 9:10 PM, Junio C Hamano  wrote:
>
> A quick and dirty fix for it may look like this.

Crossed emails.

Indeed, I just solved the builtin/rev-parse.c thing slightly differently.

And you found another failure in the diff code similarly not liking
the negative DEFAULT_ABBREV.  There are probably other things like
that.

  Linus

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-29 Thread Linus Torvalds

On Thu, Sep 29, 2016 at 8:54 PM, Junio C Hamano  wrote:
>
>  * The script uses "git rev-parse --short HEAD"; I suspect that it
>says "ah, default_abbrev is -1 and minimum_abbrev is 4, so let's
>try abbreviating to 4 hexdigits".

Ahh, right you are. The logic there is

abbrev = DEFAULT_ABBREV;
if (arg[7] == '=')
abbrev = strtoul(arg + 8, NULL, 10);
if (abbrev < MINIMUM_ABBREV)
abbrev = MINIMUM_ABBREV;


which now does something different than what it used to do because
DEFAULT_ABBREV is -1.

Putting the "sanity-check the abbrev range" tests inside the "if()"
statement that does strtoul() should fix it. Let me test...

[ short time passes ]

Yup. Incremental patch for that single issue attached.  I made it do
an early "continue" instead of adding another level on indentation.

 Linus
 builtin/rev-parse.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/builtin/rev-parse.c b/builtin/rev-parse.c
index 4da1f1da2..cfb0f1510 100644
--- a/builtin/rev-parse.c
+++ b/builtin/rev-parse.c
@@ -671,8 +671,9 @@ int cmd_rev_parse(int argc, const char **argv, const char 
*prefix)
filter &= ~(DO_FLAGS|DO_NOREV);
verify = 1;
abbrev = DEFAULT_ABBREV;
-   if (arg[7] == '=')
-   abbrev = strtoul(arg + 8, NULL, 10);
+   if (!arg[7])
+   continue;
+   abbrev = strtoul(arg + 8, NULL, 10);
if (abbrev < MINIMUM_ABBREV)
abbrev = MINIMUM_ABBREV;
else if (40 <= abbrev)

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-29 Thread Junio C Hamano

Junio C Hamano  writes:

> Linus Torvalds  writes:
>
>> So this patch may actually be "production ready" apart from the fact
>> that some tests still fail (at least t2027-worktree-list.sh) because
>> of different short SHA1 cases.
>
> t2027 has at least two problems.
>
>  * "git worktree" does not read the core.abbrev configuration,
>without a recent fix in jc/worktree-config, i.e. d49028e6
>("worktree: honor configuration variables", 2016-09-26).
>
>  * The script uses "git rev-parse --short HEAD"; I suspect that it
>says "ah, default_abbrev is -1 and minimum_abbrev is 4, so let's
>try abbreviating to 4 hexdigits".
>
> The first failure in t3203 seems to come from the same issue in
> "rev-parse --short".

A quick and dirty fix for it may look like this.

We leave the variable abbrev to DEFAULT_ABBREV and let
find_unique_abbrev() react to "eh, -1? I need to do the
auto-scaling".  "git diff-tree --abbrev" seems to have a similar
problem, and the fix is the same.

There still are breakages seen in t5510 and t5526 that are about the
verbose output of "git fetch".  I'll stop digging at this point
tonight, and welcome others who look into it ;-)

 builtin/rev-parse.c | 14 --
 diff.c  |  2 +-
 2 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/builtin/rev-parse.c b/builtin/rev-parse.c
index 76cf05e2ad..f8c8c6c22e 100644
--- a/builtin/rev-parse.c
+++ b/builtin/rev-parse.c
@@ -642,13 +642,15 @@ int cmd_rev_parse(int argc, const char **argv, const char 
*prefix)
starts_with(arg, "--short=")) {
filter &= ~(DO_FLAGS|DO_NOREV);
verify = 1;
-   abbrev = DEFAULT_ABBREV;
-   if (arg[7] == '=')
+   if (arg[7] != '=') {
+   abbrev = DEFAULT_ABBREV;
+   } else {
abbrev = strtoul(arg + 8, NULL, 10);
-   if (abbrev < MINIMUM_ABBREV)
-   abbrev = MINIMUM_ABBREV;
-   else if (40 <= abbrev)
-   abbrev = 40;
+   if (abbrev < MINIMUM_ABBREV)
+   abbrev = MINIMUM_ABBREV;
+   else if (40 <= abbrev)
+   abbrev = 40;
+   }
continue;
}
if (!strcmp(arg, "--sq")) {
diff --git a/diff.c b/diff.c
index c6da383c56..cefc13eb8e 100644
--- a/diff.c
+++ b/diff.c
@@ -3399,7 +3399,7 @@ void diff_setup_done(struct diff_options *options)
 */
read_cache();
}
-   if (options->abbrev <= 0 || 40 < options->abbrev)
+   if (40 < options->abbrev)
options->abbrev = 40; /* full */
 
/*

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-29 Thread Junio C Hamano

Linus Torvalds  writes:

> So this patch may actually be "production ready" apart from the fact
> that some tests still fail (at least t2027-worktree-list.sh) because
> of different short SHA1 cases.

t2027 has at least two problems.

 * "git worktree" does not read the core.abbrev configuration,
   without a recent fix in jc/worktree-config, i.e. d49028e6
   ("worktree: honor configuration variables", 2016-09-26).

 * The script uses "git rev-parse --short HEAD"; I suspect that it
   says "ah, default_abbrev is -1 and minimum_abbrev is 4, so let's
   try abbreviating to 4 hexdigits".

The first failure in t3203 seems to come from the same issue in
"rev-parse --short".

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-29 Thread Mike Hommey

On Thu, Sep 29, 2016 at 12:06:23PM -0700, Linus Torvalds wrote:
> On Thu, Sep 29, 2016 at 11:55 AM, Linus Torvalds
>  wrote:
> >
> > For the kernel, just the *math* right now actually gives 12
> > characters. For current git it actually seems to say that 8 is the
> > correct number. For small projects, you'll still see 7.
> 
> Sorry, the git number is 9, not 8. The reason is that git has roughly
> 212k objects, and 9 hex digits gets expected collisions at about 256k
> objects.
> 
> So the logic means that we'll see 7 hex digits for projects with less
> than 16k objects, 8 hex digits if there are less than 64k objects, and
> 9 hex digits for projects like git that currently have fewer than 256k
> objects.
> 
> But git itself might not be *that* far from going to 10 hex digits
> with my patch.
> 
> The kernel uses 12 he digits because the collision math says that's
> the right thing for a project with between 4M and 16M objects (with
> the kernel being at 5M).

OTOH, how often does one refer to trees or blobs with abbreviated sha1s?
Most of the time, you'd use abbreviated sha1s for commits. And the number
of commits in git and the kernel repositories are much lower than the
number of overall objects.

rev-list --all --count on the git repo gives me 46790. On the kernel, it
gives 618078.

Now, the interesting thing is looking at the *actual* collisions in
those spaces.

At 9 digits, there's only one commit collision in the kernel repo:
  45f014c5264f5e68ef0e51b36f4ef5ede3d18397
  45f014c52eef022873b19d6a20eb0ec9668f2b09

And two commit collisions at 8 digits in the git repo:
  1536dd9c1df0b7167b139f080cc4774ef63f
  1536dd9c61b5582cf07057cb715dd6dc6620

  2e6e3e82ee36b3e1bec1db8db24817270080424e
  2e6e3e829f3759823d70e7af511bc04cd05ad0af

At 7 digits, there are 5 actual commit collisions in the git repo and
718 in the kernel repo only one of those collisions involve more than 2
commits.

Mike

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-29 Thread Linus Torvalds

On Thu, Sep 29, 2016 at 5:57 PM, Linus Torvalds
 wrote:
>
> Actually, all the other cases seem to be "parse a SHA1 with a known
> length", so they really don't have a negative length.  So this seems
> ok, and is easier to verify than the "what all contexts might use
> DEFAULT_ABBREV" thing. There's only a few callers, and it's a static
> function so it's easy to check it locally in sha1_name.c.

Here's my original patch with just a tiny change that instead of
starting the automatic guessing at 7 each time, it starts at
"default_automatic_abbrev", which is initialized to 7.

The difference is that if we decide that "oh, that was too small, need
to repeat", we also update that "default_automatic_abbrev" value, so
that we won't start at the number that we now know was too small.

So it still loops over the abbrev values, but now it only loops a
couple of times.

I actually verified the performance impact by doing

  time git rev-list --abbrev-commit HEAD > /dev/null

on the kernel git tree, and it does actually matter. With my original
patch, we wasted a noticeable amount of time on just the extra
looping, with this it's down to the same performance as just doing it
once at init time (it's about 12s vs 9s on my laptop).

So this patch may actually be "production ready" apart from the fact
that some tests still fail (at least t2027-worktree-list.sh) because
of different short SHA1 cases.

 Linus
 cache.h   |  1 +
 environment.c |  2 +-
 sha1_name.c   | 26 +-
 3 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/cache.h b/cache.h
index 6e33f2f28..d2da6d186 100644
--- a/cache.h
+++ b/cache.h
@@ -1207,6 +1207,7 @@ struct object_context {
 #define GET_SHA1_TREEISH  020
 #define GET_SHA1_BLOB 040
 #define GET_SHA1_FOLLOW_SYMLINKS 0100
+#define GET_SHA1_AUTOMATIC  0200
 #define GET_SHA1_ONLY_TO_DIE04000
 
 #define GET_SHA1_DISAMBIGUATORS \
diff --git a/environment.c b/environment.c
index c1442df9a..fd6681e46 100644
--- a/environment.c
+++ b/environment.c
@@ -16,7 +16,7 @@ int trust_executable_bit = 1;
 int trust_ctime = 1;
 int check_stat = 1;
 int has_symlinks = 1;
-int minimum_abbrev = 4, default_abbrev = 7;
+int minimum_abbrev = 4, default_abbrev = -1;
 int ignore_case;
 int assume_unchanged;
 int prefer_symlink_refs;
diff --git a/sha1_name.c b/sha1_name.c
index 3b647fd7c..1003c96ea 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -15,6 +15,7 @@ typedef int (*disambiguate_hint_fn)(const unsigned char *, 
void *);
 
 struct disambiguate_state {
int len; /* length of prefix in hex chars */
+   unsigned int nrobjects;
char hex_pfx[GIT_SHA1_HEXSZ + 1];
unsigned char bin_pfx[GIT_SHA1_RAWSZ];
 
@@ -118,6 +119,12 @@ static void find_short_object_filename(struct 
disambiguate_state *ds)
 
if (strlen(de->d_name) != 38)
continue;
+
+   // We only look at the one subdirectory, and we assume
+   // each subdirectory is roughly similar, so each object
+   // we find probably has 255 other objects in the other
+   // fan-out directories
+   ds->nrobjects += 256;
if (memcmp(de->d_name, ds->hex_pfx + 2, ds->len - 2))
continue;
memcpy(hex + 2, de->d_name, 38);
@@ -151,6 +158,7 @@ static void unique_in_pack(struct packed_git *p,
 
open_pack_index(p);
num = p->num_objects;
+   ds->nrobjects += num;
last = num;
while (first < last) {
uint32_t mid = (first + last) / 2;
@@ -380,6 +388,9 @@ static int show_ambiguous_object(const unsigned char *sha1, 
void *data)
return 0;
 }
 
+// Why seven? That's our historical default before the automatic abbreviation
+static int default_automatic_abbrev = 7;
+
 static int get_short_sha1(const char *name, int len, unsigned char *sha1,
  unsigned flags)
 {
@@ -426,6 +437,14 @@ static int get_short_sha1(const char *name, int len, 
unsigned char *sha1,
for_each_abbrev(ds.hex_pfx, show_ambiguous_object, &ds);
}
 
+   if (len < 16 && !status && (flags & GET_SHA1_AUTOMATIC)) {
+   unsigned int expect_collision = 1 << (len * 2);
+   if (ds.nrobjects > expect_collision) {
+   default_automatic_abbrev = len+1;
+   return SHORT_NAME_AMBIGUOUS;
+   }
+   }
+
return status;
 }
 
@@ -458,14 +477,19 @@ int for_each_abbrev(const char *prefix, each_abbrev_fn 
fn, void *cb_data)
 int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
 {
int status, exists;
+   int flags = GET_SHA1_QUIETLY;
 
+   if (len < 0) {
+   flags |= GET_SHA1_AUTOMATIC;
+   len = default_automatic_abbrev;
+   }
sha1_to_hex_r(hex, sha1

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-29 Thread Linus Torvalds

On Thu, Sep 29, 2016 at 5:56 PM, Mike Hommey  wrote:
>
> OTOH, how often does one refer to trees or blobs with abbreviated sha1s?
> Most of the time, you'd use abbreviated sha1s for commits. And the number
> of commits in git and the kernel repositories are much lower than the
> number of overall objects.

See that whole other discussion about this. I agree. If we only ever
worried about just commits, the abbreviation length wouldn't need to
be grown nearly as aggressively. The current default would still be
wrong for the kernel, but it wouldn't be as noticeably wrong, and
updating it to 8 or 9 would be fine.

That said, people argued against that too. We *do* end up having
abbreviated SHA1's for blobs in the diff index. When I said that _I_
neer use it, somebody piped up to say that they do.

So I'd rather just keep the existing semantics (a hash is a hash is a
hash), and just abbreviate at a sufficient point that we don't have to
worry too much about disambiguating further by object type.

  Linus

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-29 Thread Linus Torvalds

On Thu, Sep 29, 2016 at 5:28 PM, Linus Torvalds
 wrote:
>
> To be fair, my original patch had a different worry that I didn't
> bother with: what if one of the _other_ callers of "get_short_sha1()"
> passed in -1 to it.  I only handled the -1 case in th eone path care
> about in that first RFC for testing. So I'm *not* suggesting you
> should apply my first version,, It has issues too.

Actually, all the other cases seem to be "parse a SHA1 with a known
length", so they really don't have a negative length.  So this seems
ok, and is easier to verify than the "what all contexts might use
DEFAULT_ABBREV" thing. There's only a few callers, and it's a static
function so it's easy to check it locally in sha1_name.c.

   Linus

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-29 Thread Linus Torvalds

On Thu, Sep 29, 2016 at 5:20 PM, Linus Torvalds
 wrote:
>
> As you say, my original patch had neither of those issues.

To be fair, my original patch had a different worry that I didn't
bother with: what if one of the _other_ callers of "get_short_sha1()"
passed in -1 to it.  I only handled the -1 case in th eone path care
about in that first RFC for testing. So I'm *not* suggesting you
should apply my first version,, It has issues too.

Let me see if I can massage my first hacky RFC test-patch into
something more reliable.

  Linus

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-29 Thread Linus Torvalds

On Thu, Sep 29, 2016 at 4:13 PM, Junio C Hamano  wrote:
>
> One thing that worries me is if we are ready to start accessing the
> object store in all codepaths when we ask for DEFAULT_ABBREV.

Yes. That was my main worry too. I also looked at just doing an explicit

 if (abbrev_commit && default_abbrev < 0)
  default_abbrev = get_default_abbrev();

and in many ways that would be nicer exactly because the point where
this happens is then explicit, instead of being hidden behind that
macro that may end up being done in random places.

But it wasn't entirely obvious which all paths would need that
initialization either, so on the whole it was very much a "six of one,
half a dozen of the other" thing.

As you say, my original patch had neither of those issues. It just
stupidly re-did the loop over and over, and maybe the right thing to
do is to have that original code, but just short-circuit the "over and
over" behavior by just resetting default_abbrev to the value we do
find.

  Linus

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-29 Thread Junio C Hamano

Junio C Hamano  writes:

> Linus Torvalds  writes:
>
>> The advantage of the
>> previous patch was that it got the object counting right almost
>> automatically, this actually has its own new object counting code and
>> maybe I screwed it up.

I guess another advantage of your original approach was that it
delayed the counting to the very last minute, so the things that
worried me in my previous response were automatically made
non-issues.

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-29 Thread Junio C Hamano

Linus Torvalds  writes:

> Somebody should really double-check my heuristics, to see that I did
> the pack counting etc right.  It doesn't do alternate loose file
> counting at all, and maybe it could matter.  The advantage of the
> previous patch was that it got the object counting right almost
> automatically, this actually has its own new object counting code and
> maybe I screwed it up.

One thing that worries me is if we are ready to start accessing the
object store in all codepaths when we ask for DEFAULT_ABBREV.  The
worries are twofold:

 (1) Do we do the right thing if object store is not available to
 us?  Some commands can be run outside repository, and if our
 call to prepare_packed_git() or loose object iteration barfed
 in some way, that would introduce a regression.

 (2) Is calling prepare_packed_git() too early interfere with how
 the commands expect its own prepare_packed_git() work?  That
 is, if a command has this sequence, "ask DEFAULT_ABBREV,
 arrange things, and then call prepare_packed_git()", and the
 existing "arrange things" step had something that causes a new
 pack to become eligible to be read by prepare_packed_git(),
 like adding to the list of alternate object stores, its own
 prepare_packed_git() will now become a no-op.

I browsed through "tig grep DEFAULT_ABBREV \*.c" and it seems that
in majority of the hits, we not just are ready to start accessing,
but already have an object or two, which must have come from an
already open object store, so they are OK.  Especially the ones that
use it as the last argument to find_unique_abbrev() are OK as we are
about to open the object store to do the computation.

There are very early ones in the program startup sequence in the
following functions, but I do not think of a reason why our new and
early call to prepare_packed_git() might be problematic, given that
all of them require us to have an access to the repository (i.e.
this change cannot introduce a regression where a command used to
work outside a repository but barf when prepare_packed_git() is
called early):

 - builtin/describe.c
 - builtin/rev-list.c
 - builtin/rev-parse.c

I thought that the one in diff.c might be problematic when the "git
diff" command is run outside a repository with the "--no-index"
option, but it appears that init_default_abbrev() seems to be OK
when run outside a repository.

There is one in parse-options-cb.c that is used to parse the --abbrev
command line option.  This might cause a cosmetic problem but when
the user is asking for an abbreviation, it is expected that we will
have an access to the object store anyway, so it may be OK.

I am sorry that none of the above is about your math ;-)  I suck at
math so I won't comment.

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-29 Thread Linus Torvalds

On Thu, Sep 29, 2016 at 12:45 PM, Junio C Hamano  wrote:
>
> I think that is a reasonable way to go.
>
> #define DEFAULT_ABBREV get_default_abbrev()
>
> would help.

So something like this that replaces the previous patch?

Somebody should really double-check my heuristics, to see that I did
the pack counting etc right.  It doesn't do alternate loose file
counting at all, and maybe it could matter.  The advantage of the
previous patch was that it got the object counting right almost
automatically, this actually has its own new object counting code and
maybe I screwed it up.

Linus
 cache.h   |  3 ++-
 environment.c |  2 +-
 sha1_file.c   | 43 +++
 3 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/cache.h b/cache.h
index 6e33f2f28..a022e1bd2 100644
--- a/cache.h
+++ b/cache.h
@@ -1186,8 +1186,9 @@ static inline int hex2chr(const char *s)
 }
 
 /* Convert to/from hex/sha1 representation */
+extern int get_default_abbrev(void);
 #define MINIMUM_ABBREV minimum_abbrev
-#define DEFAULT_ABBREV default_abbrev
+#define DEFAULT_ABBREV get_default_abbrev()
 
 struct object_context {
unsigned char tree[20];
diff --git a/environment.c b/environment.c
index c1442df9a..fd6681e46 100644
--- a/environment.c
+++ b/environment.c
@@ -16,7 +16,7 @@ int trust_executable_bit = 1;
 int trust_ctime = 1;
 int check_stat = 1;
 int has_symlinks = 1;
-int minimum_abbrev = 4, default_abbrev = 7;
+int minimum_abbrev = 4, default_abbrev = -1;
 int ignore_case;
 int assume_unchanged;
 int prefer_symlink_refs;
diff --git a/sha1_file.c b/sha1_file.c
index ca149a607..28ba04b65 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -3720,3 +3720,46 @@ int for_each_packed_object(each_packed_object_fn cb, 
void *data, unsigned flags)
}
return r ? r : pack_errors;
 }
+
+static int init_default_abbrev(void)
+{
+   unsigned long count = 0;
+   struct packed_git *p;
+   struct strbuf buf = STRBUF_INIT;
+   DIR *dir;
+   char *name;
+   int ret;
+
+   prepare_packed_git();
+   for (p = packed_git; p; p = p->next) {
+   if (open_pack_index(p))
+   continue;
+   count += p->num_objects;
+   }
+
+   strbuf_addstr(&buf, get_object_directory());
+   strbuf_addstr(&buf, "/42/");
+   name = strbuf_detach(&buf, NULL);
+   dir = opendir(name);
+   free(name);
+   if (dir) {
+   struct dirent *de;
+   while ((de = readdir(dir)) != NULL) {
+   count += 256;
+   }
+   closedir(dir);
+   }
+   for (ret = 7; ret < 15; ret++) {
+   unsigned long expect_collision = 1ul << (ret * 2);
+   if (count < expect_collision)
+   break;
+   }
+   return ret;
+}
+
+int get_default_abbrev(void)
+{
+   if (default_abbrev < 0)
+   default_abbrev = init_default_abbrev();
+   return default_abbrev;
+}

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-29 Thread Junio C Hamano

Linus Torvalds  writes:

> But you could easily also just instead have it do something like
>
>   if (default_abbrev < 0)
> default_abbrev = initialize_abbrev();
>
> at startup time if "abbrev_commit" is set, and just do it once and for
> all rather rthan the odd loping behavior.

I think that is a reasonable way to go.

#define DEFAULT_ABBREV get_default_abbrev()

would help.

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-29 Thread Junio C Hamano

Linus Torvalds  writes:

> On Thu, Sep 29, 2016 at 11:55 AM, Linus Torvalds
>  wrote:
>>
>> For the kernel, just the *math* right now actually gives 12
>> characters. For current git it actually seems to say that 8 is the
>> correct number. For small projects, you'll still see 7.
>
> Sorry, the git number is 9, not 8. The reason is that git has roughly
> 212k objects, and 9 hex digits gets expected collisions at about 256k
> objects.
>
> So the logic means that we'll see 7 hex digits for projects with less
> than 16k objects, 8 hex digits if there are less than 64k objects, and
> 9 hex digits for projects like git that currently have fewer than 256k
> objects.

Whew.  I was wondering where my brain went wrong, as I knew we have
200k objects and 8 hexdigits means 1<<16 = 64k which is way too
short.

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-29 Thread Linus Torvalds

On Thu, Sep 29, 2016 at 12:16 PM, Jeff King  wrote:
>
> Hmm. So at length 7, we expect collisions at 2^14, which is 16384. That
> seems really low. I mean, by the birthday paradox that's where expect
> a 50% chance of a collision. But that's a single collision. We
> definitely don't expect them to be common at that size.
>
> So I suspect this could be a bit looser.

So I have to admit that I was surprised by how quickly it actually
decided that 7 isn't enough. In fact, the reason I initially said that
git used 8 digits was that I didn't count very closely, and just
verified that it was more than the default 7.

But quite frankly, I think the math is correct, and part of that is
that the logic is all about not just the current state, but the
"reasonably near future".

So it is indeed fairly aggressive, and the moment you have more
objects than the "we'd expect to probably see  _one_ collision" it
grows the size. But looking at the kernel situation, that really is
what we'd want, because the whole problem with the existing code is
that it only takes the *current* situation into account. That's what
we want to get away from. We want git to pick a number that is sane
from a standpoint of "this project is still growing".

And git _already_ has commits that are ambiguous in 8 hex digits and
need 9. Yes, it's rare today, but the reason I'm telling kernel
developers to use 12 is because while a size-11 collision is very rare
today, it does actually happen, and we want o pick a value where it is
rare enough that even in the near future it's not going to be a big
deal.

Don't get me wrong: collisions aren't fatal. So it's not like we have
to absolutely avoid them, and I really like your patch series exactly
because it makes collisions even less of a deal (particularly since I
expect people will not upgrade immediately, so we'll continue to see
even new 7-hex-digit short forms even in the kernel). So it's a
balance of making the hex string long enough that it's simply not a
big worry.

So I'm sure it *could* be looser, but I actually also really suspect
that git truly *should* use a 9-digit abbreviation rather than 8 (and
7 is definitely starting to be borderline, I think).

> As far as the implementation, I was surprised to see it touch
> get_short_sha1() at all. That's, after all, for lookups, and we would
> never want to require more characters on the reading side.

Heh. The implementation is crap. It was literally a "how can I make
the smallest possible patch" implementation. I was finishing it off
while at a talk by Nicolas Pitre at Linaro Connect where I am right
now.

So I agree - it does extra work just because that's where it all
slotted in with minimal effort.

At a minimum, once it finds a good new default, it should just memoize
that. So a minimal fix to the "it's stupldly recalculating things over
rand over again" would be to just set "default_abbrev" to the value it
finds acceptable after the first time it finds something, so that it
doesn't end up looping _again_ in the future.

But you could easily also just instead have it do something like

  if (default_abbrev < 0)
default_abbrev = initialize_abbrev();

at startup time if "abbrev_commit" is set, and just do it once and for
all rather rthan the odd loping behavior.

I really just wanted to see how well the concept worked, and I was
happy to see that it gave what I thought were the "correct" numbers.
And the loop was salready there ...

Linus

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-29 Thread Jeff King

On Thu, Sep 29, 2016 at 11:55:46AM -0700, Linus Torvalds wrote:

> I think the patch can speak for itself, but the basic core is this
> section in get_short_sha1():
> 
>   +   if (len < 16 && !status && (flags & GET_SHA1_AUTOMATIC)) {
>   +   unsigned int expect_collision = 1 << (len * 2);
>   +   if (ds.nrobjects > expect_collision)
>   +   return SHORT_NAME_AMBIGUOUS;
>   +   }

Hmm. So at length 7, we expect collisions at 2^14, which is 16384. That
seems really low. I mean, by the birthday paradox that's where expect
a 50% chance of a collision. But that's a single collision. We
definitely don't expect them to be common at that size.

So I suspect this could be a bit looser. The real number we care about
is probably something like "there is probability 'p' of a collision when
we add a new object", but I'm not sure what that 'p' would be. Or
perhaps "we accept collisions in 'n' percent of objects". But again, I
don't know that 'n'.

I dunno. I suppose being overly conservative with this number leaves
room for growth. Repositories generally get bigger, not smaller. :)

> What do you think? It's actually a fairly simple patch and I really do
> think it makes sense and it seems to just DTRT automatically.

I like the general idea.

As far as the implementation, I was surprised to see it touch
get_short_sha1() at all. That's, after all, for lookups, and we would
never want to require more characters on the reading side.

I see you worked around it with a flag so that this behavior only kicks
in when called via find_unique_abbrev(). But if you look at the caller:

> @@ -458,14 +472,19 @@ int for_each_abbrev(const char *prefix, each_abbrev_fn 
> fn, void *cb_data)
>  int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
>  {
>   int status, exists;
> + int flags = GET_SHA1_QUIETLY;
>  
> + if (len < 0) {
> + flags |= GET_SHA1_AUTOMATIC;
> + len = 7;
> + }
>   sha1_to_hex_r(hex, sha1);
>   if (len == 40 || !len)
>   return 40;
>   exists = has_sha1_file(sha1);
>   while (len < 40) {
>   unsigned char sha1_ret[20];
> - status = get_short_sha1(hex, len, sha1_ret, GET_SHA1_QUIETLY);
> + status = get_short_sha1(hex, len, sha1_ret, flags);
>   if (exists
>   ? !status
>   : status == SHORT_NAME_NOT_FOUND) {

You can see that we're going to do more work than we would otherwise
need to. Because we start at 7, and ask get_short_sha1() "is this unique
enough?", and looping. But if we _know_ we won't accept any answer
shorter than some N based on the number of objects in the repository,
then we should start at that N.

IOW, something like:

  if (len < 0)
len = ceil(log_base_2(repository_object_count()));

here, and then you don't have to touch get_short_sha1() at all.

I suspect you pushed it down into get_short_sha1() because it kind-of
does the repository_object_count() step for "free" as it's looking at
the object anyway. But that step is really not very expensive. And I'd
even say you could just ignore loose objects entirely, and treat them
like a rounding error (the way that duplicate objects in packs are
treated).

That leaves you with just an O(# of packs) loop over a linked list. You
could even just keep a global object count up to date in
add_packed_git(), and then it's O(1).

-Peff

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-29 Thread Linus Torvalds

On Thu, Sep 29, 2016 at 11:55 AM, Linus Torvalds
 wrote:
>
> For the kernel, just the *math* right now actually gives 12
> characters. For current git it actually seems to say that 8 is the
> correct number. For small projects, you'll still see 7.

Sorry, the git number is 9, not 8. The reason is that git has roughly
212k objects, and 9 hex digits gets expected collisions at about 256k
objects.

So the logic means that we'll see 7 hex digits for projects with less
than 16k objects, 8 hex digits if there are less than 64k objects, and
9 hex digits for projects like git that currently have fewer than 256k
objects.

But git itself might not be *that* far from going to 10 hex digits
with my patch.

The kernel uses 12 he digits because the collision math says that's
the right thing for a project with between 4M and 16M objects (with
the kernel being at 5M).

So on the whole the patch really does seem to just do the right thing
automatically.

  Linus

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-29 Thread Linus Torvalds

On Thu, Sep 29, 2016 at 11:37 AM, Linus Torvalds
 wrote:
>
> I'm playing with an early patch to make the default more dynamic.
> Let's see how well it works in practice, but it looks fairly
> promising. Let me test a bit more and send out an RFC patch..

Ok, this is *very* rough, and it doesn't actuall pass all the tests,
and I didn't even try to look at why. But it passes the trivial
smell-test, and in particular it actually makes mathematical sense...

I think the patch can speak for itself, but the basic core is this
section in get_short_sha1():

  +   if (len < 16 && !status && (flags & GET_SHA1_AUTOMATIC)) {
  +   unsigned int expect_collision = 1 << (len * 2);
  +   if (ds.nrobjects > expect_collision)
  +   return SHORT_NAME_AMBIGUOUS;
  +   }

basically, what it says is that we will consider a sha1 ambiguous even
if it was *technically* unique (that's the '!status' part of the test)
if:

 - the length was 15 or less

*and*

 - the number of objects we have is larger than the expected point
where statistically we should start to expect to get one collision.

That "expect_collision" math is actually very simple: each hex
character adds four bits of range, but since we expect collisions at
the square root of the maximum number of objects, we shift by just two
bits per hex digits instead.

The rest of the patch is a trivial change to just initialize the
default short size to -1, and consider that to mean "enable the
automatic size checking with a minimum of 7". And the trivial code to
estimate the number of objects (which ignores duplicates between packs
etc _entirely_).

For the kernel, just the *math* right now actually gives 12
characters. For current git it actually seems to say that 8 is the
correct number. For small projects, you'll still see 7.

ANYWAY. This patch is on top of Jeff's patches in 'pu' (I think those
are great regardless of this patch!), and as mentioned, it fails some
tests. I suspect that the failures might be due to the abbrev_default
being -1, and some other code finds that surprising now. But as
mentioned, I didn't really even look at it.

What do you think? It's actually a fairly simple patch and I really do
think it makes sense and it seems to just DTRT automatically.

  Linus
 cache.h   |  1 +
 environment.c |  2 +-
 sha1_name.c   | 21 -
 3 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/cache.h b/cache.h
index 6e33f2f..d2da6d1 100644
--- a/cache.h
+++ b/cache.h
@@ -1207,6 +1207,7 @@ struct object_context {
 #define GET_SHA1_TREEISH  020
 #define GET_SHA1_BLOB 040
 #define GET_SHA1_FOLLOW_SYMLINKS 0100
+#define GET_SHA1_AUTOMATIC  0200
 #define GET_SHA1_ONLY_TO_DIE04000

 #define GET_SHA1_DISAMBIGUATORS \
diff --git a/environment.c b/environment.c
index c1442df..fd6681e 100644
--- a/environment.c
+++ b/environment.c
@@ -16,7 +16,7 @@ int trust_executable_bit = 1;
 int trust_ctime = 1;
 int check_stat = 1;
 int has_symlinks = 1;
-int minimum_abbrev = 4, default_abbrev = 7;
+int minimum_abbrev = 4, default_abbrev = -1;
 int ignore_case;
 int assume_unchanged;
 int prefer_symlink_refs;
diff --git a/sha1_name.c b/sha1_name.c
index 3b647fd..8791ff3 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -15,6 +15,7 @@ typedef int (*disambiguate_hint_fn)(const unsigned char *, 
void *);

 struct disambiguate_state {
int len; /* length of prefix in hex chars */
+   unsigned int nrobjects;
char hex_pfx[GIT_SHA1_HEXSZ + 1];
unsigned char bin_pfx[GIT_SHA1_RAWSZ];

@@ -118,6 +119,12 @@ static void find_short_object_filename(struct 
disambiguate_state *ds)

if (strlen(de->d_name) != 38)
continue;
+
+   // We only look at the one subdirectory, and we assume
+   // each subdirectory is roughly similar, so each object
+   // we find probably has 255 other objects in the other
+   // fan-out directories
+   ds->nrobjects += 256;
if (memcmp(de->d_name, ds->hex_pfx + 2, ds->len - 2))
continue;
memcpy(hex + 2, de->d_name, 38);
@@ -151,6 +158,7 @@ static void unique_in_pack(struct packed_git *p,

open_pack_index(p);
num = p->num_objects;
+   ds->nrobjects += num;
last = num;
while (first < last) {
uint32_t mid = (first + last) / 2;
@@ -426,6 +434,12 @@ static int get_short_sha1(const char *name, int len, 
unsigned char *sha1,
for_each_abbrev(ds.hex_pfx, show_ambiguous_object, &ds);
}

+   if (len < 16 && !status && (flags & GET_SHA1_AUTOMATIC)) {
+   unsigned int expect_collision = 1 << (len * 2);
+   if (ds.nrobjects > expect_collision)
+   return SHORT_NAME_AMBIGUOUS;
+   }
+

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-29 Thread Linus Torvalds

On Thu, Sep 29, 2016 at 11:05 AM, Junio C Hamano  wrote:
>
> Yes, "git log --oneline" looks somewhat different and strange for
> me, too ;-)

I'm playing with an early patch to make the default more dynamic.
Let's see how well it works in practice, but it looks fairly
promising. Let me test a bit more and send out an RFC patch..

  Linus

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-29 Thread Junio C Hamano

Johannes Sixt  writes:

> Am 29.09.2016 um 01:30 schrieb Junio C Hamano:
>> As Peff said, responding in a thread started by Linus's suggestion
>> to raise the default abbreviation to 12 hexdigits:
>
> This is waayy too large for a new default. The vast majority of
> repositories is smallish. For those, the long sequences of hex digits
> are an uglification that is almost unbearable.
>
> I know that kernel developers are important, but their importance has
> long been outnumbered by the anonymous and silent masses of users.
>
> Personally, I use 8 digits just because it is a "rounder" number than
> 7, but in all of my repositories 7 would still work just as well.

Yes, "git log --oneline" looks somewhat different and strange for
me, too ;-)

I am sure I'll get used to it if I keep using it, but I suspect that
I'd be irritated as I find myself typing 'q' more and more often to
"less -S" that is automatically invoked when I do "git log --oneline
master.." to see what commits are on my current topic branch.

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-29 Thread SZEDER Gábor

Quoting Jeff King :

On Thu, Sep 29, 2016 at 04:44:00AM +0200, SZEDER Gábor wrote:

> So 12 seems reasonable, and the only downside for it (or for "13", for
> that matter) is a few extra bytes. I dunno, maybe people will really
> hate that, but I have a feeling these are mostly  
cut-and-pasted anyway.

I for one raise my hand in protest...

"few extra bytes" is not the only downside, and it's not at all about
how many characters are copy-and-pasted.  In my opinion it's much more
important that this change wastes 5 columns worth of valuable screen
real estate e.g. for 'git blame' or 'git log --oneline' in projects
that don't need it and certainly won't ever need it.

True. The core of the issue is that we really only care about this
minimum length when _storing_ an abbreviation, but we don't know when
the user is just looking at it in the moment, and when they are going to
stick it in a commit message, email, or bug tracker.

In an ideal world, anybody who was about to store it would run "git
describe" or something to come up with some canonical reference format.
And we could just bump the default minimum there. Personally, I almost
exclusively cite commits as the output of:

  git log -1 --pretty='tformat:%h (%s, %ad)' --date=short

Interesting, I have a pretty format alias that looks almost like this,
except that I carry a patch locally allowing me to say %as for short
date format :)

What I sometimes wished for is a pretty format specifier for 'git
describe --contains', which would make it convenient to cite commits
like this: v0.99~954 (Initial revision of "git", the information manager
from hell, 2005-04-07).  It's better than the abbreviated object name,
because it will stay unique, assuming that the chosen tag is never
deleted, and it carries extra information for humans (the first release
containing the referenced commit), while the abbreviated object name is
completely meaningless.

The obvious drawback that makes it a non-solution for the problem at
hand is that this format can only refer to commits that are reachable
from a tag and can't be used for commits that are descendants of the
most recent tag, e.g. when fixing a bug introduced after the last
release.  Oh, and the user has to fetch the tag first to be able to
make sense of such a reference.

and I'd be fine to stick "--abbrev=12" in there for future-proofing. But
I don't know what the kernel or other projects do.

I'd also be curious to know if the patch I sent in [1] to more
aggressively prefer commits would make this less of an issue, and people
wouldn't care as much about using longer hashes in the first place. So
one option is to merge that (and possibly even make it the default) and
see if people still care in 6 months.

-Peff

[1]  
http://public-inbox.org/git/20160927123801.3bpdg3hap3kzz...@sigill.intra.peff.net/

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-29 Thread Matthieu Moy

Jeff King  writes:

> On Thu, Sep 29, 2016 at 04:44:00AM +0200, SZEDER Gábor wrote:
>
>> > So 12 seems reasonable, and the only downside for it (or for "13", for
>> > that matter) is a few extra bytes. I dunno, maybe people will really
>> > hate that, but I have a feeling these are mostly cut-and-pasted anyway.
>> 
>> I for one raise my hand in protest...
>> 
>> "few extra bytes" is not the only downside, and it's not at all about
>> how many characters are copy-and-pasted.  In my opinion it's much more
>> important that this change wastes 5 columns worth of valuable screen
>> real estate e.g. for 'git blame' or 'git log --oneline' in projects
>> that don't need it and certainly won't ever need it.
>
> True. The core of the issue is that we really only care about this
> minimum length when _storing_ an abbreviation, but we don't know when
> the user is just looking at it in the moment, and when they are going to
> stick it in a commit message, email, or bug tracker.

Perhaps a compromise would be to adapt the length to the size of the
project _and_ keep a huge margin. So, essentially, we'd have small
projects stick to the 7 characters, and very quickly bump to 12.

So, for a fast-growing project, there would be a short window at the
beginning of the project where people could cut-and-past short hashes.
OTOH, small projects could keep these few columns of screen real-estate.

That said, I can certainly live without these 5 columns, don't take my
message as an objection to setting to 12 right away.

-- 
Matthieu Moy
http://www-verimag.imag.fr/~moy/

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-29 Thread Jeff King

On Wed, Sep 28, 2016 at 04:30:47PM -0700, Junio C Hamano wrote:

> As Peff said, responding in a thread started by Linus's suggestion
> to raise the default abbreviation to 12 hexdigits:
> 
> I actually think "12" might be sane for a long time. That's 48 bits of
> sha1, so we'd expect a 50% change of a _single_ collision at 2^24, or 16
> million.  The biggest repository I know about (in number of objects) is
> the one holding all of the objects for all of the forks of
> torvalds/linux on GitHub. It's at about 15 million objects.
> 
> Which _seems_ close, but remember that's the size where we expect to see
> a single collision. They don't become common until much later (I didn't
> compute an exact number, but Linus's 16x sounds about right). I know
> that the growth of the kernel isn't really linear, but I think the need
> to bump to "13" might not just be decades, but possibly a century or
> more.
> 
> So 12 seems reasonable, and the only downside for it (or for "13", for
> that matter) is a few extra bytes. I dunno, maybe people will really
> hate that, but I have a feeling these are mostly cut-and-pasted anyway.

I am not sure my quote is a good rationale for this bump. It was meant
to be a rationale that "12" is big enough, but the "I dunno" at the end
kind of glosses over the downsides.

-Peff

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-29 Thread Jeff King

On Thu, Sep 29, 2016 at 07:27:01AM +0200, Lukas Fleischer wrote:

> > Sure, users working on smaller repos are free to reset core.abbrev to
> > its original value.  I don't have any numbers, of course, but I
> > suspect that there are many more smaller repos out there that this
> > change will affect disadvantageously, than there are large repos for
> > which it's beneficial.
> 
> I know this suggestion comes a bit late but would it make sense to let
> the repository owner overwrite the core.abbrev setting?
> 
> One possible way to implement this would be adding .gitconfig support to
> repositories with a very limited set of whitelisted variables allowed in
> there (could be core.abbrev only to begin with). Or some entirely
> separate mechanism like .gitignore.

The suggestion for versioned repository-level config comes up from time
to time; you can find other instances in the list archive. Usually the
biggest issue is that usually nobody comes up with a good example of
something that the project would actually want to set. Setting
"core.abbrev" at least seems plausible.

Though...

> With such a mechanism, we could keep the default of 7 which works fine
> for most projects. Linus could bump the default to 12 for linux.git. If
> some users are not happy with that, they can still overwrite it in their
> local Git config. Anybody starting a project could change the initial
> value to a suitable value in one of the first commits -- provided they
> already have an idea how much the project will grow. That way, hashes
> will be "long enough" even for early commits, before any heuristics
> could guess that the project would become large.

I wonder if in practice we would do just as well to size default_abbrev
dynamically based on the number of objects. That doesn't help projects
which are just starting, but will eventually grow gigantic.  But I doubt
that most projects would have the foresight to preemptively set
core.abbrev. And that would at least reduce the impact as the project
_does_ get big.

-Peff

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-29 Thread Jeff King

On Thu, Sep 29, 2016 at 04:44:00AM +0200, SZEDER Gábor wrote:

> > So 12 seems reasonable, and the only downside for it (or for "13", for
> > that matter) is a few extra bytes. I dunno, maybe people will really
> > hate that, but I have a feeling these are mostly cut-and-pasted anyway.
> 
> I for one raise my hand in protest...
> 
> "few extra bytes" is not the only downside, and it's not at all about
> how many characters are copy-and-pasted.  In my opinion it's much more
> important that this change wastes 5 columns worth of valuable screen
> real estate e.g. for 'git blame' or 'git log --oneline' in projects
> that don't need it and certainly won't ever need it.

True. The core of the issue is that we really only care about this
minimum length when _storing_ an abbreviation, but we don't know when
the user is just looking at it in the moment, and when they are going to
stick it in a commit message, email, or bug tracker.

In an ideal world, anybody who was about to store it would run "git
describe" or something to come up with some canonical reference format.
And we could just bump the default minimum there. Personally, I almost
exclusively cite commits as the output of:

  git log -1 --pretty='tformat:%h (%s, %ad)' --date=short

and I'd be fine to stick "--abbrev=12" in there for future-proofing. But
I don't know what the kernel or other projects do.

I'd also be curious to know if the patch I sent in [1] to more
aggressively prefer commits would make this less of an issue, and people
wouldn't care as much about using longer hashes in the first place. So
one option is to merge that (and possibly even make it the default) and
see if people still care in 6 months.

-Peff

[1] 
http://public-inbox.org/git/20160927123801.3bpdg3hap3kzz...@sigill.intra.peff.net/

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-28 Thread Johannes Sixt


Am 29.09.2016 um 01:30 schrieb Junio C Hamano:

As Peff said, responding in a thread started by Linus's suggestion
to raise the default abbreviation to 12 hexdigits:


This is waayy too large for a new default. The vast majority of 
repositories is smallish. For those, the long sequences of hex digits 
are an uglification that is almost unbearable.


I know that kernel developers are important, but their importance has 
long been outnumbered by the anonymous and silent masses of users.


Personally, I use 8 digits just because it is a "rounder" number than 7, 
but in all of my repositories 7 would still work just as well.


-- Hannes

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-28 Thread Lukas Fleischer

On Thu, 29 Sep 2016 at 04:44:00, SZEDER Gábor wrote:
> I for one raise my hand in protest...
> 
> "few extra bytes" is not the only downside, and it's not at all about
> how many characters are copy-and-pasted.  In my opinion it's much more
> important that this change wastes 5 columns worth of valuable screen
> real estate e.g. for 'git blame' or 'git log --oneline' in projects
> that don't need it and certainly won't ever need it.
> 
> Sure, users working on smaller repos are free to reset core.abbrev to
> its original value.  I don't have any numbers, of course, but I
> suspect that there are many more smaller repos out there that this
> change will affect disadvantageously, than there are large repos for
> which it's beneficial.

I know this suggestion comes a bit late but would it make sense to let
the repository owner overwrite the core.abbrev setting?

One possible way to implement this would be adding .gitconfig support to
repositories with a very limited set of whitelisted variables allowed in
there (could be core.abbrev only to begin with). Or some entirely
separate mechanism like .gitignore.

With such a mechanism, we could keep the default of 7 which works fine
for most projects. Linus could bump the default to 12 for linux.git. If
some users are not happy with that, they can still overwrite it in their
local Git config. Anybody starting a project could change the initial
value to a suitable value in one of the first commits -- provided they
already have an idea how much the project will grow. That way, hashes
will be "long enough" even for early commits, before any heuristics
could guess that the project would become large.

Opinions?

Regards,
Lukas

Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-28 Thread SZEDER Gábor

> As Peff said, responding in a thread started by Linus's suggestion
> to raise the default abbreviation to 12 hexdigits:
> 
> I actually think "12" might be sane for a long time. That's 48 bits of
> sha1, so we'd expect a 50% change of a _single_ collision at 2^24, or 16

s/change/chance/

I know it's quoted, but still.

> million.  The biggest repository I know about (in number of objects) is
> the one holding all of the objects for all of the forks of
> torvalds/linux on GitHub. It's at about 15 million objects.
> 
> Which _seems_ close, but remember that's the size where we expect to see
> a single collision. They don't become common until much later (I didn't
> compute an exact number, but Linus's 16x sounds about right). I know
> that the growth of the kernel isn't really linear, but I think the need
> to bump to "13" might not just be decades, but possibly a century or
> more.
> 
> So 12 seems reasonable, and the only downside for it (or for "13", for
> that matter) is a few extra bytes. I dunno, maybe people will really
> hate that, but I have a feeling these are mostly cut-and-pasted anyway.

I for one raise my hand in protest...

"few extra bytes" is not the only downside, and it's not at all about
how many characters are copy-and-pasted.  In my opinion it's much more
important that this change wastes 5 columns worth of valuable screen
real estate e.g. for 'git blame' or 'git log --oneline' in projects
that don't need it and certainly won't ever need it.

Sure, users working on smaller repos are free to reset core.abbrev to
its original value.  I don't have any numbers, of course, but I
suspect that there are many more smaller repos out there that this
change will affect disadvantageously, than there are large repos for
which it's beneficial.


> And this does exactly that.
> 
> Keep the tests working by explicitly asking for the old 7 hexdigits
> setting in the fake system-wide configuration file used for tests.
> 
> Signed-off-by: Junio C Hamano

[PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

2016-09-28 Thread Junio C Hamano

As Peff said, responding in a thread started by Linus's suggestion
to raise the default abbreviation to 12 hexdigits:

I actually think "12" might be sane for a long time. That's 48 bits of
sha1, so we'd expect a 50% change of a _single_ collision at 2^24, or 16
million.  The biggest repository I know about (in number of objects) is
the one holding all of the objects for all of the forks of
torvalds/linux on GitHub. It's at about 15 million objects.

Which _seems_ close, but remember that's the size where we expect to see
a single collision. They don't become common until much later (I didn't
compute an exact number, but Linus's 16x sounds about right). I know
that the growth of the kernel isn't really linear, but I think the need
to bump to "13" might not just be decades, but possibly a century or
more.

So 12 seems reasonable, and the only downside for it (or for "13", for
that matter) is a few extra bytes. I dunno, maybe people will really
hate that, but I have a feeling these are mostly cut-and-pasted anyway.

And this does exactly that.

Keep the tests working by explicitly asking for the old 7 hexdigits
setting in the fake system-wide configuration file used for tests.

Signed-off-by: Junio C Hamano 
---
 environment.c| 2 +-
 t/gitconfig-for-test | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/environment.c b/environment.c
index ca72464..25daddb 100644
--- a/environment.c
+++ b/environment.c
@@ -16,7 +16,7 @@ int trust_executable_bit = 1;
 int trust_ctime = 1;
 int check_stat = 1;
 int has_symlinks = 1;
-int minimum_abbrev = 4, default_abbrev = 7;
+int minimum_abbrev = 4, default_abbrev = 12;
 int ignore_case;
 int assume_unchanged;
 int prefer_symlink_refs;
diff --git a/t/gitconfig-for-test b/t/gitconfig-for-test
index 4598885..8c28442 100644
--- a/t/gitconfig-for-test
+++ b/t/gitconfig-for-test
@@ -4,3 +4,6 @@
 ;; [user]
 ;; name = A U Thor
 ;; email = aut...@example.com
+
+[core]
+   abbrev = 7
-- 
2.10.0-584-gc9e068c

44 matches

Mail list logo