Re: [PATCH v2 4/5] get_sha1: speed up ambiguous 40-hex test

2014-01-14 Thread Michael Haggerty
On 01/14/2014 10:50 AM, Jeff King wrote:
> On Fri, Jan 10, 2014 at 04:41:20AM -0500, Jeff King wrote:
> 
>> That being said, we could further optimize this by not opening the files
>> at all (and make that the responsibility of do_one_ref, which we are
>> avoiding here). I am slightly worried about the open() cost of my
>> solution. It's amortized away in a big call, but it is probably
>> noticeable for something like `git rev-parse <40-hex>`.
> 
> I took a look at this. It gets a bit hairy. My strategy is to add a flag
> to ask read_loose_refs to create REF_INCOMPLETE values. We currently use
> this flag for loose REF_DIRs to mean "we haven't opendir()'d the
> subdirectory yet". This would extend it to the non-REF_DIR case to mean
> "we haven't opened the loose ref file yet". We'd check REF_INCOMPLETE
> before handing the ref_entry to a callback, and complete it if
> necessary.
> 
> It gets ugly, though, because we need to pass that flag through quite a
> bit of callstack. get_ref_dir() needs to know it, which means all of
> find_containing_dir, etc need it, meaning it pollutes all of the
> packed-refs code paths too.
> 
> I have a half-done patch in this direction if that doesn't sound too
> nasty.

A long time ago I write a patch series to allow incomplete reading of
references, but my version *always* read them lazily, so it was much
simpler (no need to pass a new option down the call stack).  It didn't
seem to speed things up in general, so I never submitted it.

Reading lazily only from particular callers is more complicated, and I
can see how it would get messy.

Given the race avoidance needed between packed/loose references, lazy
reading would mean that after each reference is read, the packed-refs
file would need to be stat()ted again to make sure that it hasn't been
changed since the last check.  I know this isn't an issue for your use
case, because you plan *never* to read the file contents.  But it does
increase the price of lazy reference reading to most callers.

On the other hand, if we ever go in the direction of routing *all*
reference lookups--including lookups of single references--through the
cache, then lazy reading of references probably becomes essential to
avoid populating more of the cache than necessary.

>>> This doesn't correctly handle the rule
>>>
>>> "refs/remotes/%.*s/HEAD"
>> [...]
> 
>> I'll see how painful it is to make it work.
> 
> It's actually reasonably painful. I thought at first we could get away
> with more cleverly parsing the rule, find the prefix (up to the
> placeholder), and then look for the suffix ("/HEAD") inside there. But
> it can never work with the current do_for_each_* code. That code only
> triggers a callback when we see a concrete ref. It _never_ lets the
> callbacks see an intermediate directory.
> 
> So a NO_RECURSE flag is not sufficient to handle this case. I'd need to
> teach do_for_each_ref to recurse based on pathspecs, or a custom
> callback function. And that is getting quite complicated.

Another possibility would be to have an "int recurse" parameter rather
than "bool recurse", telling how many levels to recurse.  Then one could
do a

do_for_each_ref(..., "refs/remotes", ..., recurse=2)

to get all of the refs/remotes/*/HEAD references.  Though since all of
the heads for a remote are also siblings of "refs/remotes/foo/HEAD", it
could still involve a lot of superfluous file reading.  And the integer
wouldn't fit conveniently in the flags parameter.

> I think it might be simpler to just do my own custom traversal. What I
> need is much simpler than what do_for_each_entry provides. I don't need
> recursion, and I don't actually need to look at the loose and packed
> refs together. It's OK for me to do them one at a time because I don't
> care about the actual value; I just want to know about which refs exist.

Yes.  Still, the code is really piling up for this one warning for the
contrived eventuality that somebody wants to pass SHA-1s and branch
names together in a single cat-file invocation *and* wants to pass lots
of inputs at once and so is worried about performance *and* has
reference names that look like SHA-1s.  Otherwise we could just leave
the warning disabled in this case, as now.  Or we could add a new
"--hashes-only" option that tells cat-file to treat all of its
arguments/inputs as SHA-1s; such an option would permit an even faster
code path for bulk callers.

Michael

-- 
Michael Haggerty
mhag...@alum.mit.edu
http://softwareswirl.blogspot.com/
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 4/5] get_sha1: speed up ambiguous 40-hex test

2014-01-14 Thread Jeff King
On Fri, Jan 10, 2014 at 04:41:20AM -0500, Jeff King wrote:

> That being said, we could further optimize this by not opening the files
> at all (and make that the responsibility of do_one_ref, which we are
> avoiding here). I am slightly worried about the open() cost of my
> solution. It's amortized away in a big call, but it is probably
> noticeable for something like `git rev-parse <40-hex>`.

I took a look at this. It gets a bit hairy. My strategy is to add a flag
to ask read_loose_refs to create REF_INCOMPLETE values. We currently use
this flag for loose REF_DIRs to mean "we haven't opendir()'d the
subdirectory yet". This would extend it to the non-REF_DIR case to mean
"we haven't opened the loose ref file yet". We'd check REF_INCOMPLETE
before handing the ref_entry to a callback, and complete it if
necessary.

It gets ugly, though, because we need to pass that flag through quite a
bit of callstack. get_ref_dir() needs to know it, which means all of
find_containing_dir, etc need it, meaning it pollutes all of the
packed-refs code paths too.

I have a half-done patch in this direction if that doesn't sound too
nasty.

> > This doesn't correctly handle the rule
> > 
> > "refs/remotes/%.*s/HEAD"
> [...]

> I'll see how painful it is to make it work.

It's actually reasonably painful. I thought at first we could get away
with more cleverly parsing the rule, find the prefix (up to the
placeholder), and then look for the suffix ("/HEAD") inside there. But
it can never work with the current do_for_each_* code. That code only
triggers a callback when we see a concrete ref. It _never_ lets the
callbacks see an intermediate directory.

So a NO_RECURSE flag is not sufficient to handle this case. I'd need to
teach do_for_each_ref to recurse based on pathspecs, or a custom
callback function. And that is getting quite complicated.

I think it might be simpler to just do my own custom traversal. What I
need is much simpler than what do_for_each_entry provides. I don't need
recursion, and I don't actually need to look at the loose and packed
refs together. It's OK for me to do them one at a time because I don't
care about the actual value; I just want to know about which refs exist.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 4/5] get_sha1: speed up ambiguous 40-hex test

2014-01-10 Thread Jeff King
On Wed, Jan 08, 2014 at 05:09:25PM +0100, Michael Haggerty wrote:

> It's not only racy WRT other processes.  If the current git process
> would create a new reference, it wouldn't be reflected in the cache.
> 
> It's true that the main ref_cache doesn't invalidate itself
> automatically either when a new reference is created, so it's not really
> a fair complaint.  However, as we add places where the cache is
> invalidated, it is easy to overlook this cache that is stuck in static
> variables within a function definition and it is impossible to
> invalidate it.  Might it not be better to attach the cache to the
> ref_cache structure instead, and couple its lifetime to that object?

Yeah, I noticed that we don't ever invalidate the loose ref cache. I
think that's mostly fine, as we rely on resolve_ref to cover the cases
that need to be ordered properly. And in this particular case, it's
"only" a warning (and a rather obscure one, at that), so I think the
stakes are low.

That being said, it should not be hard at all to attach the cache to the
ref_cache. Since we are generated from that cache, the lifetimes should
be the same.

> Alternatively, the cache could be created and managed on the caller
> side, since the caller would know when the cache would have to be
> invalidated.  Also, different callers are likely to have different
> performance characteristics.  It is unlikely that the time to initialize
> the cache will be amortized in most cases; in fact, "rev-list --stdin"
> might be the *only* plausible use case.

The two I know of are "rev-list --stdin" and "cat-file --batch-check".

> Regarding the overall strategy: you gather all refnames that could be
> confused with an SHA-1 into a sha1_array, then later look up SHA-1s in
> the array to see if they are ambiguous.  This is a very special-case
> optimization for SHA-1s.

Yes, it is very sha1-specific. Part of my goal was that in the common
case, the check would collapse to O(# of ambiguous refs), which is
typically 0.

That may be premature optimization, though. As you note below, doing a
few binary searches through the in-memory ref cache is _probably_ fine,
too. And we can do that without a separate index.

> I wonder whether another approach would gain almost the same amount of
> performance but be more general.  We could change dwim_ref() (or a
> version of it?) to read its data out of a ref_cache instead of going to
> disk every time.  Then, at the cost of populating the relevant parts of
> the ref_cache once, we would have fast dwim_ref() calls for all strings.

I'm very nervous about turning dwim_ref into a cache. As we noted above,
we never invalidate the cache, so any write-then-read operations could
get stale data. That is not as risky as caching, say, resolve_ref, but
it still makes me nervous. Caching just the warning has much lower
stakes.

> It's true that the lookups wouldn't be quite so fast--they would require
> a few bisects per refname lookup (one for each level in the refname
> hierarchy) and several refname lookups (one for each ref_rev_parse_rule)
> for every dwim_ref() call, vs. a single bisect in your current design.
> But this approach it would bring us most of the gain, it might
> nevertheless be preferable.

I don't think this would be all that hard to measure. I'll see what I
can do.

> > I wanted to make the ref traversal as cheap as possible, hence the
> > NO_RECURSE flag I added. I thought INCLUDE_BROKEN used to not open up
> > the refs at all, but it looks like it does these days. I wonder if that
> > is worth changing or not.
> 
> What do you mean by "open up the refs"?  The loose reference files are
> read when populating the cache.  (Was that ever different?)

I meant actually open the ref files and read the sha1. But that is me
being dumb. We have always done that, as we must provide the sha1 via
to the for_each_ref callback.

That being said, we could further optimize this by not opening the files
at all (and make that the responsibility of do_one_ref, which we are
avoiding here). I am slightly worried about the open() cost of my
solution. It's amortized away in a big call, but it is probably
noticeable for something like `git rev-parse <40-hex>`.

> This doesn't correctly handle the rule
> 
>   "refs/remotes/%.*s/HEAD"
>
> We might be willing to accept this limitation, but it should at least be
> mentioned somewhere.  OTOH if we want to handle this pattern as well, we
> could do use a technique like that of shorten_unambiguous_ref().

Yes, you're right. I considered this, but for some reason convinced
myself that it would be OK to just look for "refs/remotes/<40-hex>" in
this case (which is what my patch does). But obviously that's wrong.
The ref traversal won't find a _directory_ with that name.

I'll see how painful it is to make it work. I have to say I was tempted
to simply manually write the rules. It's a duplication that could go out
of sync with the ref_rev_parse rules, and that's nasty. But that list 

Re: [PATCH v2 4/5] get_sha1: speed up ambiguous 40-hex test

2014-01-09 Thread Junio C Hamano
Michael Haggerty  writes:

> It's not only racy WRT other processes.  If the current git process
> would create a new reference, it wouldn't be reflected in the cache.
>
> It's true that the main ref_cache doesn't invalidate itself
> automatically either when a new reference is created, so it's not really
> a fair complaint.  However, as we add places where the cache is
> invalidated, it is easy to overlook this cache that is stuck in static
> variables within a function definition and it is impossible to
> invalidate it.  Might it not be better to attach the cache to the
> ref_cache structure instead, and couple its lifetime to that object?
>
> Alternatively, the cache could be created and managed on the caller
> side, since the caller would know when the cache would have to be
> invalidated.  Also, different callers are likely to have different
> performance characteristics.  It is unlikely that the time to initialize
> the cache will be amortized in most cases; in fact, "rev-list --stdin"
> might be the *only* plausible use case.

True.

> Regarding the overall strategy: you gather all refnames that could be
> confused with an SHA-1 into a sha1_array, then later look up SHA-1s in
> the array to see if they are ambiguous.  This is a very special-case
> optimization for SHA-1s.
>
> I wonder whether another approach would gain almost the same amount of
> performance but be more general.  We could change dwim_ref() (or a
> version of it?) to read its data out of a ref_cache instead of going to
> disk every time.  Then, at the cost of populating the relevant parts of
> the ref_cache once, we would have fast dwim_ref() calls for all strings.

If opendir-readdir to grab only the names (but not values) of many
refs is a lot faster than stat-open-read a handful of dwim-ref
locations for a given name, that optimization might be worthwhile,
but I think that requires an update to read_loose_refs() not to
read_ref_full() and the users of refs API to instead lazily resolve
the refs, no?

If I ask for five names (say 'maint', 'master', 'next', 'pu',
'jch'), the current code will do 5 dwim_ref()s, each of which will
consult 6 locations with resolve_ref_unsafe(), totalling 30 calls to
resolve_ref_unsafe(), each of which in turn is essentially an open
followed by either an return on ENOENT or a read.  So 30 opens and 5
reads in total.

With your lazy ref_cache scheme, instead we would enumerate all the
loose ones in the same 6 directories (e.g. refs/tags/, refs/heads),
so 6 opendir()s with as many readdir()s as I have loose refs, plus
we open-read them in read_loose_refs() called from get_ref_dir()
with the current ref_cache code.  For me, "find .git/refs/heads"
gives 500+ lines of output, which suggests that using the ref_cache
mechanism for dwim_ref() may not be a huge win, unless it is updated
to be extremely lazy, and readdir()s turns out to be extremely less
heavier than open-read.  Also it is unlikely that the cost to
initialize the cache is amortized to be a net win unless we are
dealing with tons of dwim_ref()s.

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 4/5] get_sha1: speed up ambiguous 40-hex test

2014-01-08 Thread Michael Haggerty
On 01/08/2014 12:59 AM, Jeff King wrote:
> Since 798c35f (get_sha1: warn about full or short object
> names that look like refs, 2013-05-29), a 40-hex sha1 causes
> us to call dwim_ref on the result, on the off chance that we
> have a matching ref. This can cause a noticeable slow-down
> when there are a large number of objects.  E.g., on
> linux.git:
> 
>   [baseline timing]
>   $ best-of-five git rev-list --all --pretty=raw
>   real0m3.996s
>   user0m3.900s
>   sys 0m0.100s
> 
>   [same thing, but calling get_sha1 on each commit from stdin]
>   $ git rev-list --all >commits
>   $ best-of-five -i commits git rev-list --stdin --pretty=raw
>   real0m7.862s
>   user0m6.108s
>   sys 0m1.760s
> 
> The problem is that each call to dwim_ref individually stats
> the possible refs in refs/heads, refs/tags, etc. In the
> common case, there are no refs that look like sha1s at all.
> We can therefore do the same check much faster by loading
> all ambiguous-looking candidates once, and then checking our
> index for each object.
> 
> This is technically more racy (somebody might create such a
> ref after we build our index), but that's OK, as it's just a
> warning (and we provide no guarantees about whether a
> simultaneous process ran before or after the ref was created
> anyway).

It's not only racy WRT other processes.  If the current git process
would create a new reference, it wouldn't be reflected in the cache.

It's true that the main ref_cache doesn't invalidate itself
automatically either when a new reference is created, so it's not really
a fair complaint.  However, as we add places where the cache is
invalidated, it is easy to overlook this cache that is stuck in static
variables within a function definition and it is impossible to
invalidate it.  Might it not be better to attach the cache to the
ref_cache structure instead, and couple its lifetime to that object?

Alternatively, the cache could be created and managed on the caller
side, since the caller would know when the cache would have to be
invalidated.  Also, different callers are likely to have different
performance characteristics.  It is unlikely that the time to initialize
the cache will be amortized in most cases; in fact, "rev-list --stdin"
might be the *only* plausible use case.

Regarding the overall strategy: you gather all refnames that could be
confused with an SHA-1 into a sha1_array, then later look up SHA-1s in
the array to see if they are ambiguous.  This is a very special-case
optimization for SHA-1s.

I wonder whether another approach would gain almost the same amount of
performance but be more general.  We could change dwim_ref() (or a
version of it?) to read its data out of a ref_cache instead of going to
disk every time.  Then, at the cost of populating the relevant parts of
the ref_cache once, we would have fast dwim_ref() calls for all strings.

It's true that the lookups wouldn't be quite so fast--they would require
a few bisects per refname lookup (one for each level in the refname
hierarchy) and several refname lookups (one for each ref_rev_parse_rule)
for every dwim_ref() call, vs. a single bisect in your current design.
But this approach it would bring us most of the gain, it might
nevertheless be preferable.

> Here is the time after this patch, which implements the
> strategy described above:
> 
>   $ best-of-five -i commits git rev-list --stdin --pretty=raw
>   real0m4.966s
>   user0m4.776s
>   sys 0m0.192s
> 
> We still pay some price to read the commits from stdin, but
> notice the system time is much lower, as we are avoiding
> hundreds of thousands of stat() calls.
> 
> Signed-off-by: Jeff King 
> ---
> I wanted to make the ref traversal as cheap as possible, hence the
> NO_RECURSE flag I added. I thought INCLUDE_BROKEN used to not open up
> the refs at all, but it looks like it does these days. I wonder if that
> is worth changing or not.

What do you mean by "open up the refs"?  The loose reference files are
read when populating the cache.  (Was that ever different?)  But the
call to ref_resolves_to_object() in do_one_ref() is skipped when the
INCLUDE_BROKEN flag is used.

> 
>  refs.c  | 47 +++
>  refs.h  |  2 ++
>  sha1_name.c |  4 +---
>  3 files changed, 50 insertions(+), 3 deletions(-)
> 
> diff --git a/refs.c b/refs.c
> index ca854d6..cddd871 100644
> --- a/refs.c
> +++ b/refs.c
> @@ -4,6 +4,7 @@
>  #include "tag.h"
>  #include "dir.h"
>  #include "string-list.h"
> +#include "sha1-array.h"
>  
>  /*
>   * Make sure "ref" is something reasonable to have under ".git/refs/";
> @@ -2042,6 +2043,52 @@ int dwim_log(const char *str, int len, unsigned char 
> *sha1, char **log)
>   return logs_found;
>  }
>  
> +static int check_ambiguous_sha1_ref(const char *refname,
> + const unsigned char *sha1,
> + int flags,
> + void *data)
> +{

[PATCH v2 4/5] get_sha1: speed up ambiguous 40-hex test

2014-01-07 Thread Jeff King
Since 798c35f (get_sha1: warn about full or short object
names that look like refs, 2013-05-29), a 40-hex sha1 causes
us to call dwim_ref on the result, on the off chance that we
have a matching ref. This can cause a noticeable slow-down
when there are a large number of objects.  E.g., on
linux.git:

  [baseline timing]
  $ best-of-five git rev-list --all --pretty=raw
  real0m3.996s
  user0m3.900s
  sys 0m0.100s

  [same thing, but calling get_sha1 on each commit from stdin]
  $ git rev-list --all >commits
  $ best-of-five -i commits git rev-list --stdin --pretty=raw
  real0m7.862s
  user0m6.108s
  sys 0m1.760s

The problem is that each call to dwim_ref individually stats
the possible refs in refs/heads, refs/tags, etc. In the
common case, there are no refs that look like sha1s at all.
We can therefore do the same check much faster by loading
all ambiguous-looking candidates once, and then checking our
index for each object.

This is technically more racy (somebody might create such a
ref after we build our index), but that's OK, as it's just a
warning (and we provide no guarantees about whether a
simultaneous process ran before or after the ref was created
anyway).

Here is the time after this patch, which implements the
strategy described above:

  $ best-of-five -i commits git rev-list --stdin --pretty=raw
  real0m4.966s
  user0m4.776s
  sys 0m0.192s

We still pay some price to read the commits from stdin, but
notice the system time is much lower, as we are avoiding
hundreds of thousands of stat() calls.

Signed-off-by: Jeff King 
---
I wanted to make the ref traversal as cheap as possible, hence the
NO_RECURSE flag I added. I thought INCLUDE_BROKEN used to not open up
the refs at all, but it looks like it does these days. I wonder if that
is worth changing or not.

 refs.c  | 47 +++
 refs.h  |  2 ++
 sha1_name.c |  4 +---
 3 files changed, 50 insertions(+), 3 deletions(-)

diff --git a/refs.c b/refs.c
index ca854d6..cddd871 100644
--- a/refs.c
+++ b/refs.c
@@ -4,6 +4,7 @@
 #include "tag.h"
 #include "dir.h"
 #include "string-list.h"
+#include "sha1-array.h"
 
 /*
  * Make sure "ref" is something reasonable to have under ".git/refs/";
@@ -2042,6 +2043,52 @@ int dwim_log(const char *str, int len, unsigned char 
*sha1, char **log)
return logs_found;
 }
 
+static int check_ambiguous_sha1_ref(const char *refname,
+   const unsigned char *sha1,
+   int flags,
+   void *data)
+{
+   unsigned char tmp_sha1[20];
+   if (strlen(refname) == 40 && !get_sha1_hex(refname, tmp_sha1))
+   sha1_array_append(data, tmp_sha1);
+   return 0;
+}
+
+static void build_ambiguous_sha1_ref_index(struct sha1_array *idx)
+{
+   const char **rule;
+
+   for (rule = ref_rev_parse_rules; *rule; rule++) {
+   const char *prefix = *rule;
+   const char *end = strstr(prefix, "%.*s");
+   char *buf;
+
+   if (!end)
+   continue;
+
+   buf = xmemdupz(prefix, end - prefix);
+   do_for_each_ref(&ref_cache, buf, check_ambiguous_sha1_ref,
+   end - prefix,
+   DO_FOR_EACH_INCLUDE_BROKEN |
+   DO_FOR_EACH_NO_RECURSE,
+   idx);
+   free(buf);
+   }
+}
+
+int sha1_is_ambiguous_with_ref(const unsigned char *sha1)
+{
+   struct sha1_array idx = SHA1_ARRAY_INIT;
+   static int loaded;
+
+   if (!loaded) {
+   build_ambiguous_sha1_ref_index(&idx);
+   loaded = 1;
+   }
+
+   return sha1_array_lookup(&idx, sha1) >= 0;
+}
+
 static struct ref_lock *lock_ref_sha1_basic(const char *refname,
const unsigned char *old_sha1,
int flags, int *type_p)
diff --git a/refs.h b/refs.h
index 87a1a79..c7d5f89 100644
--- a/refs.h
+++ b/refs.h
@@ -229,4 +229,6 @@ int update_refs(const char *action, const struct ref_update 
**updates,
 extern int parse_hide_refs_config(const char *var, const char *value, const 
char *);
 extern int ref_is_hidden(const char *);
 
+int sha1_is_ambiguous_with_ref(const unsigned char *sha1);
+
 #endif /* REFS_H */
diff --git a/sha1_name.c b/sha1_name.c
index a5578f7..f83ecb7 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -452,13 +452,11 @@ static int get_sha1_basic(const char *str, int len, 
unsigned char *sha1)
 
if (len == 40 && !get_sha1_hex(str, sha1)) {
if (warn_ambiguous_refs && warn_on_object_refname_ambiguity) {
-   refs_found = dwim_ref(str, len, tmp_sha1, &real_ref);
-   if (refs_found > 0) {
+   if (sha1_is_ambiguous_with_ref(sha1)) {
warning(warn_msg,