Re: [PATCH 2/2] lookup_commit_reference_gently: do not read non-{tag,commit}

2013-05-31 Thread Ramkumar Ramachandra
Thomas Rast wrote:
 diff --git a/commit.c b/commit.c
 index 888e02a..00e8d4a 100644
 --- a/commit.c
 +++ b/commit.c
 @@ -31,8 +31,12 @@ static struct commit *check_commit(struct object *obj,
  struct commit *lookup_commit_reference_gently(const unsigned char *sha1,
   int quiet)
  {
 -   struct object *obj = deref_tag(parse_object(sha1), NULL, 0);
 -
 +   struct object *obj;
 +   int type = sha1_object_info(sha1, NULL);
 +   /* If it's neither tag nor commit, parsing the object is wasted 
 effort */
 +   if (type != OBJ_TAG  type != OBJ_COMMIT)
 +   return NULL;
 +   obj = deref_tag(parse_object(sha1), NULL, 0);
 if (!obj)
 return NULL;
 return check_commit(obj, sha1, quiet);

As Jeff points out, you've introduced an extra sha1_object_info() call
in the common case of tag (which derefs into a commit anyway) and
commit slowing things down.

So, my main doubt centres around how sha1_object_info() determines the
type of the object without actually parsing it.  You have to open up
the file and look at the fields near the top, no? (or fallback to blob
failing that).  I am reading it:

1. It calls sha1_loose_object_info() or sha1_packed_object_info(),
depending on whether the particular file is in-pack or not.  Lets see
what is common between them.

2. The loose counterpart seems to call unpack_sha1_header() after
mmap'ing the file.  This ultimately ends up calling
unpack_object_header_buffer(), which is also what the packed
counterpart calls.

3. I didn't understand what unpack_object_header_buffer() is doing.
And'ing with some magic 0x80 and shifting by 4 bits iteratively? type
= (c  4)  7?

In contrast, parse_object() first calls lookup_object() to look it up
in some hashtable to get the type -- the packfile idx, presumably?
Why don't you also do that instead of sha1_object_info()?  Or, why
don't you wrap parse_object() in an API that doesn't go beyond the
first blob check (and not execute parse_object_buffer())?

Also, does this patch fix the bug Alex reported?

Apologies if I've misunderstood something horribly (which does seem to
be the case).

Thanks.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] lookup_commit_reference_gently: do not read non-{tag,commit}

2013-05-31 Thread Thomas Rast
Jeff King p...@peff.net writes:

 On Thu, May 30, 2013 at 10:00:23PM +0200, Thomas Rast wrote:

 lookup_commit_reference_gently unconditionally parses the object given
 to it.  This slows down git-describe a lot if you have a repository
 with large tagged blobs in it: parse_object() will read the entire
 blob and verify that its sha1 matches, only to then throw it away.
 
 Speed it up by checking the type with sha1_object_info() prior to
 unpacking.

 This would speed up the case where we do not end up looking at the
 object at all, but it will slow down the (presumably common) case where
 we will in fact find a commit and end up parsing the object anyway.

 Have you measured the impact of this on normal operations? During a
 traversal, we spend a measurable amount of time looking up commits in
 packfiles, and this would presumably double it.

I don't think so, but admittedly I didn't measure it.

The reason why it's unlikely is that this is specific to
lookup_commit_reference_gently, which according to some grepping is
usually done on refs or values that refs might have; e.g. on the oldnew
sides of a fetch in remote.c, or in many places in the callback of some
variant of for_each_ref.

Of course if you have a ridiculously large number of refs (and I gather
_you_ do), this will hurt somewhat in the usual case, but speed up the
case where there is a ref (usually a lightweight tag) directly pointing
at a large blob.

I'm not sure this can be fixed without the change you outline here:

 This is not the first time I have seen this tradeoff in git.  It would
 be nice if our object access was structured to do incremental
 examination of the objects (i.e., store the packfile index lookup or
 partial unpack of a loose object header, and then use that to complete
 the next step of actually getting the contents).

But in any case I see the point, I should try and gather some
performance numbers.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] lookup_commit_reference_gently: do not read non-{tag,commit}

2013-05-31 Thread Thomas Rast
Ramkumar Ramachandra artag...@gmail.com writes:

 Thomas Rast wrote:
 +   struct object *obj;
 +   int type = sha1_object_info(sha1, NULL);
 +   /* If it's neither tag nor commit, parsing the object is wasted 
 effort */
 +   if (type != OBJ_TAG  type != OBJ_COMMIT)
 +   return NULL;
 +   obj = deref_tag(parse_object(sha1), NULL, 0);
[...]
 In contrast, parse_object() first calls lookup_object() to look it up
 in some hashtable to get the type -- the packfile idx, presumably?
 Why don't you also do that instead of sha1_object_info()?  Or, why
 don't you wrap parse_object() in an API that doesn't go beyond the
 first blob check (and not execute parse_object_buffer())?

 Also, does this patch fix the bug Alex reported?

 Apologies if I've misunderstood something horribly (which does seem to
 be the case).

Yes, it does fix the bug.  (It's not really buggy, just slow.)

However, you implicitly point out an important point: If we have the
object, and it was already parsed (obj-parsed is set), parse_object()
is essentially free.  But sha1_object_info is not, it will in particular
unconditionally dig through long delta chains to discover the base type
of an object that has already been unpacked.


As for your original questions: lookup_object() is do we have it in our
big object hashtable? -- the one that holds many[1] objects, that Peff
recently sped up.

sha1_object_info() and read_object() are in many ways parallel functions
that do approximately the following:

  check all pack indexes for this object
  if we found a hit:
attempt to unpack by recursively going through deltas
(for _info, no need to unpack, but we still go through the delta
chain because the type of object is determined by the innermost
delta base)
  try to load it as a loose object
  it could have been repacked and pruned while we were looking, so:
reload pack index information
try the packs again (search indexes, then unpack)
  complain


[1]  blobs in particular are frequently not stored in that hash table,
because it is an insert-only table

-- 
Thomas Rast
trast@{inf,student}.ethz.ch
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] lookup_commit_reference_gently: do not read non-{tag,commit}

2013-05-31 Thread Jeff King
On Fri, May 31, 2013 at 10:08:06AM +0200, Thomas Rast wrote:

  Have you measured the impact of this on normal operations? During a
  traversal, we spend a measurable amount of time looking up commits in
  packfiles, and this would presumably double it.
 
 I don't think so, but admittedly I didn't measure it.
 
 The reason why it's unlikely is that this is specific to
 lookup_commit_reference_gently, which according to some grepping is
 usually done on refs or values that refs might have; e.g. on the oldnew
 sides of a fetch in remote.c, or in many places in the callback of some
 variant of for_each_ref.

Yeah, I saw that the _gently form backs some of the other forms
(non-gently, lookup_commit_or_die) and was worried that we would use it
as part of the revision traversal to find parents. But we don't, of
course; we use lookup_commit, because we would not accept a parent that
is a tag pointing to a commit.

So I think it probably won't matter in any sane case.

 Of course if you have a ridiculously large number of refs (and I gather
 _you_ do), this will hurt somewhat in the usual case, but speed up the
 case where there is a ref (usually a lightweight tag) directly pointing
 at a large blob.

In my large-ref cases, there are often a lot of duplicate refs anyway
(e.g., many forks of a project having the same tags). So usually the
right thing there is to use lookup_object to see if we have the object
already anyway. parse_object has this optimization, but we can add it
into sha1_object_info, too, if it turns out to be a problem.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] lookup_commit_reference_gently: do not read non-{tag,commit}

2013-05-30 Thread Jeff King
On Thu, May 30, 2013 at 10:00:23PM +0200, Thomas Rast wrote:

 lookup_commit_reference_gently unconditionally parses the object given
 to it.  This slows down git-describe a lot if you have a repository
 with large tagged blobs in it: parse_object() will read the entire
 blob and verify that its sha1 matches, only to then throw it away.
 
 Speed it up by checking the type with sha1_object_info() prior to
 unpacking.

This would speed up the case where we do not end up looking at the
object at all, but it will slow down the (presumably common) case where
we will in fact find a commit and end up parsing the object anyway.

Have you measured the impact of this on normal operations? During a
traversal, we spend a measurable amount of time looking up commits in
packfiles, and this would presumably double it.

This is not the first time I have seen this tradeoff in git.  It would
be nice if our object access was structured to do incremental
examination of the objects (i.e., store the packfile index lookup or
partial unpack of a loose object header, and then use that to complete
the next step of actually getting the contents).

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] lookup_commit_reference_gently: do not read non-{tag,commit}

2013-05-30 Thread Duy Nguyen
On Fri, May 31, 2013 at 4:22 AM, Jeff King p...@peff.net wrote:
 On Thu, May 30, 2013 at 10:00:23PM +0200, Thomas Rast wrote:

 lookup_commit_reference_gently unconditionally parses the object given
 to it.  This slows down git-describe a lot if you have a repository
 with large tagged blobs in it: parse_object() will read the entire
 blob and verify that its sha1 matches, only to then throw it away.

 Speed it up by checking the type with sha1_object_info() prior to
 unpacking.

 This would speed up the case where we do not end up looking at the
 object at all, but it will slow down the (presumably common) case where
 we will in fact find a commit and end up parsing the object anyway.

Perhaps turn quiet into a bitmap and only let git-describe do this?
--
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html