Re: [PATCH 2/2] lookup_commit_reference_gently: do not read non-{tag,commit}
Thomas Rast wrote:

> diff --git a/commit.c b/commit.c
> index 888e02a..00e8d4a 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -31,8 +31,12 @@ static struct commit *check_commit(struct object *obj,
>  struct commit *lookup_commit_reference_gently(const unsigned char *sha1,
>  					      int quiet)
>  {
> -	struct object *obj = deref_tag(parse_object(sha1), NULL, 0);
> -
> +	struct object *obj;
> +	int type = sha1_object_info(sha1, NULL);
> +
> +	/* If it's neither tag nor commit, parsing the object is wasted effort */
> +	if (type != OBJ_TAG && type != OBJ_COMMIT)
> +		return NULL;
> +	obj = deref_tag(parse_object(sha1), NULL, 0);
>  	if (!obj)
>  		return NULL;
>  	return check_commit(obj, sha1, quiet);

As Jeff points out, you've introduced an extra sha1_object_info() call in
the common case of a tag (which derefs into a commit anyway) or a commit,
slowing things down. So my main doubt centres around how
sha1_object_info() determines the type of the object without actually
parsing it. You have to open up the file and look at the fields near the
top, no? (Or fall back to blob, failing that.) As I read it:

1. It calls sha1_loose_object_info() or sha1_packed_object_info(),
   depending on whether the particular object is in a pack or not. Let's
   see what is common between them.

2. The loose counterpart calls unpack_sha1_header() after mmap'ing the
   file. This ultimately ends up calling unpack_object_header_buffer(),
   which is also what the packed counterpart calls.

3. I didn't understand what unpack_object_header_buffer() is doing.
   And'ing with some magic 0x80 and shifting by some bits iteratively?
   type = (c >> 4) & 7?

In contrast, parse_object() first calls lookup_object() to look it up in
some hashtable to get the type -- the packfile idx, presumably? Why
don't you also do that instead of sha1_object_info()? Or why don't you
wrap parse_object() in an API that doesn't go beyond the first blob
check (and not execute parse_object_buffer())?

Also, does this patch fix the bug Alex reported?
Apologies if I've misunderstood something horribly (which does seem to
be the case). Thanks.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] lookup_commit_reference_gently: do not read non-{tag,commit}
Jeff King <p...@peff.net> writes:

> On Thu, May 30, 2013 at 10:00:23PM +0200, Thomas Rast wrote:
>> lookup_commit_reference_gently unconditionally parses the object
>> given to it. This slows down git-describe a lot if you have a
>> repository with large tagged blobs in it: parse_object() will read
>> the entire blob and verify that its sha1 matches, only to then throw
>> it away.
>>
>> Speed it up by checking the type with sha1_object_info() prior to
>> unpacking.
>
> This would speed up the case where we do not end up looking at the
> object at all, but it will slow down the (presumably common) case
> where we will in fact find a commit and end up parsing the object
> anyway. Have you measured the impact of this on normal operations?
> During a traversal, we spend a measurable amount of time looking up
> commits in packfiles, and this would presumably double it.

I don't think so, but admittedly I didn't measure it. The reason why
it's unlikely is that this is specific to
lookup_commit_reference_gently, which according to some grepping is
usually done on refs or values that refs might have; e.g. on the
old/new sides of a fetch in remote.c, or in many places in the callback
of some variant of for_each_ref.

Of course if you have a ridiculously large number of refs (and I gather
_you_ do), this will hurt somewhat in the usual case, but speed up the
case where there is a ref (usually a lightweight tag) directly pointing
at a large blob.

I'm not sure this can be fixed without the change you outline here:

> This is not the first time I have seen this tradeoff in git. It would
> be nice if our object access was structured to do incremental
> examination of the objects (i.e., store the packfile index lookup or
> partial unpack of a loose object header, and then use that to
> complete the next step of actually getting the contents).

But in any case I see the point; I should try and gather some
performance numbers.
--
Thomas Rast
trast@{inf,student}.ethz.ch
Re: [PATCH 2/2] lookup_commit_reference_gently: do not read non-{tag,commit}
Ramkumar Ramachandra <artag...@gmail.com> writes:

> Thomas Rast wrote:
>> +	struct object *obj;
>> +	int type = sha1_object_info(sha1, NULL);
>> +
>> +	/* If it's neither tag nor commit, parsing the object is wasted effort */
>> +	if (type != OBJ_TAG && type != OBJ_COMMIT)
>> +		return NULL;
>> +	obj = deref_tag(parse_object(sha1), NULL, 0);
> [...]
> In contrast, parse_object() first calls lookup_object() to look it up
> in some hashtable to get the type -- the packfile idx, presumably?
> Why don't you also do that instead of sha1_object_info()? Or why
> don't you wrap parse_object() in an API that doesn't go beyond the
> first blob check (and not execute parse_object_buffer())?
>
> Also, does this patch fix the bug Alex reported? Apologies if I've
> misunderstood something horribly (which does seem to be the case).

Yes, it does fix the bug. (It's not really buggy, just slow.)

However, you implicitly point out an important point: if we have the
object, and it was already parsed (obj->parsed is set), parse_object()
is essentially free. But sha1_object_info() is not; it will in
particular unconditionally dig through long delta chains to discover
the base type of an object that has already been unpacked.

As for your original questions: lookup_object() is "do we have it in
our big object hashtable?" -- the one that holds many[1] objects, that
Peff recently sped up.
sha1_object_info() and read_object() are in many ways parallel
functions that do approximately the following:

  check all pack indexes for this object
  if we found a hit:
    attempt to unpack by recursively going through deltas
    (for _info, no need to unpack, but we still go through the delta
    chain because the type of an object is determined by the innermost
    delta base)
  try to load it as a loose object
  it could have been repacked and pruned while we were looking, so:
    reload pack index information
    try the packs again (search indexes, then unpack)
  complain

[1] blobs in particular are frequently not stored in that hash table,
because it is an insert-only table

--
Thomas Rast
trast@{inf,student}.ethz.ch
Re: [PATCH 2/2] lookup_commit_reference_gently: do not read non-{tag,commit}
On Fri, May 31, 2013 at 10:08:06AM +0200, Thomas Rast wrote:

>> Have you measured the impact of this on normal operations? During a
>> traversal, we spend a measurable amount of time looking up commits
>> in packfiles, and this would presumably double it.
>
> I don't think so, but admittedly I didn't measure it. The reason why
> it's unlikely is that this is specific to
> lookup_commit_reference_gently, which according to some grepping is
> usually done on refs or values that refs might have; e.g. on the
> old/new sides of a fetch in remote.c, or in many places in the
> callback of some variant of for_each_ref.

Yeah, I saw that the _gently form backs some of the other forms
(non-gently, lookup_commit_or_die) and was worried that we would use it
as part of the revision traversal to find parents. But we don't, of
course; we use lookup_commit, because we would not accept a parent that
is a tag pointing to a commit. So I think it probably won't matter in
any sane case.

> Of course if you have a ridiculously large number of refs (and I
> gather _you_ do), this will hurt somewhat in the usual case, but
> speed up the case where there is a ref (usually a lightweight tag)
> directly pointing at a large blob.

In my large-ref cases, there are often a lot of duplicate refs anyway
(e.g., many forks of a project having the same tags). So usually the
right thing there is to use lookup_object to see if we have the object
already anyway. parse_object has this optimization, but we can add it
into sha1_object_info, too, if it turns out to be a problem.

-Peff
Re: [PATCH 2/2] lookup_commit_reference_gently: do not read non-{tag,commit}
On Thu, May 30, 2013 at 10:00:23PM +0200, Thomas Rast wrote:

> lookup_commit_reference_gently unconditionally parses the object
> given to it. This slows down git-describe a lot if you have a
> repository with large tagged blobs in it: parse_object() will read
> the entire blob and verify that its sha1 matches, only to then throw
> it away.
>
> Speed it up by checking the type with sha1_object_info() prior to
> unpacking.

This would speed up the case where we do not end up looking at the
object at all, but it will slow down the (presumably common) case where
we will in fact find a commit and end up parsing the object anyway.
Have you measured the impact of this on normal operations? During a
traversal, we spend a measurable amount of time looking up commits in
packfiles, and this would presumably double it.

This is not the first time I have seen this tradeoff in git. It would
be nice if our object access was structured to do incremental
examination of the objects (i.e., store the packfile index lookup or
partial unpack of a loose object header, and then use that to complete
the next step of actually getting the contents).

-Peff
Re: [PATCH 2/2] lookup_commit_reference_gently: do not read non-{tag,commit}
On Fri, May 31, 2013 at 4:22 AM, Jeff King <p...@peff.net> wrote:
> On Thu, May 30, 2013 at 10:00:23PM +0200, Thomas Rast wrote:
>> lookup_commit_reference_gently unconditionally parses the object
>> given to it. This slows down git-describe a lot if you have a
>> repository with large tagged blobs in it: parse_object() will read
>> the entire blob and verify that its sha1 matches, only to then
>> throw it away.
>>
>> Speed it up by checking the type with sha1_object_info() prior to
>> unpacking.
>
> This would speed up the case where we do not end up looking at the
> object at all, but it will slow down the (presumably common) case
> where we will in fact find a commit and end up parsing the object
> anyway.

Perhaps turn quiet into a bitmap and only let git-describe do this?
--
Duy