Re: [PATCH 5/5] sha1_file: don't re-scan pack directory for null sha1

2017-11-24 Thread Junio C Hamano
Jeff King  writes:

> So I dunno. That is far from exhaustive, but it does seem like most
> cases should assume the presence of the file.

You are right.  I should have considered that "we got a random
object-name-looking thing and we do not care if it does not exist"
is a very narrow minority case.  Most of the object names we deal
with come from local repository traversal, and unless things like
"fsck-promisor" come into the picture, we should always have them
available locally.

> But it may not be that big a deal. For the most part, all bets are off
> in a corrupt repo. My main concern is just that we do not want the
> corruption to spread or to make it harder for us to recover from (and
> that includes allowing us to delete or overwrite other data that would
> otherwise be forbidden, which is what's happening in the fetch case).

Absolutely.

Thanks.


Re: [PATCH 5/5] sha1_file: don't re-scan pack directory for null sha1

2017-11-24 Thread Jeff King
On Thu, Nov 23, 2017 at 11:35:21AM +0900, Junio C Hamano wrote:

> Not limiting us to the caller in the "fetch" codepath, I think the
> expectation by callers of lookup_commit_reference_gently() in the
> ideal world would be:
> 
>  - It has an object name, and wants to use it as a point in the commit
>    DAG to define the traversal over the DAG, if it refers to a
>    commit known to us.
> 
>  - It does not know if these object names represent a tag object, a
>    commit object, or some other object.  It does not know if the
>    local repository actually has them (e.g. we received a "have"
>    from the other side---missing is expected).
> 
>  - Hence, it would happily accept a NULL as either "we do not have it"
>    or "we do have it, but it is not a commit-ish".
> 
> And from that point of view, having 2, 3a (missing), and 4 (0{40}) yield
> NULL is perfectly fine.  3b (exists but broken) may be a noteworthy
> event, but for the purpose of the caller, it may want to proceed as
> if the object is missing from our end, so it might deserve warning()
> but not die(), at least as the default behaviour.

Hmm, yeah, I did not differentiate 3a and 3b in my analysis. How we'd
want to handle "missing" would vary from caller to caller, I think.
Certainly for this case in "fetch" where it's the "old" value for a ref,
it would be a corruption not to have it. Just grepping around a few of
the other callers, I see:

  - archive.c:parse_treeish_arg: fine not to have it (we'd complain soon
after that it doesn't point to a tree either). But also fine to
complain hard.

  - blame.c:dwim_reverse_initial, and checkout_paths and switch_branches
in checkout.c: missing object would mean a broken HEAD

  - show-branch.c:append_ref: missing would mean a broken ref

  - clear_commit_marks_for_object_array: conceptually OK to have a
missing object, though I suspect practically impossible since we
examined and marked the objects earlier

  - ref-filter's --merged, --contains, etc: the low-level code quietly
ignores a missing object or non-commit, but the command-line parser
enforces that we find a commit. So probably impossible to trigger,
but arguably the second call should be a BUG().

So I dunno. That is far from exhaustive, but it does seem like most
cases should assume the presence of the file.

As for your proposed behavior:

> And from that point of view, having 2, 3a (missing), and 4 (0{40}) yield
> NULL is perfectly fine.  3b (exists but broken) may be a noteworthy
> event, but for the purpose of the caller, it may want to proceed as
> if the object is missing from our end, so it might deserve warning()
> but not die(), at least as the default behaviour.

That's actually not far from what happens now, with the only difference
being that we _do_ actually die() on a corruption (really any error
except ENOENT). I forgot that when I wrote my earlier message. You can
see it by updating the "fetch" reproduction I sent earlier to corrupt:

-- >8 --
git init parent
git -C parent commit --allow-empty -m base

git clone parent child
git -C parent commit --allow-empty -m more

cd child
for i in .git/objects/??/*
do
chmod +w $i
echo corrupt >$i
done
git fetch
-- 8< --

which gives something like:

error: inflate: data stream error (incorrect header check)
error: unable to unpack 55c66a401fd4190382f9cd8b70c11f9f5adb044e header
fatal: loose object 55c66a401fd4190382f9cd8b70c11f9f5adb044e (stored in .git/objects/55/c66a401fd4190382f9cd8b70c11f9f5adb044e) is corrupt

So that's good. I do still think that many of the callers of
lookup_commit_reference_gently() probably ought to die() in the
"missing" case rather than continue (and produce subtly wrong answers).
But it may not be that big a deal. For the most part, all bets are off
in a corrupt repo. My main concern is just that we do not want the
corruption to spread or to make it harder for us to recover from (and
that includes allowing us to delete or overwrite other data that would
otherwise be forbidden, which is what's happening in the fetch case).
Most of the callers probably don't fall into that situation, but it
might be nice to err on the side of caution.

-Peff


Re: [PATCH 5/5] sha1_file: don't re-scan pack directory for null sha1

2017-11-22 Thread Junio C Hamano
Jeff King  writes:

> So there are four cases we care about for this call in fetch:
>
>   1. We fed a real sha1 and got a commit (or peeled to one).
>
>   2. We fed a real sha1 which resolved to a non-commit, and we got NULL.
>
>   3. We fed a real sha1 and the object was missing or corrupted, and we
>  got NULL.
>
>   4. We fed a null sha1 and got NULL.
>
> Right now we lump cases 2-4 together as "do not do a fast-forward
> check". That's fine for 2 and 4, but probably not for 3. We can easily
> catch case 4 ourselves (if we care to), but distinguishing case 3 from
> the others is hard. How should lookup_commit_reference_gently() signal
> it to us?

Not limiting us to the caller in the "fetch" codepath, I think the
expectation by callers of lookup_commit_reference_gently() in the
ideal world would be:

 - It has an object name, and wants to use it as a point in the commit
   DAG to define the traversal over the DAG, if it refers to a
   commit known to us.

 - It does not know if these object names represent a tag object, a
   commit object, or some other object.  It does not know if the
   local repository actually has them (e.g. we received a "have"
   from the other side---missing is expected).

 - Hence, it would happily accept a NULL as either "we do not have it"
   or "we do have it, but it is not a commit-ish".

And from that point of view, having 2, 3a (missing), and 4 (0{40}) yield
NULL is perfectly fine.  3b (exists but broken) may be a noteworthy
event, but for the purpose of the caller, it may want to proceed as
if the object is missing from our end, so it might deserve warning()
but not die(), at least as the default behaviour.


Re: [PATCH 5/5] sha1_file: don't re-scan pack directory for null sha1

2017-11-22 Thread Jeff King
On Wed, Nov 22, 2017 at 10:42:30AM +0900, Junio C Hamano wrote:

> Jeff King  writes:
> 
> > I'm not sure what the right behavior is, but I'm pretty sure that's not
> > it. Probably one of:
> >
> >   - skip updating the ref when we see the breakage
> >
> >   - ditto, but terminate the whole operation, since we might be deleting
> > other refs and in a broken repo we're probably best to make as few
> > changes as possible
> >
> >   - behave as if it was a non-ff, which would allow "--force" to
> > overwrite the broken ref. Maybe convenient for fixing things, but
> > possibly surprising (and it's not that hard to just delete the
> > broken refs manually before proceeding).
> 
> Perhaps the last one would be the ideal endgame, but the second one
> may be a good stopping point in the shorter term.

This turns out to be a lot trickier than I expected. The crux of the
matter is that the case we care about is hidden inside
lookup_commit_reference_gently(), which doesn't distinguish between
corruption and "not a commit".

So there are four cases we care about for this call in fetch:

  1. We fed a real sha1 and got a commit (or peeled to one).

  2. We fed a real sha1 which resolved to a non-commit, and we got NULL.

  3. We fed a real sha1 and the object was missing or corrupted, and we
 got NULL.

  4. We fed a null sha1 and got NULL.

Right now we lump cases 2-4 together as "do not do a fast-forward
check". That's fine for 2 and 4, but probably not for 3. We can easily
catch case 4 ourselves (if we care to), but distinguishing case 3 from
the others is hard. How should lookup_commit_reference_gently() signal
it to us?

Or should lookup_commit_reference_gently() die on corruption? That's not
very "gentle", but I think the "gently" here is really about "it might
not be a commit", not "the repo might be corrupted". But I think even
that may be the tip of the iceberg. The next thing we do is feed the
commits to in_merge_bases(), which will happily return "nope" if the old
commit cannot be parsed (because it has only a boolean return value).

So I dunno. Maybe it is a losing battle to try to pass this kind of
corruption information up the stack.  I'm tempted to say that there
should just be a "paranoid" flag to globally die() whenever we see a
corruption (and you could run with it normally, but relax it whenever
you're investigating a broken repo). But I doubt even that works. Not
having the "old_oid" object at all would be a repo corruption here, but
how are the low-level routines supposed to know when a missing object is
a corruption and when it is not?

-Peff


Re: [PATCH 5/5] sha1_file: don't re-scan pack directory for null sha1

2017-11-21 Thread Jeff King
On Wed, Nov 22, 2017 at 10:49:25AM +0900, Junio C Hamano wrote:

> WRT existing codepaths that pass 0{40} and refuse to notice a
> potential repository corruption (from getting a NULL for a non-null
> object name), I think we would need a sweep of the codebase to fix
> them in the longer term.  As long as that will happen someday, either
> approach between "we know 'no loose object? let's redo the packs' is
> the part that matters performance-wise, so let's do a short-cut only
> for that" and "we know that callers that come with 0{40} want to get
> NULL back" should be OK, I would think.

I agree. Let's go with the "v2 5/5" I posted then.

I'll try to work up a patch for the fetch.c case I found tomorrow, but I
suspect there are many more. But that's largely orthogonal to the
series.

-Peff


Re: [PATCH 5/5] sha1_file: don't re-scan pack directory for null sha1

2017-11-21 Thread Junio C Hamano
Jeff King  writes:

> Here's a re-roll of patch 5 that behaves this way (the patch should be
> unsurprising, but I've updated the commit message to match).
>
> I did notice one other thing: the function looks up replace objects, so
> we have both the original and the replaced sha1 available. My earlier
> patch used the original sha1, but this one uses the replaced value.
> I'm not sure if that's sane or not. It lets the fast-path kick in if you
> point a replace ref at 0{40}. And it lets you point refs/replace/0{40}
> to something else. I doubt that's useful, but it could perhaps be for
> debugging, etc.
>
> In most cases, of course, it wouldn't matter (since pointing to or from
> the null sha1 is vaguely crazy in the first place).

I tend to agree that those who go crazy/fancy with replace mechanism
can keep both halves when it breaks.

WRT existing codepaths that pass 0{40} and refuse to notice a
potential repository corruption (from getting a NULL for a non-null
object name), I think we would need a sweep of the codebase to fix
them in the longer term.  As long as that will happen someday, either
approach between "we know 'no loose object? let's redo the packs' is
the part that matters performance-wise, so let's do a short-cut only
for that" and "we know that callers that come with 0{40} want to get
NULL back" should be OK, I would think.


Re: [PATCH 5/5] sha1_file: don't re-scan pack directory for null sha1

2017-11-21 Thread Junio C Hamano
Jeff King  writes:

> I'm not sure what the right behavior is, but I'm pretty sure that's not
> it. Probably one of:
>
>   - skip updating the ref when we see the breakage
>
>   - ditto, but terminate the whole operation, since we might be deleting
> other refs and in a broken repo we're probably best to make as few
> changes as possible
>
>   - behave as if it was a non-ff, which would allow "--force" to
> overwrite the broken ref. Maybe convenient for fixing things, but
> possibly surprising (and it's not that hard to just delete the
> broken refs manually before proceeding).

Perhaps the last one would be the ideal endgame, but the second one
may be a good stopping point in the shorter term.



Re: [PATCH 5/5] sha1_file: don't re-scan pack directory for null sha1

2017-11-21 Thread Jeff King
On Tue, Nov 21, 2017 at 02:20:19PM +0900, Junio C Hamano wrote:

> After queuing this series to an earlier part of 'pu' and resolving a
> conflict with jh/fsck-promisors topic, I ended up with a code that
> rejects 0{40} a lot earlier, before we try to see if a pack entry
> for 0{40} exists, even though the patch that is queued on this topic
> is more conservative (i.e. the above one).
> 
> Perhaps we would want to use the alternate version that declares the
> 0{40} is a sentinel that signals that there is no such object in
> this topic---that would give us a consistent semantics without
> having to adjust jh/fsck-promisors when it becomes ready to be
> merged.

I think we could adjust the patches without too much effort. That said,
I am very on-the-fence about whether to just declare the null sha1 as
"not a real object" and automatically return from the lookup without
doing any work. But if you agree that it's not likely to hurt anything,
it's IMHO a lot easier to reason about.

Here's a re-roll of patch 5 that behaves this way (the patch should be
unsurprising, but I've updated the commit message to match).

I did notice one other thing: the function looks up replace objects, so
we have both the original and the replaced sha1 available. My earlier
patch used the original sha1, but this one uses the replaced value.
I'm not sure if that's sane or not. It lets the fast-path kick in if you
point a replace ref at 0{40}. And it lets you point refs/replace/0{40}
to something else. I doubt that's useful, but it could perhaps be for
debugging, etc.

In most cases, of course, it wouldn't matter (since pointing to or from
the null sha1 is vaguely crazy in the first place).

-- >8 --
Subject: [PATCH v2 5/5] sha1_file: fast-path null sha1 as a missing object

In theory nobody should ever ask the low-level object code
for a null sha1. It's used as a sentinel for "no such
object" in lots of places, so leaking through to this level
is a sign that the higher-level code is not being careful
about its error-checking.  In practice, though, quite a few
code paths seem to rely on the null sha1 lookup failing as a
way to quietly propagate non-existence (e.g., by feeding it
to lookup_commit_reference_gently(), which then returns
NULL).

When this happens, we do two inefficient things:

  1. We actually search for the null sha1 in packs and in
 the loose object directory.

  2. When we fail to find it, we re-scan the pack directory
 in case a simultaneous repack happened to move it from
 loose to packed. This can be very expensive if you have
 a large number of packs.

Only the second one actually causes noticeable performance
problems, so we could treat them independently. But for the
sake of simplicity (both of code and of reasoning about it),
it makes sense to just declare that the null sha1 cannot be
a real on-disk object, and looking it up will always return
"no such object".

There's no real loss of functionality in doing so.  Its use as a
sentinel value means that anybody who is unlucky enough to
hit the 2^-160th chance of generating an object with that
sha1 is already going to find the object largely unusable.

In an ideal world, we'd simply fix all of the callers to
notice the null sha1 and avoid passing it to us. But a
simple experiment to catch this with a BUG() shows that
there are a large number of code paths that do so.

So in the meantime, let's fix the performance problem by
taking a fast exit from the object lookup when we see a null
sha1. p5551 shows off the improvement (when a fetched ref is
new, the "old" sha1 is 0{40}, which ends up being passed for
fast-forward checks, the status table abbreviations, etc):

  Test            HEAD^             HEAD
  ---------------------------------------------------------
  5551.4: fetch   5.51(5.03+0.48)   0.17(0.10+0.06) -96.9%

Signed-off-by: Jeff King 
---
 sha1_file.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/sha1_file.c b/sha1_file.c
index 8a7c6b7eba..ae4b71f445 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1152,6 +1152,9 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 				    lookup_replace_object(sha1) :
 				    sha1;
 
+	if (is_null_sha1(real))
+		return -1;
+
 	if (!oi)
 		oi = &blank_oi;
-- 
2.15.0.578.g35b419775f



Re: [PATCH 5/5] sha1_file: don't re-scan pack directory for null sha1

2017-11-21 Thread Jeff King
On Tue, Nov 21, 2017 at 11:37:28AM +0900, Junio C Hamano wrote:

> Jeff King  writes:
> 
> > In an ideal world, we'd simply fix all of the callers to
> > notice the null sha1 and avoid passing it to us. But a
> > simple experiment to catch this with a BUG() shows that
> > there are a large number of code paths.
> 
> Well, we can view this (or the alternative you sent later that does
> the same a bit earlier in the function) as "fixing the caller" that
> has already refactored the common logic to a helper function that all
> of these callers call into ;-).

Yes, I'm definitely tempted by that view. :)

What worries me, though, is that callers who lazily propagate the null
sha1 to the lookup functions cannot reasonably tell the difference
between "this object was corrupt or missing" and "we passed the null
sha1, and a missing object is expected".

For example, look at how fetch.c:update_local_ref() looks up objects.
It feeds the old and new sha1 to lookup_commit_reference_gently(), and
if either is NULL, it skips the fast-forward check. That makes sense if
we expect the null sha1 to get translated into a NULL commit. But it
also triggers for a broken ref, and we'd overwrite it (when the right
thing is probably refusing to update).

Here's a runnable example:

-- >8 --
git init parent
git -C parent commit --allow-empty -m base

git clone parent child
git -C parent commit --allow-empty -m more

cd child
rm -f .git/objects/??/*
git fetch
-- 8< --

That final fetch spews a bunch of errors about broken refs, and then
overwrites the value of origin/master, even though it's broken (in this
case it actually is a fast-forward, but the child repo doesn't even know
that).

I'm not sure what the right behavior is, but I'm pretty sure that's not
it. Probably one of:

  - skip updating the ref when we see the breakage

  - ditto, but terminate the whole operation, since we might be deleting
other refs and in a broken repo we're probably best to make as few
changes as possible

  - behave as if it was a non-ff, which would allow "--force" to
overwrite the broken ref. Maybe convenient for fixing things, but
possibly surprising (and it's not that hard to just delete the
broken refs manually before proceeding).

-Peff


Re: [PATCH 5/5] sha1_file: don't re-scan pack directory for null sha1

2017-11-20 Thread Junio C Hamano
Jeff King  writes:

> This is the minimal fix that addresses the performance issues.
> I'd actually have no problem at all declaring that looking up a null
> sha1 is insane, and having the object-lookup routines simply return "no
> such object" without even doing the loose/pack lookup first.
>
> diff --git a/sha1_file.c b/sha1_file.c
> index 8a7c6b7eba..dde0ad101d 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -1180,6 +1180,9 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
>  	if (!sha1_loose_object_info(real, oi, flags))
>  		return 0;
>  
> +	if (is_null_sha1(sha1))
> +		return -1;
> +
>  	/* Not a loose object; someone else may have just packed it. */
>  	if (flags & OBJECT_INFO_QUICK) {
>  		return -1;

After queuing this series to an earlier part of 'pu' and resolving a
conflict with jh/fsck-promisors topic, I ended up with a code that
rejects 0{40} a lot earlier, before we try to see if a pack entry
for 0{40} exists, even though the patch that is queued on this topic
is more conservative (i.e. the above one).

Perhaps we would want to use the alternate version that declares the
0{40} is a sentinel that signals that there is no such object in
this topic---that would give us a consistent semantics without
having to adjust jh/fsck-promisors when it becomes ready to be
merged.


Re: [PATCH 5/5] sha1_file: don't re-scan pack directory for null sha1

2017-11-20 Thread Junio C Hamano
Jeff King  writes:

> In an ideal world, we'd simply fix all of the callers to
> notice the null sha1 and avoid passing it to us. But a
> simple experiment to catch this with a BUG() shows that
> there are a large number of code paths.

Well, we can view this (or the alternative you sent later that does
the same a bit earlier in the function) as "fixing the caller" that
has already refactored the common logic to a helper function that all
of these callers call into ;-).



Re: [PATCH 5/5] sha1_file: don't re-scan pack directory for null sha1

2017-11-20 Thread Jeff King
On Mon, Nov 20, 2017 at 12:47:53PM -0800, Stefan Beller wrote:

> > This is the minimal fix that addresses the performance issues.
> > I'd actually have no problem at all declaring that looking up a null
> > sha1 is insane, and having the object-lookup routines simply return "no
> > such object" without even doing the loose/pack lookup first.
> >
> > I'm also fine with going down the BUG() route and fixing the
> > higher-level callers, but it's a big enough task (with little enough
> > real-world impact) that I think it would be worth applying this in the
> > meantime.
> 
> It would have a lot of impact in the future, when new developers
> are hindered from mis-using the API. The (unwritten) series introducing
> BUG() would help a lot in 'holding it right', and I would expect
> fewer performance regressions over time.

Yeah, I agree the BUG() route is the nicest. I started trying to hunt
down every code path causing a test failure, though, and soon got
overwhelmed.

I actually think it's not the worst thing in the world to say "it's OK
to look up the null sha1; the result is well-defined, and it does not
exist". It's perhaps philosophically a little ugly, but I think it's
quite practical.

> The patch is impressively small for such a performance gain.
> Personally I think (1) (which essentially means "making null sha1
> work like a regular sha1") is quite an endeavor at this point in time
> for this code base.

Yeah, it would be a crazy amount of effort to try to support that, for
very little gain. If we did want to just declare that the null sha1 does
not exist (even if you happen to have objects/00/00... in your
repository), the patch is similarly easy:

diff --git a/sha1_file.c b/sha1_file.c
index 8a7c6b7eba..754fe7f03e 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1152,6 +1152,9 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 				    lookup_replace_object(sha1) :
 				    sha1;
 
+	if (is_null_sha1(sha1))
+		return -1;
+
 	if (!oi)
 		oi = &blank_oi;
 

> As a tangent, a similar but solved problem in the diff code is how
> NUL in user data is treated in xdiff, for example; there we kept
> being careful since the beginning (though I think we don't have tests
> for it, so it might be broken).

There I think it's nice if we can be tolerant of NULs, because they're a
thing that might actually happen in user input. A real null sha1 has a
2^-160 chance of happening, and the design of sha1 is such that it's
hard for somebody to do it intentionally.

You might be able to get up to some mischief with replace objects,
though.

-Peff


Re: [PATCH 5/5] sha1_file: don't re-scan pack directory for null sha1

2017-11-20 Thread Stefan Beller
On Mon, Nov 20, 2017 at 12:35 PM, Jeff King  wrote:
> In theory nobody should ever ask the low-level object code
> for a null sha1. It's used as a sentinel for "no such
> object" in lots of places, so leaking through to this level
> is a sign that the higher-level code is not being careful
> about its error-checking.  In practice, though, quite a few
> code paths seem to rely on the null sha1 lookup failing as a
> way to quietly propagate non-existence (e.g., by feeding it
> to lookup_commit_reference_gently(), which then returns
> NULL).
>
> When this happens, we do two inefficient things:
>
>   1. We actually search for the null sha1 in packs and in
>  the loose object directory.
>
>   2. When we fail to find it, we re-scan the pack directory
>  in case a simultaneous repack happened to move it from
>  loose to packed.
>
> It's debatable whether (1) is a good idea or not. The
> performance drop is at least linear in the number of
> lookups, so it's not too bad. And if by some 2^-160th chance
> somebody does have such an object, we probably ought to
> access it. On the other hand, enough code paths treat the
> null sha1 specially that the object probably isn't usable,
> anyway.
>
> Problem (2), on the other hand, is a pretty severe
> performance issue. If you have a large number of packs,
> rescanning the pack directory can be very expensive. And it
> only helps in the case that you have an object with the null
> sha1 _and_ somebody was simultaneously repacking it. The
> tradeoff is certainly a bad one there.
>
> In an ideal world, we'd simply fix all of the callers to
> notice the null sha1 and avoid passing it to us. But a
> simple experiment to catch this with a BUG() shows that
> there are a large number of code paths.
>
> In the meantime, let's address problem (2). That's the
> minimal fix we can do to make the performance problems go
> away. p5551 shows this off (when a fetched ref is new, the
> "old" sha1 is 0{40}, which ends up being passed for
> fast-forward checks, the status table abbreviations, etc):
>
>   Test            HEAD^             HEAD
>   ---------------------------------------------------------
>   5551.4: fetch   5.51(5.03+0.48)   0.17(0.10+0.06) -96.9%
>
> We could address (1), too, but there's not much performance
> improvement left to make.
>
> Signed-off-by: Jeff King 
> ---
> This is the minimal fix that addresses the performance issues.
> I'd actually have no problem at all declaring that looking up a null
> sha1 is insane, and having the object-lookup routines simply return "no
> such object" without even doing the loose/pack lookup first.
>
> I'm also fine with going down the BUG() route and fixing the
> higher-level callers, but it's a big enough task (with little enough
> real-world impact) that I think it would be worth applying this in the
> meantime.

It would have a lot of impact in the future, when new developers
are hindered from mis-using the API. The (unwritten) series introducing
BUG() would help a lot in 'holding it right', and I would expect
fewer performance regressions over time.

The patch is impressively small for such a performance gain.
Personally I think (1) (which essentially means "making null sha1
work like a regular sha1") is quite an endeavor at this point in time
for this code base.

As a tangent, a similar but solved problem in the diff code is how
NUL in user data is treated in xdiff, for example; there we kept
being careful since the beginning (though I think we don't have tests
for it, so it might be broken).

Stefan


[PATCH 5/5] sha1_file: don't re-scan pack directory for null sha1

2017-11-20 Thread Jeff King
In theory nobody should ever ask the low-level object code
for a null sha1. It's used as a sentinel for "no such
object" in lots of places, so leaking through to this level
is a sign that the higher-level code is not being careful
about its error-checking.  In practice, though, quite a few
code paths seem to rely on the null sha1 lookup failing as a
way to quietly propagate non-existence (e.g., by feeding it
to lookup_commit_reference_gently(), which then returns
NULL).

When this happens, we do two inefficient things:

  1. We actually search for the null sha1 in packs and in
 the loose object directory.

  2. When we fail to find it, we re-scan the pack directory
 in case a simultaneous repack happened to move it from
 loose to packed.

It's debatable whether (1) is a good idea or not. The
performance drop is at least linear in the number of
lookups, so it's not too bad. And if by some 2^-160th chance
somebody does have such an object, we probably ought to
access it. On the other hand, enough code paths treat the
null sha1 specially that the object probably isn't usable,
anyway.

Problem (2), on the other hand, is a pretty severe
performance issue. If you have a large number of packs,
rescanning the pack directory can be very expensive. And it
only helps in the case that you have an object with the null
sha1 _and_ somebody was simultaneously repacking it. The
tradeoff is certainly a bad one there.

In an ideal world, we'd simply fix all of the callers to
notice the null sha1 and avoid passing it to us. But a
simple experiment to catch this with a BUG() shows that
there are a large number of code paths.

In the meantime, let's address problem (2). That's the
minimal fix we can do to make the performance problems go
away. p5551 shows this off (when a fetched ref is new, the
"old" sha1 is 0{40}, which ends up being passed for
fast-forward checks, the status table abbreviations, etc):

  Test            HEAD^             HEAD
  ---------------------------------------------------------
  5551.4: fetch   5.51(5.03+0.48)   0.17(0.10+0.06) -96.9%

We could address (1), too, but there's not much performance
improvement left to make.

Signed-off-by: Jeff King 
---
This is the minimal fix that addresses the performance issues.
I'd actually have no problem at all declaring that looking up a null
sha1 is insane, and having the object-lookup routines simply return "no
such object" without even doing the loose/pack lookup first.

I'm also fine with going down the BUG() route and fixing the
higher-level callers, but it's a big enough task (with little enough
real-world impact) that I think it would be worth applying this in the
meantime.

 sha1_file.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/sha1_file.c b/sha1_file.c
index 8a7c6b7eba..dde0ad101d 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1180,6 +1180,9 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 	if (!sha1_loose_object_info(real, oi, flags))
 		return 0;
 
+	if (is_null_sha1(sha1))
+		return -1;
+
 	/* Not a loose object; someone else may have just packed it. */
 	if (flags & OBJECT_INFO_QUICK) {
 		return -1;
-- 
2.15.0.494.g79a8547723