Re: [PATCH 1/7] cat-file: disable object/refname ambiguity check for batch mode

2013-07-14 Thread Junio C Hamano
Jeff King p...@peff.net writes:

 To cat-file we could add an option like --sha1-only or --literal or
 --no-dwim (... better names are failing me) which would skip *all*
 dwimming of 40-character strings.  It would also assume that any shorter
 strings are abbreviated SHA-1s and fail if they are not.  This would be
 a nice feature by itself (these are object names, dammit, and don't try
 to tell me differently!) and would have the additional small advantage
 of speeding up lookups of abbreviated SHA-1s, which (regardless of your
 patch) otherwise go through the whole DWIM process.

 I can see in theory that somebody might want that, but I am having a
 hard time thinking of a practical use.

Would it be a good alternative to call get_sha1_hex() to catch the
most common case (reading from rev-list output, for example) and
then let the more general get_sha1() to let extended SHA-1 to be
handled?

 IOW, it seems like a poor default, and we are choosing it only because
 of backwards compatibility. I guess another option is to switch the
 default with the usual deprecation dance.

I agree that did you mean the unreadable refname or 40-hex object?
turned on everywhere get_sha1() is called is a very poor default.  I
wonder if we can limit it only to the end-user input somehow at the
API level.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/7] cat-file: disable object/refname ambiguity check for batch mode

2013-07-14 Thread Jeff King
On Sun, Jul 14, 2013 at 08:45:37PM -0700, Junio C Hamano wrote:

  To cat-file we could add an option like --sha1-only or --literal or
  --no-dwim (... better names are failing me) which would skip *all*
  dwimming of 40-character strings.  It would also assume that any shorter
  strings are abbreviated SHA-1s and fail if they are not.  This would be
  a nice feature by itself (these are object names, dammit, and don't try
  to tell me differently!) and would have the additional small advantage
  of speeding up lookups of abbreviated SHA-1s, which (regardless of your
  patch) otherwise go through the whole DWIM process.
 
  I can see in theory that somebody might want that, but I am having a
  hard time thinking of a practical use.
 
 Would it be a good alternative to call get_sha1_hex() to catch the
 most common case (reading from rev-list output, for example) and
 then let the more general get_sha1() to let extended SHA-1 to be
 handled?

For a 40-byte sha1, that _should_ be what get_sha1 does (i.e., go more
or less directly to the 40-hex code path, and return). And that's
basically what happens now, except that after we do so, we now have the
extra oh, is it also a refname? check.

For a shortened sha1, I don't think it would have the same behavior.
Right now, I believe the order is to treat a short sha1 as a possible
refname, and only if that fails consider it as a short sha1.

  IOW, it seems like a poor default, and we are choosing it only because
  of backwards compatibility. I guess another option is to switch the
  default with the usual deprecation dance.
 
 I agree that did you mean the unreadable refname or 40-hex object?
 turned on everywhere get_sha1() is called is a very poor default.  I
 wonder if we can limit it only to the end-user input somehow at the
 API level.

It is easy to do on top of my patch (just flip the default on the switch
I introduced, and turn it back on in whichever code paths are
appropriate).  But the question is: what is end-user input? Do you mean
command-line arguments to rev-list and friends?

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/7] cat-file: disable object/refname ambiguity check for batch mode

2013-07-14 Thread Jeff King
On Fri, Jul 12, 2013 at 12:30:07PM +0200, Michael Haggerty wrote:

 But with particular respect to git cat-file, I see problems:
 
 1. get_ref_snapshot() would have to read all loose and packed refs
 within the specified subtree, because loose refs have to be read before
 packed refs.  So the call would be expensive if there are a lot of loose
 refs.  And DWIM wouldn't know in advance where the references might be,
 so it would have to set prefix=.  If many refs are looked up, then it
 would presumably be worth it.  But if only a couple of lookups are done
 and there are a lot of loose refs, then using a cache would probably
 slow things down.
 
 The slowdown could be ameliorated by adding some more intelligence, for
 example only populating the loose refs cache after a certain number of
 lookups have already been done.

I had assumed we would load the loose ref cache progressively by
directory. Of course that only helps avoiding refs/foo/* when you are
interested in refs/heads/foo. If your refs/heads/* is very large,
you have to load all of it, and at some point it is cheaper to do a few
stat() calls than to actually readdir() the whole directory. On the
other hand, that is basically how packed-refs work now (if we hit any
ref, we have to load the whole file), and repos with many refs would
typically pack them all anyway.

 2. A git cat-file --batch process can be long-lived.  What guarantees
 would users expect regarding its lookup results?

I hadn't really intended this for general lookups, but just to speed up
the refname warning, where being out-of-date is more acceptable (since
the warning is a purely informational hint).

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/7] cat-file: disable object/refname ambiguity check for batch mode

2013-07-12 Thread Michael Haggerty
On 07/12/2013 08:20 AM, Jeff King wrote:
 A common use of cat-file --batch-check is to feed a list
 of objects from rev-list --objects or a similar command.
 In this instance, all of our input objects are 40-byte sha1
 ids. However, cat-file has always allowed arbitrary revision
 specifiers, and feeds the result to get_sha1().
 
 Fortunately, get_sha1() recognizes a 40-byte sha1 before
 doing any hard work trying to look up refs, meaning this
 scenario should end up spending very little time converting
 the input into an object sha1. However, since 798c35f
 (get_sha1: warn about full or short object names that look
 like refs, 2013-05-29), when we encounter this case, we
 spend the extra effort to do a refname lookup anyway, just
 to print a warning. This is further exacerbated by ca91993
 (get_packed_ref_cache: reload packed-refs file when it
 changes, 2013-06-20), which makes individual ref lookup more
 expensive by requiring a stat() of the packed-refs file for
 each missing ref.
 
 With no patches, this is the time it takes to run:
 
   $ git rev-list --objects --all objects
   $ time git cat-file --batch-check='%(objectname)' objects
 
 on the linux.git repository:
 
   real1m13.494s
   user0m25.924s
   sys 0m47.532s
 
 If we revert ca91993, the packed-refs up-to-date check, it
 gets a little better:
 
   real0m54.697s
   user0m21.692s
   sys 0m32.916s
 
 but we are still spending quite a bit of time on ref lookup
 (and we would not want to revert that patch, anyway, which
 has correctness issues).  If we revert 798c35f, disabling
 the warning entirely, we get a much more reasonable time:
 
   real0m7.452s
   user0m6.836s
   sys 0m0.608s
 
 This patch does the moral equivalent of this final case (and
 gets similar speedups). We introduce a global flag that
 callers of get_sha1() can use to avoid paying the price for
 the warning.
 
 Signed-off-by: Jeff King p...@peff.net
 ---
 The solution feels a little hacky, but I'm not sure there is a better
 one short of reverting the warning entirely.
 
 We could also tie it to warn_ambiguous_refs (or add another config
 variable), but I don't think that makes sense. It is not about do I
 care about ambiguities in this repository, but rather I am going to
 do a really large number of sha1 resolutions, and I do not want to pay
 the price in this particular code path for the warning).
 
 There may be other sites that resolve a large number of refs and run
 into this, but I couldn't think of any. Doing for_each_ref would not
 have the same problem, as we already know they are refs there.

ISTM that callers of --batch-check would usually know whether they are
feeding it SHA-1s or arbitrary object specifiers.  (And in most cases
when there are a large number of them, they are probably all SHA-1s).

So instead of framing this change as skip object refname ambiguity
check (i.e., sacrifice some correctness for performance), perhaps it
would be less hacky to offer the following new feature:

To cat-file we could add an option like --sha1-only or --literal or
--no-dwim (... better names are failing me) which would skip *all*
dwimming of 40-character strings.  It would also assume that any shorter
strings are abbreviated SHA-1s and fail if they are not.  This would be
a nice feature by itself (these are object names, dammit, and don't try
to tell me differently!) and would have the additional small advantage
of speeding up lookups of abbreviated SHA-1s, which (regardless of your
patch) otherwise go through the whole DWIM process.

I understand that such an option alone would leave some old scripts
suffering the performance regression caused by the check, but in most
cases it would be easy to fix them.

Michael

-- 
Michael Haggerty
mhag...@alum.mit.edu
http://softwareswirl.blogspot.com/
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/7] cat-file: disable object/refname ambiguity check for batch mode

2013-07-12 Thread Jeff King
On Fri, Jul 12, 2013 at 10:47:46AM +0200, Michael Haggerty wrote:

  The solution feels a little hacky, but I'm not sure there is a better
  one short of reverting the warning entirely.
  
  We could also tie it to warn_ambiguous_refs (or add another config
  variable), but I don't think that makes sense. It is not about do I
  care about ambiguities in this repository, but rather I am going to
  do a really large number of sha1 resolutions, and I do not want to pay
  the price in this particular code path for the warning).
  
  There may be other sites that resolve a large number of refs and run
  into this, but I couldn't think of any. Doing for_each_ref would not
  have the same problem, as we already know they are refs there.
 
 ISTM that callers of --batch-check would usually know whether they are
 feeding it SHA-1s or arbitrary object specifiers.  (And in most cases
 when there are a large number of them, they are probably all SHA-1s).

Thanks for bringing this up; I had meant to outline this option in the
cover letter, but forgot when posting.

If were designing cat-file from scratch, I'd consider having --batch
take just 40-hex sha1s in the first place. Since we can't do that now
for backwards compatibility, having an option is the best we can do
there.

But having to use such an option to avoid a 10x slowdown just strikes me
as ugly for two reasons:

  1. It's part of the user-visible interface, and it seems like an extra
 wtf? moment when somebody wonders why their script is painfully
 slow, and why they need a magic option to fix it. We're cluttering
 the interface forever.

  2. It's a sign that the refname check was put in for a specific
 performance profile (resolving one or a few refs), but didn't
 consider resolving a large number. I'm wondering what other
 performance regressions it's possible to trigger.

 So instead of framing this change as skip object refname ambiguity
 check (i.e., sacrifice some correctness for performance), perhaps it
 would be less hacky to offer the following new feature:

I wouldn't frame it as correctness for performance at all. The check
does not impact behavior at all, and is purely informational (to catch
quite a rare situation, too). I'd argue that the check simply isn't that
interesting in this case in the first place.

 To cat-file we could add an option like --sha1-only or --literal or
 --no-dwim (... better names are failing me) which would skip *all*
 dwimming of 40-character strings.  It would also assume that any shorter
 strings are abbreviated SHA-1s and fail if they are not.  This would be
 a nice feature by itself (these are object names, dammit, and don't try
 to tell me differently!) and would have the additional small advantage
 of speeding up lookups of abbreviated SHA-1s, which (regardless of your
 patch) otherwise go through the whole DWIM process.

I can see in theory that somebody might want that, but I am having a
hard time thinking of a practical use.

 I understand that such an option alone would leave some old scripts
 suffering the performance regression caused by the check, but in most
 cases it would be easy to fix them.

I'm less worried about old scripts, and more worried about carrying
around a confusing option forever, especially when I expect most
invocations to use it (perhaps my experience is biased, but I use
cat-file --batch quite a lot in scripting, and it has almost always been
on the output of rev-list).

IOW, it seems like a poor default, and we are choosing it only because
of backwards compatibility. I guess another option is to switch the
default with the usual deprecation dance.

Yet another option is to consider what the check is doing, and
accomplish the same thing in a different way. The real pain is that we
are individually trying to resolve each object by hitting the filesystem
(and doing lots of silly checks on the refname format, when we know it
must be valid).

We don't actually care in this case if the ref list is up to date (we
are not trying to update or read a ref, but only know if it exists, and
raciness is OK). IOW, could we replace the dwim_ref call for the warning
with something that directly queries the ref cache?

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/7] cat-file: disable object/refname ambiguity check for batch mode

2013-07-12 Thread Michael Haggerty
On 07/12/2013 11:22 AM, Jeff King wrote:
 Yet another option is to consider what the check is doing, and
 accomplish the same thing in a different way. The real pain is that we
 are individually trying to resolve each object by hitting the filesystem
 (and doing lots of silly checks on the refname format, when we know it
 must be valid).
 
 We don't actually care in this case if the ref list is up to date (we
 are not trying to update or read a ref, but only know if it exists, and
 raciness is OK). IOW, could we replace the dwim_ref call for the warning
 with something that directly queries the ref cache?

I think it would be quite practical to add an API something like

struct ref_snapshot *get_ref_snapshot(const char *prefix)
void release_ref_snapshot(struct ref_snapshot *)
int lookup_ref(struct ref_snapshot *, const char *refname,
   unsigned char *sha1, int *flags)

where prefix is the part of the refs tree that you want included in the
snapshot (e.g., refs/heads) and ref_snapshot is probably opaque
outside of the refs module.

Symbolic refs, which are currently not stored in the ref_cache, would
have to be added because otherwise we would have to do all of the
lookups anyway.

I think this would be a good step to take for many reasons, including
because it would be another useful step in the direction of ref
transactions.

But with particular respect to git cat-file, I see problems:

1. get_ref_snapshot() would have to read all loose and packed refs
within the specified subtree, because loose refs have to be read before
packed refs.  So the call would be expensive if there are a lot of loose
refs.  And DWIM wouldn't know in advance where the references might be,
so it would have to set prefix=.  If many refs are looked up, then it
would presumably be worth it.  But if only a couple of lookups are done
and there are a lot of loose refs, then using a cache would probably
slow things down.

The slowdown could be ameliorated by adding some more intelligence, for
example only populating the loose refs cache after a certain number of
lookups have already been done.

2. A git cat-file --batch process can be long-lived.  What guarantees
would users expect regarding its lookup results?  Currently, its ref
lookups reflect the state of the repo at the moment the commit
identifier is written into the pipe.  Using a cache like this would mean
that ref lookups would always reflect the snapshot taken at the start of
the git cat-file run, regardless of whether the script using it might
have added or modified some references since then.  I think this would
have to be considered a regression.

Michael

-- 
Michael Haggerty
mhag...@alum.mit.edu
http://softwareswirl.blogspot.com/
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html