Re: [WIP PATCH] Manual rename correction

2012-08-02 Thread Nguyen Thai Ngoc Duy
On Thu, Aug 2, 2012 at 4:27 AM, Jeff King p...@peff.net wrote:
 Yes, if you go with a commit-based approach, you can do either notes or
 in-commit messages. In other words, I would break the solutions down as:

   1. Store sha1+sha1 - score mapping (i.e., what I suggested). This is
  fundamentally a global store, not a per-commit store. For storage,
  you can do one (or a combination) of:

  a. Store the mapping in some local file. Fast, but can't be shared.

  b. Store the mapping in a note (probably indexed by the destination
 blob sha1). Less efficient, but easy to share.

 I implemented (1a). Implementing (1b) would be easy, but for a full-on
 cache (especially for -C), I think the resulting size might be
 prohibitive.

(1a) is good regardless rename overrides. Why don't you polish and
submit it? We can set some criteria to limit the cache size while
keeping computation reasonably low. Caching rename scores for file
pairs that has file size larger than a limit is one. Rename matrix
size could also be a candidate. We could even cache just rename scores
for recent commits (i.e. close to heads) only with the assumption that
people diff/apply recent commits more often.


 All solutions under (2) suffer from the same problem: they are accurate
 only for a single diff. For other diffs, you would either have to not
 use the feature, or you would be stuck traversing the history and
 assigning a temporary file identity (e.g., given commits A-B-C, and in
 A-B we rename foo to bar, the diff between A and C could discover
 that A's foo corresponds to C's bar).

Yeah. If we go with manual overrides, I expect users to deal with
these manually too. IOW they'll need to create a mapping for A-C
themselves. We can help detect that there are manual overrides in some
cases, like merge, and let users know that manual overrides are
ignored. For merge, I think we can just check for all commits while
traversing looking for bases.

 For this reason, I'm not sure that stored overrides like this are
 generally useful in the long run. I think storage is useful for
 _caching_ the results, because it doesn't have to be perfect; it just
 helps with some repetitive queries. Whereas for overriding, I think it
 is much more interesting to override _particular_ diff. E.g., to say I
 am merging X and Y, and please pretend that Y renamed foo to bar
 when you do rename detection.

 And in that sense, your git log example can be considered a
 special-case of this: you are saying that the diff from $commit to
 $commit^ is done frequently, so rather than saying please pretend...
 each time, you would like to store the information forever. And storing
 it in the commit message or a note is one way of doing that.

Yep, specifying rename overrides between two trees is probably better.

 I don't think there's anything fundamentally _wrong_ with that, but I
 kind of question its usefulness. In other words, what is the point in
 doing so? If it is inform the user that semantically the commit did a
 rename, even though the content changed enough that rename detection
 does not find it, then I would argue that you should simply state it in
 the commit message (or in a human-readable git-note, if it was only
 realized after the fact).

 But there is not much point in making it machine-readable, since the
 interesting machine-readable things we do with renames are:

   1. Show the diff against the rename src, which can often be easier to
  read. Except that if rename detection did not find it, it is
  probably _not_ going to be easier to read.

Probably. Still it helps git log --follow to follow the correct
track in the 1% case that rename detection does go wrong.


   2. Applying content to the destination of a merge. But you're almost
  never doing the diff between a commit and its parent, so the
  information would be useless.

Having a way to interfere rename detection, even manually, could be
good in this case if it reduces conflicts. We could feed rename
overrides using command line.
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [WIP PATCH] Manual rename correction

2012-08-02 Thread Jeff King
On Wed, Aug 01, 2012 at 03:10:55PM -0700, Junio C Hamano wrote:

 Jeff King p...@peff.net writes:
 
  On Tue, Jul 31, 2012 at 11:01:27PM -0700, Junio C Hamano wrote:
  ...
  As we still have the pathname in this codepath, I am wondering if we
  would benefit from custom content hash that knows the nature of
  payload than the built-in similarity estimator, driven by the
  attribute mechanism (if the latter is the case, that is).
 
  Hmm. Interesting. But I don't think that attributes are a good fit here.
  They are pathname based, so how do I apply anything related to
  similarity of a particular version by pathname? IOW, how does it apply
  in one tree but not another?
 
 When you move porn/0001.jpg in the preimage to naughty/1.jpg in
 the postimage, they both can hit *.jpg contentid=jpeg line in the
 top-level .gitattribute file, and the contentid driver for jpeg type
 may strip exif and hash the remainder bits in the image to come up
 with a token you can use in a similar way as object ID is used in
 the exact rename detection phase.
 
 Just thinking aloud.

Ah, I see. That still feels like way too specific a use case to me. A
much more general use case to me would be a contentid driver which
splits the file into multiple chunks (which can be concatenated to
arrive at the original content), and marks chunks as OK to delta or
not able to delta.  In other words, a content-specific version of the
bup-style splitting that people have proposed.

Assuming we split a jpeg into its EXIF bits (+delta) and its image bits
(-delta), then you could do a fast rename or pack-objects comparison
between two such files (in fact, with chunked object storage,
pack-objects can avoid looking at the image parts at all).

However, it may be the case that such smart splitting is not
necessary, as stupid and generic bup-style splitting may be enough. I
really need to start playing with the patches you wrote last year that
started in that direction.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [WIP PATCH] Manual rename correction

2012-08-02 Thread Jeff King
On Thu, Aug 02, 2012 at 07:08:25PM +0700, Nguyen Thai Ngoc Duy wrote:

  I implemented (1a). Implementing (1b) would be easy, but for a full-on
  cache (especially for -C), I think the resulting size might be
  prohibitive.
 
 (1a) is good regardless rename overrides. Why don't you polish and
 submit it? We can set some criteria to limit the cache size while
 keeping computation reasonably low. Caching rename scores for file
 pairs that has file size larger than a limit is one. Rename matrix
 size could also be a candidate. We could even cache just rename scores
 for recent commits (i.e. close to heads) only with the assumption that
 people diff/apply recent commits more often.

I'll polish and share it. I'm still not 100% sure it's a good idea,
because introducing an on-disk cache means we need to _manage_ that
cache. How big will it be? Who will prune it when it gets too big? By
what criteria? And so on.

But if it's all hidden behind a config option, then it won't hurt people
who don't use it. And people who do use it can gather data on how the
caches grow.

  All solutions under (2) suffer from the same problem: they are accurate
  only for a single diff. For other diffs, you would either have to not
  use the feature, or you would be stuck traversing the history and
  assigning a temporary file identity (e.g., given commits A-B-C, and in
  A-B we rename foo to bar, the diff between A and C could discover
  that A's foo corresponds to C's bar).
 
 Yeah. If we go with manual overrides, I expect users to deal with
 these manually too. IOW they'll need to create a mapping for A-C
 themselves. We can help detect that there are manual overrides in some
 cases, like merge, and let users know that manual overrides are
 ignored. For merge, I think we can just check for all commits while
 traversing looking for bases.

Yeah, merges are a special case, in that we know the diff we perform
will always have a direct-ancestor relationship (since it is always
between a tip and the merge base).

  But there is not much point in making it machine-readable, since the
  interesting machine-readable things we do with renames are:
 
1. Show the diff against the rename src, which can often be easier to
   read. Except that if rename detection did not find it, it is
   probably _not_ going to be easier to read.
 
 Probably. Still it helps git log --follow to follow the correct
 track in the 1% case that rename detection does go wrong.

Thanks. I didn't think of --follow, but that is a good counterpoint to
my argument.

2. Applying content to the destination of a merge. But you're almost
   never doing the diff between a commit and its parent, so the
   information would be useless.
 
 Having a way to interfere rename detection, even manually, could be
 good in this case if it reduces conflicts. We could feed rename
 overrides using command line.

Yeah. I think I'd start with letting you feed pairs to diff_options,
give it a command-line option to see how useful it is, and then later on
consider a mechanism for extracting those pairs automatically from
commits or notes.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [WIP PATCH] Manual rename correction

2012-08-02 Thread Junio C Hamano
Jeff King p...@peff.net writes:

 On Wed, Aug 01, 2012 at 03:10:55PM -0700, Junio C Hamano wrote:
 ...
 When you move porn/0001.jpg in the preimage to naughty/1.jpg in
 the postimage, they both can hit *.jpg contentid=jpeg line in the
 top-level .gitattribute file, and the contentid driver for jpeg type
 may strip exif and hash the remainder bits in the image to come up
 with a token you can use in a similar way as object ID is used in
 the exact rename detection phase.
 
 Just thinking aloud.

 Ah, I see. That still feels like way too specific a use case to me. A
 much more general use case to me would be a contentid driver which
 splits the file into multiple chunks (which can be concatenated to
 arrive at the original content), and marks chunks as OK to delta or
 not able to delta.  In other words, a content-specific version of the
 bup-style splitting that people have proposed.

 Assuming we split a jpeg into its EXIF bits (+delta) and its image bits
 (-delta), then you could do a fast rename or pack-objects comparison
 between two such files (in fact, with chunked object storage,
 pack-objects can avoid looking at the image parts at all).

 However, it may be the case that such smart splitting is not
 necessary, as stupid and generic bup-style splitting may be enough. I
 really need to start playing with the patches you wrote last year that
 started in that direction.

I wasn't interested in packing split object representation,
actually.  The idea was still within the context of rename.

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [WIP PATCH] Manual rename correction

2012-08-02 Thread Jeff King
On Thu, Aug 02, 2012 at 03:51:17PM -0700, Junio C Hamano wrote:

  On Wed, Aug 01, 2012 at 03:10:55PM -0700, Junio C Hamano wrote:
  ...
  When you move porn/0001.jpg in the preimage to naughty/1.jpg in
  the postimage, they both can hit *.jpg contentid=jpeg line in the
  top-level .gitattribute file, and the contentid driver for jpeg type
  may strip exif and hash the remainder bits in the image to come up
  with a token you can use in a similar way as object ID is used in
  the exact rename detection phase.
  
  Just thinking aloud.
 
  Ah, I see. That still feels like way too specific a use case to me. A
  much more general use case to me would be a contentid driver which
  splits the file into multiple chunks (which can be concatenated to
  arrive at the original content), and marks chunks as OK to delta or
  not able to delta.  In other words, a content-specific version of the
  bup-style splitting that people have proposed.
 
  Assuming we split a jpeg into its EXIF bits (+delta) and its image bits
  (-delta), then you could do a fast rename or pack-objects comparison
  between two such files (in fact, with chunked object storage,
  pack-objects can avoid looking at the image parts at all).
 
  However, it may be the case that such smart splitting is not
  necessary, as stupid and generic bup-style splitting may be enough. I
  really need to start playing with the patches you wrote last year that
  started in that direction.
 
 I wasn't interested in packing split object representation,
 actually.  The idea was still within the context of rename.

But it would work for rename, too. If you want to compare two files, the
driver would give you back { sha1_exif (+delta), sha1_image (-delta) }
for each file. You know the size of each chunk and the size of the total
file.

Then you would just compare sha1_image for each entry. If they match,
then you have a lower bound on similarity of image_chunk_size /
total_size. If they don't, then you have an upper bound of similarity of
1-(image_chunk_size/total_size). In the former case, you can get the
exact similarity by doing a real delta on the sha1_exif content. In the
latter case, you can either exit early (if you are already below the
similarity threshold, which is likely), or possibly do the delta on the
sha1_exif content to get an exact value.

But either way, you never had to do a direct comparison between the big
image data; you only needed to know the sha1s. And as a bonus, if you
did want to cache results, you can have an O(# of blobs) cache of the
chunked sha1s of the chunked form (because the information is immutable
for a given sha1 and content driver). Whereas by caching the result of
estimate_similarity, our worst-case cache is the square of that (because
we are storing sha1 pairs).

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [WIP PATCH] Manual rename correction

2012-08-01 Thread Junio C Hamano
Jeff King p...@peff.net writes:

 @@ -175,6 +177,11 @@ static int estimate_similarity(struct diff_filespec *src,
   if (max_size * (MAX_SCORE-minimum_score)  delta_size * MAX_SCORE)
   return 0;
  
 + hashcpy(pair.one, src-sha1);
 + hashcpy(pair.two, dst-sha1);
 + if (rename_cache_get(pair, score))
 + return score;
 +

Random thoughts.

Even though your rename cache could be used to reject pairing that
the similarity estimator would otherwise give high score, I would
imagine that in practice, people would always use the mechanism to
boost the similarity score of desired pairing.  This conjecture has
a few interesting implications.

 - As we track of only the top NUM_CANDIDATE_PER_DST rename src for
   each dst (see record_if_better()), you should be able to first
   see if pairs that have dst exist in your rename cache, and
   iterate over the src,dst pairs, filling m[] with srcs that
   appear in this particular invocation of diff.

 - If you find NUM_CANDIDATE_PER_DST srcs from your rename cache,
   you wouldn't have to run estimate_similarity() at all, but that
   is very unlikely.  We could however declare that user configured
   similarity boost always wins computed ones, and skip estimation
   for a dst for which you find an entry in the rename cache.

 - As entries in rename cache that record high scores have names of
   similar blobs, pack-objects may be able to take advantage of
   this information.

 - If you declare blobs A and B are similar, it is likely that blobs
   C, D, E, ... that are created by making a series of small tweaks
   to B are also similar.  Would it make more sense to introduce a
   concept of set of similar blobs instead of recording pairwise
   scores for (A,B), (A,C), (A,D), ... (B,C), (B,D), ...?  If so,
   the body of per-dst loop in diffcore_rename() may become:

if (we know where dst came from)
continue;
if (dst belongs to a known blob family) {
for (each src in rename_src[]) {
if (src belongs to the same blob family as dst)
record it in m[];
}
}
if (the above didn't record anything in m[]) {
... existing estimate_similarity() code ...
}

Regarding your rename-and-tweak-exif photo sets, is the issue that
there are too many rename src/dst candidates and filling a large
matrix takes a lot of time, or tweaking exif makes the contents
unnecessarily dissimilar and causes the similarity detection to
fail?  As we still have the pathname in this codepath, I am
wondering if we would benefit from custom content hash that knows
the nature of payload than the built-in similarity estimator, driven
by the attribute mechanism (if the latter is the case, that is).
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [WIP PATCH] Manual rename correction

2012-08-01 Thread Junio C Hamano
Nguyen Thai Ngoc Duy pclo...@gmail.com writes:

 Yes. This is probably cosmetics only, but without path information, we
 leave it to chance to decide which A to pair with B and C (in the
 A-B, A-C case above). Wrong path might lead to funny effects (i'm
 thinking of git log --follow).

Isn't that why A,B and A,C can have different scores per object
name pair?  And if you mean by B and C the paths not object names,
and the blob at B and C are indeed identical, why would it matter?

 There's also the problem with transferring this information. With
 git-notes I think I can transfer it (though not automatically). How do
 we transfer sha1 map (that you mentioned in the commit generation mail
 in this thread)?

 I wasn't clear. This is about transferring info across repositories.

It is simple to define a portable format to record Jeff's rename
cache database (it could be a text blob full of x40 x40 score
lines), point such a blob with a special ref, and write a Porcelain
to transfer and merge across repositories, e.g.

git fetch $there refs/rename-cache
new=$(
(
git cat-file blob refs/rename-cache
git cat-file blob FETCH_HEAD
) | sort | uniq | git hash-object -w --stdin)
git update-ref refs/rename-cache $new

And as Jeff said, that is orthogonal to the issue of what is being
stored.  The contents of refs/rename-cache blob could be compiled
into a binary form for quicker access at runtime in the Porcelain.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [WIP PATCH] Manual rename correction

2012-08-01 Thread Nguyen Thai Ngoc Duy
On Wed, Aug 1, 2012 at 1:09 PM, Junio C Hamano gits...@pobox.com wrote:
 Nguyen Thai Ngoc Duy pclo...@gmail.com writes:

 Yes. This is probably cosmetics only, but without path information, we
 leave it to chance to decide which A to pair with B and C (in the
 A-B, A-C case above). Wrong path might lead to funny effects (i'm
 thinking of git log --follow).

 Isn't that why A,B and A,C can have different scores per object
 name pair?  And if you mean by B and C the paths not object names,
 and the blob at B and C are indeed identical, why would it matter?

I don't see how scores affect that. Suppose I have two trees and a
rename-cache file:

$ git ls-tree -r HEAD^
100644 blob d00491path1/foo
100644 blob d00491path2/bar
$ git ls-tree -r HEAD
100644 blob 0cfbf0path1/fooo
100644 blob 00750epath2/barr
$ cat rename-cache
d00491 0cfbf0 score 1
d00491 00750e score 2

How can I be sure git diff HEAD^! will rename path1/foo =
path1/fooo and path2/bar = path2/barr, not path1/foo = path2/barr
and path2/bar = path1/fooo?
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [WIP PATCH] Manual rename correction

2012-08-01 Thread Jeff King
On Wed, Aug 01, 2012 at 01:34:23PM +0700, Nguyen Thai Ngoc Duy wrote:

 On Wed, Aug 1, 2012 at 1:09 PM, Junio C Hamano gits...@pobox.com wrote:
  Nguyen Thai Ngoc Duy pclo...@gmail.com writes:
 
  Yes. This is probably cosmetics only, but without path information, we
  leave it to chance to decide which A to pair with B and C (in the
  A-B, A-C case above). Wrong path might lead to funny effects (i'm
  thinking of git log --follow).
 
  Isn't that why A,B and A,C can have different scores per object
  name pair?  And if you mean by B and C the paths not object names,
  and the blob at B and C are indeed identical, why would it matter?
 
 I don't see how scores affect that. Suppose I have two trees and a
 rename-cache file:
 
 $ git ls-tree -r HEAD^
 100644 blob d00491path1/foo
 100644 blob d00491path2/bar
 $ git ls-tree -r HEAD
 100644 blob 0cfbf0path1/fooo
 100644 blob 00750epath2/barr
 $ cat rename-cache
 d00491 0cfbf0 score 1
 d00491 00750e score 2
 
 How can I be sure git diff HEAD^! will rename path1/foo =
 path1/fooo and path2/bar = path2/barr, not path1/foo = path2/barr
 and path2/bar = path1/fooo?

You can't. A pathless mapping can never represent that. I think my (and
Junio's argument) is more like: why would you care? And that gets back
to the statement I just made elsewhere in the thread. Which is that you
need to consider the purpose of the rename detection. For generating
diffs and for merging, it doesn't matter in the above case. If the point
is to communicate some semantics of the change (e.g., if your change was
double the final letter of each filename, and also make a small change),
then that is what the commit message is for.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [WIP PATCH] Manual rename correction

2012-08-01 Thread Jeff King
On Tue, Jul 31, 2012 at 11:01:27PM -0700, Junio C Hamano wrote:

  @@ -175,6 +177,11 @@ static int estimate_similarity(struct diff_filespec 
  *src,
  if (max_size * (MAX_SCORE-minimum_score)  delta_size * MAX_SCORE)
  return 0;
   
  +   hashcpy(pair.one, src-sha1);
  +   hashcpy(pair.two, dst-sha1);
  +   if (rename_cache_get(pair, score))
  +   return score;
  +
 
 Random thoughts.
 
 Even though your rename cache could be used to reject pairing that
 the similarity estimator would otherwise give high score, I would
 imagine that in practice, people would always use the mechanism to
 boost the similarity score of desired pairing.  This conjecture has
 a few interesting implications.

I agree that is the probable use. But low scores might also have a
purpose in preventing useless work from being done (IOW, storing a 0
score is unlikely to happen manually, but it does let us avoid repeating
the calculation).

As you might have guessed, I am really more interested in caching and
performance gains than in manual overrides.

  - As we track of only the top NUM_CANDIDATE_PER_DST rename src for
each dst (see record_if_better()), you should be able to first
see if pairs that have dst exist in your rename cache, and
iterate over the src,dst pairs, filling m[] with srcs that
appear in this particular invocation of diff.

Yes, although I don't know if it makes a big difference. We still run
estimate_similarity on the candidates to see if they make it into
record_if_better, and that is the expensive part. I don't think you want
to change that either (e.g., by assuming that cached scores will always
be in the top N; it would depend on the particular tree that the cached
scores came from).

Of course, that changes if it is not a cache at all, and simply a manual
override. Or if manual overrides are stored separately and take effect
first.

But I do not think you even want to care about NUM_CANDIDATE_PER_DST for
manual overrides. You simply want to take the highest scored one,
consider the it done, and not run estimate_similarity at all. Even if
you only had 1 (and NUM_CANDIDATE_PER_DST were 4), there is still no
point in looking at the others. The manual override should beat it,
regardless of score.

  - If you find NUM_CANDIDATE_PER_DST srcs from your rename cache,
you wouldn't have to run estimate_similarity() at all, but that
is very unlikely.  We could however declare that user configured
similarity boost always wins computed ones, and skip estimation
for a dst for which you find an entry in the rename cache.

Right. See above.

  - As entries in rename cache that record high scores have names of
similar blobs, pack-objects may be able to take advantage of
this information.

Yeah, although I suspect it is not as big a win as you might hope.  In
practice, you're going to diff a lot fewer objects than pack-objects
would consider, so most of the pack-objects candidate pairs will not
have a cache entry.  Which means that you really need to not rely on the
cache (since it is very likely that the best candidate is still to be
found), and you still do lots of computation.

You could cache the result of comparisons done by pack-objects, but that
cache ends up being large. You might remember that one of my very first
patches was a cache for recording negative delta finds (so we try to
delta A against B and find that it is not a good match, and record that
information). Even that cache was huge, and we ended up discarding it in
favor of Linus's much more sensible if two things are in the same pack
and not delta'd, then either they are not a good match, or something
else is in better heuristic. But one with full-on similarity scores
would be even bigger.

  - If you declare blobs A and B are similar, it is likely that blobs
C, D, E, ... that are created by making a series of small tweaks
to B are also similar.  Would it make more sense to introduce a
concept of set of similar blobs instead of recording pairwise
scores for (A,B), (A,C), (A,D), ... (B,C), (B,D), ...?  If so,
the body of per-dst loop in diffcore_rename() may become:

Yes, this is the transitive bit I mentioned elsewhere. I think there are
dragons lurking there, as you end up with the edges of the set not
_really_ being that close to each other. You'd probably want some kind
of weighted association, like if A-B had 80% similarity, and B-C had
90% similarity, then A-C would be 72%. But even that is not accurate;
it is just a lower bound. The differences between A-B and B-C might
overlap, and A-C might be much more similar.

You can't know the true value without actually finding the score for
A-C.  And since we output the scores, they should probably be accurate.

 Regarding your rename-and-tweak-exif photo sets, is the issue that
 there are too many rename src/dst candidates and filling a large
 matrix takes a lot of time, or tweaking exif makes the contents
 unnecessarily 

Re: [WIP PATCH] Manual rename correction

2012-08-01 Thread Junio C Hamano
Jeff King p...@peff.net writes:

 On Tue, Jul 31, 2012 at 11:01:27PM -0700, Junio C Hamano wrote:

  - As entries in rename cache that record high scores have names of
similar blobs, pack-objects may be able to take advantage of
this information.

 Yeah, although I suspect it is not as big a win as you might hope.  In
 practice, you're going to diff a lot fewer objects than pack-objects
 would consider, so most of the pack-objects candidate pairs will not
 have a cache entry.  Which means that you really need to not rely on the
 cache (since it is very likely that the best candidate is still to be
 found), and you still do lots of computation.

 You could cache the result of comparisons done by pack-objects, but that
 cache ends up being large. You might remember that one of my very first
 patches was a cache for recording negative delta finds (so we try to
 delta A against B and find that it is not a good match, and record that
 information). Even that cache was huge, and we ended up discarding it in
 favor of Linus's much more sensible if two things are in the same pack
 and not delta'd, then either they are not a good match, or something
 else is in better heuristic. But one with full-on similarity scores
 would be even bigger.

Yeah, I forgot about your negative cache.  Also the older companion
heuristics if a delta and its base are in an existing pack, use
that delta works well enough to make a positive cache unnecessary
(and the rename similarity cache can only be a subset of such a
cache).

C.f. http://thread.gmane.org/gmane.comp.version-control.git/16223/focus=16267

  - If you declare blobs A and B are similar, it is likely that blobs
C, D, E, ... that are created by making a series of small tweaks
to B are also similar.  Would it make more sense to introduce a
concept of set of similar blobs instead of recording pairwise
scores for (A,B), (A,C), (A,D), ... (B,C), (B,D), ...?  If so,
the body of per-dst loop in diffcore_rename() may become:

 Yes, this is the transitive bit I mentioned elsewhere. I think there are
 dragons lurking there, as you end up with the edges of the set not
 _really_ being that close to each other. You'd probably want some kind
 of weighted association, like if A-B had 80% similarity, and B-C had
 90% similarity, then A-C would be 72%.

I recall we discussed the transitivity when we were designing the
basic rename thing, and rejected it after we realized that it was
unworkable.  You may have a blob A, perform a large-ish edit to
produce B, and A and B may be similar by only 60%.  You then may
start from the same A, perform another large-ish edit to produce C,
and A and C may be similar by only 55%.  Depending on the nature of
these two large-ish changes, B and C may be totally dissimilar, or
if the changes are more or less in the same direction, they may be
95% similar.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [WIP PATCH] Manual rename correction

2012-07-31 Thread Junio C Hamano
Nguyen Thai Ngoc Duy pclo...@gmail.com writes:

 The above output is done with git diff --manual-rename=foo A B
 and foo contains (probably not in the best format though)

 -- 8 --
 attr.c dir.c
 dir.c attr.c
 -- 8 --
 ...
 Comments?

It is a good direction to go in, I would think, to give users a way
to explicitly tell that in comparison between these two trees, I
know path B in the postimage corresponds to path A in the preimage.

I however wonder why you did this as a separate function that only
does the explicitly marked ones.  Probably it was easier as a POC to
do it this way, and that is fine.

The real version should do this in the same diffcore_rename()
function, by excluding the paths that the user explicitly told you
about from the the automatic matching logic, and instead matching
them up manually; then you can let the remainder of the paths be
paired by the existing code.

Notice how the non-nullness of rename_dst[i].pair is used as a cue
to skip the similarity computation in the expensive matrix part of
diffcore_rename()?  That comes from find_exact_renames() that is
called earlier in the function.  I would imagine that your logic
would fit _before_ we call find_exact_renames() as a call to a new
helper function (e.g. record_explicit_renames() perhaps).
Anything that reduces the cost in the matrix part should come
earlier, as that reduces the number of pairs we would need to try
matching up.

We might want to introduce a way to express the similarity score for
such a filepair that was manually constructed when showing the
result, though.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [WIP PATCH] Manual rename correction

2012-07-31 Thread Jeff King
On Tue, Jul 31, 2012 at 09:32:49AM -0700, Junio C Hamano wrote:

 Nguyen Thai Ngoc Duy pclo...@gmail.com writes:
 
  The above output is done with git diff --manual-rename=foo A B
  and foo contains (probably not in the best format though)
 
  -- 8 --
  attr.c dir.c
  dir.c attr.c
  -- 8 --
  ...
  Comments?
 
 It is a good direction to go in, I would think, to give users a way
 to explicitly tell that in comparison between these two trees, I
 know path B in the postimage corresponds to path A in the preimage.

I do not think that is the right direction. Let's imagine that I have a
commit A and I annotate it (via notes or whatever) to say between
A^^{tree} and A^{tree}, foo.c became bar.c. That will help me when
doing git show or git log. But it will not help me when I later try
to merge A (or its descendent). In that case, I will compute the diff
between A and the merge-base (or worse, some descendent of A and the
merge-base), and I will miss this hint entirely.

A much better hint is to annotate pairs of sha1s, to say do not bother
doing inexact rename correlation on this pair; I promise that they have
value N. Then it will find that pair no matter which trees or commits
are being diffed, and it will do so relatively inexpensively[1].

That is not fool-proof, of course. You might have a manual rename from
sha1 X to sha1 Y, and then a slight modification to Y to make Z. So you
would want some kind of transitivity to notice that X and Z correlate.
I think you could model it as a graph problem; sha1s are nodes, and each
this is a rename pair of annotated sha1s has an edge between them.
They are the same file if there is a path.

Of course that gives you bizarre and counter-intuitive results, because
X and Z might not actually be that similar. And that is why we have
rename detection in the first place. The idea of file identity (which
this fundamentally is) leads to these sorts of weird results.

I'm sure you could get better results by weakening the transitivity
according to the rename score, or something like that. But now you are
getting pretty complex.

-Peff

[1] We could actually cache rename results by storing pairs of sha1s
along with their rename score, and should be able to get a good
speedup (we are still src*dst in comparing, but now the comparison
is a simple table lookup rather than loading the blobs and computing
the differences). If we had such a cache, then manually marking a
rename would just be a matter of priming the cache with your manual
entries.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [WIP PATCH] Manual rename correction

2012-07-31 Thread Junio C Hamano
Jeff King p...@peff.net writes:

 A much better hint is to annotate pairs of sha1s, to say do not bother
 doing inexact rename correlation on this pair; I promise that they have
 value N.

Surely.  And I suspect that the patch to the current codebase to do
so would be much less impact if you go that way.

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [WIP PATCH] Manual rename correction

2012-07-31 Thread Jeff King
On Tue, Jul 31, 2012 at 01:20:42PM -0700, Junio C Hamano wrote:

 Jeff King p...@peff.net writes:
 
  A much better hint is to annotate pairs of sha1s, to say do not bother
  doing inexact rename correlation on this pair; I promise that they have
  value N.
 
 Surely.  And I suspect that the patch to the current codebase to do
 so would be much less impact if you go that way.

Yes. You may remember I wrote a generic caching subsystem last summer
when we were talking about caching commit generations. Plugging in a new
map type to map sha1 pairs to 32-bit integers was pretty simple, and
that gives the basis for a rename cache.

It's fairly unimpressive on git.git. My best-of-five for git log
--format=%H --raw -M went from 5.83s to 5.74s, which is pretty much
within the run-to-run noise. The resulting cache was 155K.

However, it's easy to come up with much more pathological cases. I have
a really ugly rename-and-tweak-tags commit on my photo repository, and
those blobs are relatively big. My timings for git show on that were:

  before: 49.724s
  after, run 1: 54.904s
  after, run 2:  0.117s

Which is pretty darn nice. The resulting cache is 5.3M (the repository
itself is in the gigabytes, but that's not really relevant; the cache
will obviously scale with the number of paths, not with the size of the
blobs).

It would also work for copies, too, of course. Here are the results of
git log --format=%H --raw -M -C -C on git.git:

  before: 1m35s
  after, run 1: 39.7s
  after, run 2: 39.5s

So it does make much more of a difference for copies, which is obvious;
git is doing a lot more work for us to cache. At the same time, our
cache is much bigger: 32M. Yikes.

My cache is fairly naive, in that it literally stores 44 bytes of
src_sha1, dst_sha1, score for each entry. At the cost of more
complexity, you could store each src_sha1 once, followed by a set of
dst_sha1, score pairs. I also didn't take any special care to avoid
duplicates of X, Y and Y, X (since presumably these renames would be
commutative). I'm not sure it is necessary, though; I think the copy
machinery already suppresses this when entries are in both source and
destination lists.

So I don't know. It can definitely speed up some operations, but at the
cost of a non-trivial cache on disk. I'll spare you all of the generic
caching infrastructure, but the actual patch to rename looks like this
(just to give a sense of where the hooks go):

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 216a7a4..db70878 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -6,6 +6,7 @@
 #include diffcore.h
 #include hash.h
 #include progress.h
+#include metadata-cache.h
 
 /* Table of rename/copy destinations */
 
@@ -137,7 +138,8 @@ static int estimate_similarity(struct diff_filespec *src,
 */
unsigned long max_size, delta_size, base_size, src_copied, 
literal_added;
unsigned long delta_limit;
-   int score;
+   uint32_t score;
+   struct sha1pair pair;
 
/* We deal only with regular files.  Symlink renames are handled
 * only when they are exact matches --- in other words, no edits
@@ -175,6 +177,11 @@ static int estimate_similarity(struct diff_filespec *src,
if (max_size * (MAX_SCORE-minimum_score)  delta_size * MAX_SCORE)
return 0;
 
+   hashcpy(pair.one, src-sha1);
+   hashcpy(pair.two, dst-sha1);
+   if (rename_cache_get(pair, score))
+   return score;
+
if (!src-cnt_data  diff_populate_filespec(src, 0))
return 0;
if (!dst-cnt_data  diff_populate_filespec(dst, 0))
@@ -195,6 +202,7 @@ static int estimate_similarity(struct diff_filespec *src,
score = 0; /* should not happen */
else
score = (int)(src_copied * MAX_SCORE / max_size);
+   rename_cache_set(pair, score);
return score;
 }
 
-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [WIP PATCH] Manual rename correction

2012-07-31 Thread Nguyen Thai Ngoc Duy
On Wed, Aug 1, 2012 at 2:23 AM, Jeff King p...@peff.net wrote:
 It is a good direction to go in, I would think, to give users a way
 to explicitly tell that in comparison between these two trees, I
 know path B in the postimage corresponds to path A in the preimage.

 I do not think that is the right direction. Let's imagine that I have a
 commit A and I annotate it (via notes or whatever) to say between
 A^^{tree} and A^{tree}, foo.c became bar.c. That will help me when
 doing git show or git log. But it will not help me when I later try
 to merge A (or its descendent). In that case, I will compute the diff
 between A and the merge-base (or worse, some descendent of A and the
 merge-base), and I will miss this hint entirely.

 A much better hint is to annotate pairs of sha1s, to say do not bother
 doing inexact rename correlation on this pair; I promise that they have
 value N.

I haven't had time to think it through yet but I throw my thoughts in
any way. I actually went with your approach first. But it's more
difficult to control the renaming. Assume we want to tell git to
rename SHA-1 A to SHA-1 B. What happens if we have two As in the
source tree and two Bs in the target tree? What happens if two As and
one B, or one A and two Bs? What if a user defines A - B and A - C,
and we happen to have two As in source tree and B and C in target
tree?

There's also the problem with transferring this information. With
git-notes I think I can transfer it (though not automatically). How do
we transfer sha1 map (that you mentioned in the commit generation mail
in this thread)?

 Then it will find that pair no matter which trees or commits
 are being diffed, and it will do so relatively inexpensively[1].

But does that happen often in practice? I mean diff-ing two arbitrary
trees and expect rename correction. I disregarded it as git log is
my main case, but I'm just a single user..
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [WIP PATCH] Manual rename correction

2012-07-31 Thread Nguyen Thai Ngoc Duy
On Wed, Aug 1, 2012 at 9:01 AM, Jeff King p...@peff.net wrote:
 On Wed, Aug 01, 2012 at 08:10:12AM +0700, Nguyen Thai Ngoc Duy wrote:

  I do not think that is the right direction. Let's imagine that I have a
  commit A and I annotate it (via notes or whatever) to say between
  A^^{tree} and A^{tree}, foo.c became bar.c. That will help me when
  doing git show or git log. But it will not help me when I later try
  to merge A (or its descendent). In that case, I will compute the diff
  between A and the merge-base (or worse, some descendent of A and the
  merge-base), and I will miss this hint entirely.
 
  A much better hint is to annotate pairs of sha1s, to say do not bother
  doing inexact rename correlation on this pair; I promise that they have
  value N.

 I haven't had time to think it through yet but I throw my thoughts in
 any way. I actually went with your approach first. But it's more
 difficult to control the renaming. Assume we want to tell git to
 rename SHA-1 A to SHA-1 B. What happens if we have two As in the
 source tree and two Bs in the target tree? What happens if two As and
 one B, or one A and two Bs? What if a user defines A - B and A - C,
 and we happen to have two As in source tree and B and C in target
 tree?

 Yes, it disregards path totally. But if you had the exact same movement
 of content from one path to another in one instance, and it is
 considered a rename, wouldn't it also be a rename in a second instance?

Yes. This is probably cosmetics only, but without path information, we
leave it to chance to decide which A to pair with B and C (in the
A-B, A-C case above). Wrong path might lead to funny effects (i'm
thinking of git log --follow).

 There's also the problem with transferring this information. With
 git-notes I think I can transfer it (though not automatically). How do
 we transfer sha1 map (that you mentioned in the commit generation mail
 in this thread)?

I wasn't clear. This is about transferring info across repositories.

 That is orthogonal to the issue of what is being stored. I chose my
 mmap'd disk implementation because it is very fast, which makes it nice
 for a performance cache. But you could store the same thing in git-notes
 (indexed by dst sha1, I guess, and then pointing to a blob of (src,
 score) pairs.

 If you want to include path-based hints in a commit, I'd say that using
 some micro-format in the commit message would be the simplest thing.

Rename correction is after the commit is created. I don't think we can
recreate commits.

 But
 that has been discussed before; ultimately the problem is that it only
 covers _one_ diff that we do with that commit (it is probably the most
 common, of course, but it doesn't cover them all).

How about we generate sha1 mapping from commit hints? We try to take
advantage of path hints when we can. Else we fall back to sha-1
mapping. This way we can transfer commit hints as git-notes to another
repo, then regenerate sha-1 mapping there. No need to transfer sha1
maps.

  Then it will find that pair no matter which trees or commits
  are being diffed, and it will do so relatively inexpensively[1].

 But does that happen often in practice? I mean diff-ing two arbitrary
 trees and expect rename correction. I disregarded it as git log is
 my main case, but I'm just a single user..

 It happens every time merge-recursive does rename detection, which
 includes git merge but also things like cherry-pick.

Thanks. I'll look into merge/cherry-pick.
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html