Re: [WIP PATCH] Manual rename correction
On Thu, Aug 2, 2012 at 4:27 AM, Jeff King p...@peff.net wrote: Yes, if you go with a commit-based approach, you can do either notes or in-commit messages. In other words, I would break the solutions down as: 1. Store sha1+sha1 - score mapping (i.e., what I suggested). This is fundamentally a global store, not a per-commit store. For storage, you can do one (or a combination) of: a. Store the mapping in some local file. Fast, but can't be shared. b. Store the mapping in a note (probably indexed by the destination blob sha1). Less efficient, but easy to share. I implemented (1a). Implementing (1b) would be easy, but for a full-on cache (especially for -C), I think the resulting size might be prohibitive. (1a) is good regardless rename overrides. Why don't you polish and submit it? We can set some criteria to limit the cache size while keeping computation reasonably low. Caching rename scores for file pairs that has file size larger than a limit is one. Rename matrix size could also be a candidate. We could even cache just rename scores for recent commits (i.e. close to heads) only with the assumption that people diff/apply recent commits more often. All solutions under (2) suffer from the same problem: they are accurate only for a single diff. For other diffs, you would either have to not use the feature, or you would be stuck traversing the history and assigning a temporary file identity (e.g., given commits A-B-C, and in A-B we rename foo to bar, the diff between A and C could discover that A's foo corresponds to C's bar). Yeah. If we go with manual overrides, I expect users to deal with these manually too. IOW they'll need to create a mapping for A-C themselves. We can help detect that there are manual overrides in some cases, like merge, and let users know that manual overrides are ignored. For merge, I think we can just check for all commits while traversing looking for bases. For this reason, I'm not sure that stored overrides like this are generally useful in the long run. I think storage is useful for _caching_ the results, because it doesn't have to be perfect; it just helps with some repetitive queries. Whereas for overriding, I think it is much more interesting to override _particular_ diff. E.g., to say I am merging X and Y, and please pretend that Y renamed foo to bar when you do rename detection. And in that sense, your git log example can be considered a special-case of this: you are saying that the diff from $commit to $commit^ is done frequently, so rather than saying please pretend... each time, you would like to store the information forever. And storing it in the commit message or a note is one way of doing that. Yep, specifying rename overrides between two trees is probably better. I don't think there's anything fundamentally _wrong_ with that, but I kind of question its usefulness. In other words, what is the point in doing so? If it is inform the user that semantically the commit did a rename, even though the content changed enough that rename detection does not find it, then I would argue that you should simply state it in the commit message (or in a human-readable git-note, if it was only realized after the fact). But there is not much point in making it machine-readable, since the interesting machine-readable things we do with renames are: 1. Show the diff against the rename src, which can often be easier to read. Except that if rename detection did not find it, it is probably _not_ going to be easier to read. Probably. Still it helps git log --follow to follow the correct track in the 1% case that rename detection does go wrong. 2. Applying content to the destination of a merge. But you're almost never doing the diff between a commit and its parent, so the information would be useless. Having a way to interfere rename detection, even manually, could be good in this case if it reduces conflicts. We could feed rename overrides using command line. -- Duy -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [WIP PATCH] Manual rename correction
On Wed, Aug 01, 2012 at 03:10:55PM -0700, Junio C Hamano wrote: Jeff King p...@peff.net writes: On Tue, Jul 31, 2012 at 11:01:27PM -0700, Junio C Hamano wrote: ... As we still have the pathname in this codepath, I am wondering if we would benefit from custom content hash that knows the nature of payload than the built-in similarity estimator, driven by the attribute mechanism (if the latter is the case, that is). Hmm. Interesting. But I don't think that attributes are a good fit here. They are pathname based, so how do I apply anything related to similarity of a particular version by pathname? IOW, how does it apply in one tree but not another? When you move porn/0001.jpg in the preimage to naughty/1.jpg in the postimage, they both can hit *.jpg contentid=jpeg line in the top-level .gitattribute file, and the contentid driver for jpeg type may strip exif and hash the remainder bits in the image to come up with a token you can use in a similar way as object ID is used in the exact rename detection phase. Just thinking aloud. Ah, I see. That still feels like way too specific a use case to me. A much more general use case to me would be a contentid driver which splits the file into multiple chunks (which can be concatenated to arrive at the original content), and marks chunks as OK to delta or not able to delta. In other words, a content-specific version of the bup-style splitting that people have proposed. Assuming we split a jpeg into its EXIF bits (+delta) and its image bits (-delta), then you could do a fast rename or pack-objects comparison between two such files (in fact, with chunked object storage, pack-objects can avoid looking at the image parts at all). However, it may be the case that such smart splitting is not necessary, as stupid and generic bup-style splitting may be enough. I really need to start playing with the patches you wrote last year that started in that direction. -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [WIP PATCH] Manual rename correction
On Thu, Aug 02, 2012 at 07:08:25PM +0700, Nguyen Thai Ngoc Duy wrote: I implemented (1a). Implementing (1b) would be easy, but for a full-on cache (especially for -C), I think the resulting size might be prohibitive. (1a) is good regardless rename overrides. Why don't you polish and submit it? We can set some criteria to limit the cache size while keeping computation reasonably low. Caching rename scores for file pairs that has file size larger than a limit is one. Rename matrix size could also be a candidate. We could even cache just rename scores for recent commits (i.e. close to heads) only with the assumption that people diff/apply recent commits more often. I'll polish and share it. I'm still not 100% sure it's a good idea, because introducing an on-disk cache means we need to _manage_ that cache. How big will it be? Who will prune it when it gets too big? By what criteria? And so on. But if it's all hidden behind a config option, then it won't hurt people who don't use it. And people who do use it can gather data on how the caches grow. All solutions under (2) suffer from the same problem: they are accurate only for a single diff. For other diffs, you would either have to not use the feature, or you would be stuck traversing the history and assigning a temporary file identity (e.g., given commits A-B-C, and in A-B we rename foo to bar, the diff between A and C could discover that A's foo corresponds to C's bar). Yeah. If we go with manual overrides, I expect users to deal with these manually too. IOW they'll need to create a mapping for A-C themselves. We can help detect that there are manual overrides in some cases, like merge, and let users know that manual overrides are ignored. For merge, I think we can just check for all commits while traversing looking for bases. Yeah, merges are a special case, in that we know the diff we perform will always have a direct-ancestor relationship (since it is always between a tip and the merge base). But there is not much point in making it machine-readable, since the interesting machine-readable things we do with renames are: 1. Show the diff against the rename src, which can often be easier to read. Except that if rename detection did not find it, it is probably _not_ going to be easier to read. Probably. Still it helps git log --follow to follow the correct track in the 1% case that rename detection does go wrong. Thanks. I didn't think of --follow, but that is a good counterpoint to my argument. 2. Applying content to the destination of a merge. But you're almost never doing the diff between a commit and its parent, so the information would be useless. Having a way to interfere rename detection, even manually, could be good in this case if it reduces conflicts. We could feed rename overrides using command line. Yeah. I think I'd start with letting you feed pairs to diff_options, give it a command-line option to see how useful it is, and then later on consider a mechanism for extracting those pairs automatically from commits or notes. -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [WIP PATCH] Manual rename correction
Jeff King p...@peff.net writes: On Wed, Aug 01, 2012 at 03:10:55PM -0700, Junio C Hamano wrote: ... When you move porn/0001.jpg in the preimage to naughty/1.jpg in the postimage, they both can hit *.jpg contentid=jpeg line in the top-level .gitattribute file, and the contentid driver for jpeg type may strip exif and hash the remainder bits in the image to come up with a token you can use in a similar way as object ID is used in the exact rename detection phase. Just thinking aloud. Ah, I see. That still feels like way too specific a use case to me. A much more general use case to me would be a contentid driver which splits the file into multiple chunks (which can be concatenated to arrive at the original content), and marks chunks as OK to delta or not able to delta. In other words, a content-specific version of the bup-style splitting that people have proposed. Assuming we split a jpeg into its EXIF bits (+delta) and its image bits (-delta), then you could do a fast rename or pack-objects comparison between two such files (in fact, with chunked object storage, pack-objects can avoid looking at the image parts at all). However, it may be the case that such smart splitting is not necessary, as stupid and generic bup-style splitting may be enough. I really need to start playing with the patches you wrote last year that started in that direction. I wasn't interested in packing split object representation, actually. The idea was still within the context of rename. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [WIP PATCH] Manual rename correction
On Thu, Aug 02, 2012 at 03:51:17PM -0700, Junio C Hamano wrote: On Wed, Aug 01, 2012 at 03:10:55PM -0700, Junio C Hamano wrote: ... When you move porn/0001.jpg in the preimage to naughty/1.jpg in the postimage, they both can hit *.jpg contentid=jpeg line in the top-level .gitattribute file, and the contentid driver for jpeg type may strip exif and hash the remainder bits in the image to come up with a token you can use in a similar way as object ID is used in the exact rename detection phase. Just thinking aloud. Ah, I see. That still feels like way too specific a use case to me. A much more general use case to me would be a contentid driver which splits the file into multiple chunks (which can be concatenated to arrive at the original content), and marks chunks as OK to delta or not able to delta. In other words, a content-specific version of the bup-style splitting that people have proposed. Assuming we split a jpeg into its EXIF bits (+delta) and its image bits (-delta), then you could do a fast rename or pack-objects comparison between two such files (in fact, with chunked object storage, pack-objects can avoid looking at the image parts at all). However, it may be the case that such smart splitting is not necessary, as stupid and generic bup-style splitting may be enough. I really need to start playing with the patches you wrote last year that started in that direction. I wasn't interested in packing split object representation, actually. The idea was still within the context of rename. But it would work for rename, too. If you want to compare two files, the driver would give you back { sha1_exif (+delta), sha1_image (-delta) } for each file. You know the size of each chunk and the size of the total file. Then you would just compare sha1_image for each entry. If they match, then you have a lower bound on similarity of image_chunk_size / total_size. If they don't, then you have an upper bound of similarity of 1-(image_chunk_size/total_size). In the former case, you can get the exact similarity by doing a real delta on the sha1_exif content. In the latter case, you can either exit early (if you are already below the similarity threshold, which is likely), or possibly do the delta on the sha1_exif content to get an exact value. But either way, you never had to do a direct comparison between the big image data; you only needed to know the sha1s. And as a bonus, if you did want to cache results, you can have an O(# of blobs) cache of the chunked sha1s of the chunked form (because the information is immutable for a given sha1 and content driver). Whereas by caching the result of estimate_similarity, our worst-case cache is the square of that (because we are storing sha1 pairs). -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [WIP PATCH] Manual rename correction
Jeff King p...@peff.net writes: @@ -175,6 +177,11 @@ static int estimate_similarity(struct diff_filespec *src, if (max_size * (MAX_SCORE-minimum_score) delta_size * MAX_SCORE) return 0; + hashcpy(pair.one, src-sha1); + hashcpy(pair.two, dst-sha1); + if (rename_cache_get(pair, score)) + return score; + Random thoughts. Even though your rename cache could be used to reject pairing that the similarity estimator would otherwise give high score, I would imagine that in practice, people would always use the mechanism to boost the similarity score of desired pairing. This conjecture has a few interesting implications. - As we track of only the top NUM_CANDIDATE_PER_DST rename src for each dst (see record_if_better()), you should be able to first see if pairs that have dst exist in your rename cache, and iterate over the src,dst pairs, filling m[] with srcs that appear in this particular invocation of diff. - If you find NUM_CANDIDATE_PER_DST srcs from your rename cache, you wouldn't have to run estimate_similarity() at all, but that is very unlikely. We could however declare that user configured similarity boost always wins computed ones, and skip estimation for a dst for which you find an entry in the rename cache. - As entries in rename cache that record high scores have names of similar blobs, pack-objects may be able to take advantage of this information. - If you declare blobs A and B are similar, it is likely that blobs C, D, E, ... that are created by making a series of small tweaks to B are also similar. Would it make more sense to introduce a concept of set of similar blobs instead of recording pairwise scores for (A,B), (A,C), (A,D), ... (B,C), (B,D), ...? If so, the body of per-dst loop in diffcore_rename() may become: if (we know where dst came from) continue; if (dst belongs to a known blob family) { for (each src in rename_src[]) { if (src belongs to the same blob family as dst) record it in m[]; } } if (the above didn't record anything in m[]) { ... existing estimate_similarity() code ... } Regarding your rename-and-tweak-exif photo sets, is the issue that there are too many rename src/dst candidates and filling a large matrix takes a lot of time, or tweaking exif makes the contents unnecessarily dissimilar and causes the similarity detection to fail? As we still have the pathname in this codepath, I am wondering if we would benefit from custom content hash that knows the nature of payload than the built-in similarity estimator, driven by the attribute mechanism (if the latter is the case, that is). -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [WIP PATCH] Manual rename correction
Nguyen Thai Ngoc Duy pclo...@gmail.com writes: Yes. This is probably cosmetics only, but without path information, we leave it to chance to decide which A to pair with B and C (in the A-B, A-C case above). Wrong path might lead to funny effects (i'm thinking of git log --follow). Isn't that why A,B and A,C can have different scores per object name pair? And if you mean by B and C the paths not object names, and the blob at B and C are indeed identical, why would it matter? There's also the problem with transferring this information. With git-notes I think I can transfer it (though not automatically). How do we transfer sha1 map (that you mentioned in the commit generation mail in this thread)? I wasn't clear. This is about transferring info across repositories. It is simple to define a portable format to record Jeff's rename cache database (it could be a text blob full of x40 x40 score lines), point such a blob with a special ref, and write a Porcelain to transfer and merge across repositories, e.g. git fetch $there refs/rename-cache new=$( ( git cat-file blob refs/rename-cache git cat-file blob FETCH_HEAD ) | sort | uniq | git hash-object -w --stdin) git update-ref refs/rename-cache $new And as Jeff said, that is orthogonal to the issue of what is being stored. The contents of refs/rename-cache blob could be compiled into a binary form for quicker access at runtime in the Porcelain. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [WIP PATCH] Manual rename correction
On Wed, Aug 1, 2012 at 1:09 PM, Junio C Hamano gits...@pobox.com wrote: Nguyen Thai Ngoc Duy pclo...@gmail.com writes: Yes. This is probably cosmetics only, but without path information, we leave it to chance to decide which A to pair with B and C (in the A-B, A-C case above). Wrong path might lead to funny effects (i'm thinking of git log --follow). Isn't that why A,B and A,C can have different scores per object name pair? And if you mean by B and C the paths not object names, and the blob at B and C are indeed identical, why would it matter? I don't see how scores affect that. Suppose I have two trees and a rename-cache file: $ git ls-tree -r HEAD^ 100644 blob d00491path1/foo 100644 blob d00491path2/bar $ git ls-tree -r HEAD 100644 blob 0cfbf0path1/fooo 100644 blob 00750epath2/barr $ cat rename-cache d00491 0cfbf0 score 1 d00491 00750e score 2 How can I be sure git diff HEAD^! will rename path1/foo = path1/fooo and path2/bar = path2/barr, not path1/foo = path2/barr and path2/bar = path1/fooo? -- Duy -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [WIP PATCH] Manual rename correction
On Wed, Aug 01, 2012 at 01:34:23PM +0700, Nguyen Thai Ngoc Duy wrote: On Wed, Aug 1, 2012 at 1:09 PM, Junio C Hamano gits...@pobox.com wrote: Nguyen Thai Ngoc Duy pclo...@gmail.com writes: Yes. This is probably cosmetics only, but without path information, we leave it to chance to decide which A to pair with B and C (in the A-B, A-C case above). Wrong path might lead to funny effects (i'm thinking of git log --follow). Isn't that why A,B and A,C can have different scores per object name pair? And if you mean by B and C the paths not object names, and the blob at B and C are indeed identical, why would it matter? I don't see how scores affect that. Suppose I have two trees and a rename-cache file: $ git ls-tree -r HEAD^ 100644 blob d00491path1/foo 100644 blob d00491path2/bar $ git ls-tree -r HEAD 100644 blob 0cfbf0path1/fooo 100644 blob 00750epath2/barr $ cat rename-cache d00491 0cfbf0 score 1 d00491 00750e score 2 How can I be sure git diff HEAD^! will rename path1/foo = path1/fooo and path2/bar = path2/barr, not path1/foo = path2/barr and path2/bar = path1/fooo? You can't. A pathless mapping can never represent that. I think my (and Junio's argument) is more like: why would you care? And that gets back to the statement I just made elsewhere in the thread. Which is that you need to consider the purpose of the rename detection. For generating diffs and for merging, it doesn't matter in the above case. If the point is to communicate some semantics of the change (e.g., if your change was double the final letter of each filename, and also make a small change), then that is what the commit message is for. -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [WIP PATCH] Manual rename correction
On Tue, Jul 31, 2012 at 11:01:27PM -0700, Junio C Hamano wrote: @@ -175,6 +177,11 @@ static int estimate_similarity(struct diff_filespec *src, if (max_size * (MAX_SCORE-minimum_score) delta_size * MAX_SCORE) return 0; + hashcpy(pair.one, src-sha1); + hashcpy(pair.two, dst-sha1); + if (rename_cache_get(pair, score)) + return score; + Random thoughts. Even though your rename cache could be used to reject pairing that the similarity estimator would otherwise give high score, I would imagine that in practice, people would always use the mechanism to boost the similarity score of desired pairing. This conjecture has a few interesting implications. I agree that is the probable use. But low scores might also have a purpose in preventing useless work from being done (IOW, storing a 0 score is unlikely to happen manually, but it does let us avoid repeating the calculation). As you might have guessed, I am really more interested in caching and performance gains than in manual overrides. - As we track of only the top NUM_CANDIDATE_PER_DST rename src for each dst (see record_if_better()), you should be able to first see if pairs that have dst exist in your rename cache, and iterate over the src,dst pairs, filling m[] with srcs that appear in this particular invocation of diff. Yes, although I don't know if it makes a big difference. We still run estimate_similarity on the candidates to see if they make it into record_if_better, and that is the expensive part. I don't think you want to change that either (e.g., by assuming that cached scores will always be in the top N; it would depend on the particular tree that the cached scores came from). Of course, that changes if it is not a cache at all, and simply a manual override. Or if manual overrides are stored separately and take effect first. But I do not think you even want to care about NUM_CANDIDATE_PER_DST for manual overrides. You simply want to take the highest scored one, consider the it done, and not run estimate_similarity at all. Even if you only had 1 (and NUM_CANDIDATE_PER_DST were 4), there is still no point in looking at the others. The manual override should beat it, regardless of score. - If you find NUM_CANDIDATE_PER_DST srcs from your rename cache, you wouldn't have to run estimate_similarity() at all, but that is very unlikely. We could however declare that user configured similarity boost always wins computed ones, and skip estimation for a dst for which you find an entry in the rename cache. Right. See above. - As entries in rename cache that record high scores have names of similar blobs, pack-objects may be able to take advantage of this information. Yeah, although I suspect it is not as big a win as you might hope. In practice, you're going to diff a lot fewer objects than pack-objects would consider, so most of the pack-objects candidate pairs will not have a cache entry. Which means that you really need to not rely on the cache (since it is very likely that the best candidate is still to be found), and you still do lots of computation. You could cache the result of comparisons done by pack-objects, but that cache ends up being large. You might remember that one of my very first patches was a cache for recording negative delta finds (so we try to delta A against B and find that it is not a good match, and record that information). Even that cache was huge, and we ended up discarding it in favor of Linus's much more sensible if two things are in the same pack and not delta'd, then either they are not a good match, or something else is in better heuristic. But one with full-on similarity scores would be even bigger. - If you declare blobs A and B are similar, it is likely that blobs C, D, E, ... that are created by making a series of small tweaks to B are also similar. Would it make more sense to introduce a concept of set of similar blobs instead of recording pairwise scores for (A,B), (A,C), (A,D), ... (B,C), (B,D), ...? If so, the body of per-dst loop in diffcore_rename() may become: Yes, this is the transitive bit I mentioned elsewhere. I think there are dragons lurking there, as you end up with the edges of the set not _really_ being that close to each other. You'd probably want some kind of weighted association, like if A-B had 80% similarity, and B-C had 90% similarity, then A-C would be 72%. But even that is not accurate; it is just a lower bound. The differences between A-B and B-C might overlap, and A-C might be much more similar. You can't know the true value without actually finding the score for A-C. And since we output the scores, they should probably be accurate. Regarding your rename-and-tweak-exif photo sets, is the issue that there are too many rename src/dst candidates and filling a large matrix takes a lot of time, or tweaking exif makes the contents unnecessarily
Re: [WIP PATCH] Manual rename correction
Jeff King p...@peff.net writes: On Tue, Jul 31, 2012 at 11:01:27PM -0700, Junio C Hamano wrote: - As entries in rename cache that record high scores have names of similar blobs, pack-objects may be able to take advantage of this information. Yeah, although I suspect it is not as big a win as you might hope. In practice, you're going to diff a lot fewer objects than pack-objects would consider, so most of the pack-objects candidate pairs will not have a cache entry. Which means that you really need to not rely on the cache (since it is very likely that the best candidate is still to be found), and you still do lots of computation. You could cache the result of comparisons done by pack-objects, but that cache ends up being large. You might remember that one of my very first patches was a cache for recording negative delta finds (so we try to delta A against B and find that it is not a good match, and record that information). Even that cache was huge, and we ended up discarding it in favor of Linus's much more sensible if two things are in the same pack and not delta'd, then either they are not a good match, or something else is in better heuristic. But one with full-on similarity scores would be even bigger. Yeah, I forgot about your negative cache. Also the older companion heuristics if a delta and its base are in an existing pack, use that delta works well enough to make a positive cache unnecessary (and the rename similarity cache can only be a subset of such a cache). C.f. http://thread.gmane.org/gmane.comp.version-control.git/16223/focus=16267 - If you declare blobs A and B are similar, it is likely that blobs C, D, E, ... that are created by making a series of small tweaks to B are also similar. Would it make more sense to introduce a concept of set of similar blobs instead of recording pairwise scores for (A,B), (A,C), (A,D), ... (B,C), (B,D), ...? If so, the body of per-dst loop in diffcore_rename() may become: Yes, this is the transitive bit I mentioned elsewhere. I think there are dragons lurking there, as you end up with the edges of the set not _really_ being that close to each other. You'd probably want some kind of weighted association, like if A-B had 80% similarity, and B-C had 90% similarity, then A-C would be 72%. I recall we discussed the transitivity when we were designing the basic rename thing, and rejected it after we realized that it was unworkable. You may have a blob A, perform a large-ish edit to produce B, and A and B may be similar by only 60%. You then may start from the same A, perform another large-ish edit to produce C, and A and C may be similar by only 55%. Depending on the nature of these two large-ish changes, B and C may be totally dissimilar, or if the changes are more or less in the same direction, they may be 95% similar. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [WIP PATCH] Manual rename correction
Nguyen Thai Ngoc Duy pclo...@gmail.com writes: The above output is done with git diff --manual-rename=foo A B and foo contains (probably not in the best format though) -- 8 -- attr.c dir.c dir.c attr.c -- 8 -- ... Comments? It is a good direction to go in, I would think, to give users a way to explicitly tell that in comparison between these two trees, I know path B in the postimage corresponds to path A in the preimage. I however wonder why you did this as a separate function that only does the explicitly marked ones. Probably it was easier as a POC to do it this way, and that is fine. The real version should do this in the same diffcore_rename() function, by excluding the paths that the user explicitly told you about from the the automatic matching logic, and instead matching them up manually; then you can let the remainder of the paths be paired by the existing code. Notice how the non-nullness of rename_dst[i].pair is used as a cue to skip the similarity computation in the expensive matrix part of diffcore_rename()? That comes from find_exact_renames() that is called earlier in the function. I would imagine that your logic would fit _before_ we call find_exact_renames() as a call to a new helper function (e.g. record_explicit_renames() perhaps). Anything that reduces the cost in the matrix part should come earlier, as that reduces the number of pairs we would need to try matching up. We might want to introduce a way to express the similarity score for such a filepair that was manually constructed when showing the result, though. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [WIP PATCH] Manual rename correction
On Tue, Jul 31, 2012 at 09:32:49AM -0700, Junio C Hamano wrote: Nguyen Thai Ngoc Duy pclo...@gmail.com writes: The above output is done with git diff --manual-rename=foo A B and foo contains (probably not in the best format though) -- 8 -- attr.c dir.c dir.c attr.c -- 8 -- ... Comments? It is a good direction to go in, I would think, to give users a way to explicitly tell that in comparison between these two trees, I know path B in the postimage corresponds to path A in the preimage. I do not think that is the right direction. Let's imagine that I have a commit A and I annotate it (via notes or whatever) to say between A^^{tree} and A^{tree}, foo.c became bar.c. That will help me when doing git show or git log. But it will not help me when I later try to merge A (or its descendent). In that case, I will compute the diff between A and the merge-base (or worse, some descendent of A and the merge-base), and I will miss this hint entirely. A much better hint is to annotate pairs of sha1s, to say do not bother doing inexact rename correlation on this pair; I promise that they have value N. Then it will find that pair no matter which trees or commits are being diffed, and it will do so relatively inexpensively[1]. That is not fool-proof, of course. You might have a manual rename from sha1 X to sha1 Y, and then a slight modification to Y to make Z. So you would want some kind of transitivity to notice that X and Z correlate. I think you could model it as a graph problem; sha1s are nodes, and each this is a rename pair of annotated sha1s has an edge between them. They are the same file if there is a path. Of course that gives you bizarre and counter-intuitive results, because X and Z might not actually be that similar. And that is why we have rename detection in the first place. The idea of file identity (which this fundamentally is) leads to these sorts of weird results. I'm sure you could get better results by weakening the transitivity according to the rename score, or something like that. But now you are getting pretty complex. -Peff [1] We could actually cache rename results by storing pairs of sha1s along with their rename score, and should be able to get a good speedup (we are still src*dst in comparing, but now the comparison is a simple table lookup rather than loading the blobs and computing the differences). If we had such a cache, then manually marking a rename would just be a matter of priming the cache with your manual entries. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [WIP PATCH] Manual rename correction
Jeff King p...@peff.net writes: A much better hint is to annotate pairs of sha1s, to say do not bother doing inexact rename correlation on this pair; I promise that they have value N. Surely. And I suspect that the patch to the current codebase to do so would be much less impact if you go that way. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [WIP PATCH] Manual rename correction
On Tue, Jul 31, 2012 at 01:20:42PM -0700, Junio C Hamano wrote: Jeff King p...@peff.net writes: A much better hint is to annotate pairs of sha1s, to say do not bother doing inexact rename correlation on this pair; I promise that they have value N. Surely. And I suspect that the patch to the current codebase to do so would be much less impact if you go that way. Yes. You may remember I wrote a generic caching subsystem last summer when we were talking about caching commit generations. Plugging in a new map type to map sha1 pairs to 32-bit integers was pretty simple, and that gives the basis for a rename cache. It's fairly unimpressive on git.git. My best-of-five for git log --format=%H --raw -M went from 5.83s to 5.74s, which is pretty much within the run-to-run noise. The resulting cache was 155K. However, it's easy to come up with much more pathological cases. I have a really ugly rename-and-tweak-tags commit on my photo repository, and those blobs are relatively big. My timings for git show on that were: before: 49.724s after, run 1: 54.904s after, run 2: 0.117s Which is pretty darn nice. The resulting cache is 5.3M (the repository itself is in the gigabytes, but that's not really relevant; the cache will obviously scale with the number of paths, not with the size of the blobs). It would also work for copies, too, of course. Here are the results of git log --format=%H --raw -M -C -C on git.git: before: 1m35s after, run 1: 39.7s after, run 2: 39.5s So it does make much more of a difference for copies, which is obvious; git is doing a lot more work for us to cache. At the same time, our cache is much bigger: 32M. Yikes. My cache is fairly naive, in that it literally stores 44 bytes of src_sha1, dst_sha1, score for each entry. At the cost of more complexity, you could store each src_sha1 once, followed by a set of dst_sha1, score pairs. I also didn't take any special care to avoid duplicates of X, Y and Y, X (since presumably these renames would be commutative). I'm not sure it is necessary, though; I think the copy machinery already suppresses this when entries are in both source and destination lists. So I don't know. It can definitely speed up some operations, but at the cost of a non-trivial cache on disk. I'll spare you all of the generic caching infrastructure, but the actual patch to rename looks like this (just to give a sense of where the hooks go): diff --git a/diffcore-rename.c b/diffcore-rename.c index 216a7a4..db70878 100644 --- a/diffcore-rename.c +++ b/diffcore-rename.c @@ -6,6 +6,7 @@ #include diffcore.h #include hash.h #include progress.h +#include metadata-cache.h /* Table of rename/copy destinations */ @@ -137,7 +138,8 @@ static int estimate_similarity(struct diff_filespec *src, */ unsigned long max_size, delta_size, base_size, src_copied, literal_added; unsigned long delta_limit; - int score; + uint32_t score; + struct sha1pair pair; /* We deal only with regular files. Symlink renames are handled * only when they are exact matches --- in other words, no edits @@ -175,6 +177,11 @@ static int estimate_similarity(struct diff_filespec *src, if (max_size * (MAX_SCORE-minimum_score) delta_size * MAX_SCORE) return 0; + hashcpy(pair.one, src-sha1); + hashcpy(pair.two, dst-sha1); + if (rename_cache_get(pair, score)) + return score; + if (!src-cnt_data diff_populate_filespec(src, 0)) return 0; if (!dst-cnt_data diff_populate_filespec(dst, 0)) @@ -195,6 +202,7 @@ static int estimate_similarity(struct diff_filespec *src, score = 0; /* should not happen */ else score = (int)(src_copied * MAX_SCORE / max_size); + rename_cache_set(pair, score); return score; } -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [WIP PATCH] Manual rename correction
On Wed, Aug 1, 2012 at 2:23 AM, Jeff King p...@peff.net wrote: It is a good direction to go in, I would think, to give users a way to explicitly tell that in comparison between these two trees, I know path B in the postimage corresponds to path A in the preimage. I do not think that is the right direction. Let's imagine that I have a commit A and I annotate it (via notes or whatever) to say between A^^{tree} and A^{tree}, foo.c became bar.c. That will help me when doing git show or git log. But it will not help me when I later try to merge A (or its descendent). In that case, I will compute the diff between A and the merge-base (or worse, some descendent of A and the merge-base), and I will miss this hint entirely. A much better hint is to annotate pairs of sha1s, to say do not bother doing inexact rename correlation on this pair; I promise that they have value N. I haven't had time to think it through yet but I throw my thoughts in any way. I actually went with your approach first. But it's more difficult to control the renaming. Assume we want to tell git to rename SHA-1 A to SHA-1 B. What happens if we have two As in the source tree and two Bs in the target tree? What happens if two As and one B, or one A and two Bs? What if a user defines A - B and A - C, and we happen to have two As in source tree and B and C in target tree? There's also the problem with transferring this information. With git-notes I think I can transfer it (though not automatically). How do we transfer sha1 map (that you mentioned in the commit generation mail in this thread)? Then it will find that pair no matter which trees or commits are being diffed, and it will do so relatively inexpensively[1]. But does that happen often in practice? I mean diff-ing two arbitrary trees and expect rename correction. I disregarded it as git log is my main case, but I'm just a single user.. -- Duy -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [WIP PATCH] Manual rename correction
On Wed, Aug 1, 2012 at 9:01 AM, Jeff King p...@peff.net wrote: On Wed, Aug 01, 2012 at 08:10:12AM +0700, Nguyen Thai Ngoc Duy wrote: I do not think that is the right direction. Let's imagine that I have a commit A and I annotate it (via notes or whatever) to say between A^^{tree} and A^{tree}, foo.c became bar.c. That will help me when doing git show or git log. But it will not help me when I later try to merge A (or its descendent). In that case, I will compute the diff between A and the merge-base (or worse, some descendent of A and the merge-base), and I will miss this hint entirely. A much better hint is to annotate pairs of sha1s, to say do not bother doing inexact rename correlation on this pair; I promise that they have value N. I haven't had time to think it through yet but I throw my thoughts in any way. I actually went with your approach first. But it's more difficult to control the renaming. Assume we want to tell git to rename SHA-1 A to SHA-1 B. What happens if we have two As in the source tree and two Bs in the target tree? What happens if two As and one B, or one A and two Bs? What if a user defines A - B and A - C, and we happen to have two As in source tree and B and C in target tree? Yes, it disregards path totally. But if you had the exact same movement of content from one path to another in one instance, and it is considered a rename, wouldn't it also be a rename in a second instance? Yes. This is probably cosmetics only, but without path information, we leave it to chance to decide which A to pair with B and C (in the A-B, A-C case above). Wrong path might lead to funny effects (i'm thinking of git log --follow). There's also the problem with transferring this information. With git-notes I think I can transfer it (though not automatically). How do we transfer sha1 map (that you mentioned in the commit generation mail in this thread)? I wasn't clear. This is about transferring info across repositories. That is orthogonal to the issue of what is being stored. I chose my mmap'd disk implementation because it is very fast, which makes it nice for a performance cache. But you could store the same thing in git-notes (indexed by dst sha1, I guess, and then pointing to a blob of (src, score) pairs. If you want to include path-based hints in a commit, I'd say that using some micro-format in the commit message would be the simplest thing. Rename correction is after the commit is created. I don't think we can recreate commits. But that has been discussed before; ultimately the problem is that it only covers _one_ diff that we do with that commit (it is probably the most common, of course, but it doesn't cover them all). How about we generate sha1 mapping from commit hints? We try to take advantage of path hints when we can. Else we fall back to sha-1 mapping. This way we can transfer commit hints as git-notes to another repo, then regenerate sha-1 mapping there. No need to transfer sha1 maps. Then it will find that pair no matter which trees or commits are being diffed, and it will do so relatively inexpensively[1]. But does that happen often in practice? I mean diff-ing two arbitrary trees and expect rename correction. I disregarded it as git log is my main case, but I'm just a single user.. It happens every time merge-recursive does rename detection, which includes git merge but also things like cherry-pick. Thanks. I'll look into merge/cherry-pick. -- Duy -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html