Re: [PATCH 1/2] blame: large-scale performance rewrite
Shawn Pearce spea...@spearce.org writes: On Sat, Apr 26, 2014 at 10:30 AM, David Kastrup d...@gnu.org wrote: David Kastrup d...@gnu.org writes: Here's some example: dak@lola:/usr/local/tmp/wortliste$ time git blame -n -s wortliste /tmp/wl1 real15m47.118s user14m39.928s sys 1m1.872s Hah, this is quite the torture test. git before your patch is taking 22m11s on my laptop to compute this. (This was with default options, I noticed you passed -s to suppress the author formatting.) dak@lola:/usr/local/tmp/wortliste$ time ../git/git blame -n -s wortliste /tmp/wl2 real3m40.947s user2m40.296s sys 0m59.440s Meanwhile JGit computed in 4m30s on the same hardware. So I guess we are fine. At least the stuff I fixed with regard to performance would seem to be done right in JGit to start with. Its still not as fast as I want it to be. :-) Most of the diff data/CRC is computed over and over because of the blackbox use of xdiff. And then the delta-chain storage is packing stuff based on CRCs as well (not sure whether it keeps them around for unpacking). So there is a lot that could likely be improved while keeping the same basic algorithms, by cracking open the black boxes of the xdiff engine and the delta-chain coding. -- David Kastrup -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] blame: large-scale performance rewrite
On Sat, Apr 26, 2014 at 2:39 PM, David Kastrup d...@gnu.org wrote: At least the stuff I fixed with regard to performance would seem to be done right in JGit to start with. Hah! Its Java. We have to do things right, otherwise its too slow. :-) Its still not as fast as I want it to be. :-) Most of the diff data/CRC is computed over and over because of the blackbox use of xdiff. Yes, I have been thinking about this all week. JGit blame uses HistogramDiff by default instead of MyersDiff. The first stage after we trim common header/trailer from both files is to compute a hash of each line and store those hashes. Those hashes are discarded as the blame algorithm moves to the next commit. Clearly for a commit chain of A - B - C, the hashes computed at B for the A-B compare can be reused for the B-C compare. This is not the case in either git-core or JGit, because the diff algorithm is a block box to the blame algorithm. I think this is what you mean by the CRC being computed again. For any given compare blame has a list of regions it is interested in learning about from the diff algorithm. Anything outside of those regions is useless noise that will be discarded. I have been pondering pushing that region list down into the diff algorithm so it can avoid executing on sections that are not relevant to the caller. At least for HistogramDiff this makes some sense, the algorithm is recursively applied after it finds a longest common subsequence. If one side of the LCS is outside of the region of interest from blame, there is no value in recursing on that portion. If the blame region list covers a small enough portion, it may even make sense to avoid the common header/trailer elimination preprocessing step. Unfortunately that sounds hard as you could be working with a file like a ChangeLog which grows on one of those sides. And then the delta-chain storage is packing stuff based on CRCs as well (not sure whether it keeps them around for unpacking). There are CRCs validated by libz during inflation, but these aren't rechecked once the inflated bytes are cached in that silly undersized 16M delta base cache. So there is a lot that could likely be improved while keeping the same basic algorithms, by cracking open the black boxes of the xdiff engine and the delta-chain coding. The delta chain coding has no relationship to the source file. Currently even plain text files are delta chain coded on a byte basis, not a line basis. Just matching up the delta coding against a source text file to determine lines 0-N are common is costly, since you have a byte range in the delta coding and you want a line range out the end. To make things more challenging, the delta chain coding can be against completely different blobs. In a compare of A-B from the commit graph being walked by blame there is no requirement the delta coding uses this pairing, and it almost certainly never uses this direction (its usually B-A, if its even this pair!). Given your comments in the other patch, I understand why you probably won't be working on blame more. But the above may help someone else that has available time to continue. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] blame: large-scale performance rewrite
Shawn Pearce spea...@spearce.org writes: On Fri, Apr 25, 2014 at 4:56 PM, David Kastrup d...@gnu.org wrote: The previous implementation used a single sorted linear list of blame entries for organizing all partial or completed work. Every subtask had to scan the whole list, with most entries not being relevant to the task. The resulting run-time was quadratic to the number of separate chunks. This change gives every subtask its own data to work with. Subtasks are organized into struct origin chains hanging off particular commits. Commits are organized into a priority queue, processing them in commit date order in order to keep most of the work affecting a particular blob collated even in the presence of an extensive merge history. Without reading the code, this sounds like how JGit runs blame. For large files with a diversified history, a speedup by a factor of 3 or more is not unusual. And JGit was already usually slower than git-core. Now it will be even slower! :-) If your statement about JGit is accurate, it should likely have beat Git for large use cases (where the performance improvements are most important) as O(n) beats O(n^2) in the long run. At any rate, I see that I ended up posting this patch series at the end of the week again which makes for a somewhat lacklustre initial response from those who code Git for a regular living. Apropos: shaking the bugs regarding -M and -C options out of the code had taken a large toll because -M can cause the same or overlapping line regions to be responsible for different target regions and the original code implementing the straightforward blame blew up on the overlap. I spent a _lot_ of time tracking down that problem. As I am lousy focusing on more than one task, and as I don't get a regular paycheck anyway, this will have to remain my last contribution to Git if I am not going to recoup my losses. Patch 2 of this series tries giving the community of Git a serious chance at picking that option (I mean, there are literally millions of Git users around with a sizable number profiting) while not being obnoxious about it. My personal guess is that it will fail regarding both objectives. But then I've been surprised before by other free software communities when trying to make those particular two ends meet. At any rate, feedback about the performance of the patch from users disappointed by regular git blame would be welcome. Apart from the objective measurement of total time, the more subjective impression of interactive/incremental response (like in git gui blame) where the order of results will significantly differ (current git-blame --incremental focuses on getting blames resolved in first-lines-first manner, the proposed git-blame rather works on a newest-commits-first basis which might better match typical use cases) might be worth reporting. -- David Kastrup -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] blame: large-scale performance rewrite
On Sat, Apr 26, 2014 at 12:48 AM, David Kastrup d...@gnu.org wrote: Shawn Pearce spea...@spearce.org writes: On Fri, Apr 25, 2014 at 4:56 PM, David Kastrup d...@gnu.org wrote: The previous implementation used a single sorted linear list of blame entries for organizing all partial or completed work. Every subtask had to scan the whole list, with most entries not being relevant to the task. The resulting run-time was quadratic to the number of separate chunks. This change gives every subtask its own data to work with. Subtasks are organized into struct origin chains hanging off particular commits. Commits are organized into a priority queue, processing them in commit date order in order to keep most of the work affecting a particular blob collated even in the presence of an extensive merge history. Without reading the code, this sounds like how JGit runs blame. For large files with a diversified history, a speedup by a factor of 3 or more is not unusual. And JGit was already usually slower than git-core. Now it will be even slower! :-) If your statement about JGit is accurate, it should likely have beat Git for large use cases (where the performance improvements are most important) as O(n) beats O(n^2) in the long run. Agreed. In a few cases yes, JGit did beat git-core at blame running time. Unfortunately according to my profiling blame performance is still dominated by inflation and scanning of commit and tree objects to identify unmodified blobs and advance to the next scapegoat ancestor. Its entirely possible my blame implementation in JGit is still doing something stupid. Or its possible JIT translated Java just takes longer than natively compiled C. I am including JVM startup and JIT time in the timing comparison with git-core. Apart from the objective measurement of total time, the more subjective impression of interactive/incremental response (like in git gui blame) where the order of results will significantly differ (current git-blame --incremental focuses on getting blames resolved in first-lines-first manner, the proposed git-blame rather works on a newest-commits-first basis which might better match typical use cases) might be worth reporting. Seeing this fill during execution was the initial motivation I had for writing git gui blame. I don't think anyone cares about the order it displays in. In fact ordering my timestamp may be more what the user wants anyway, as you suggest above. Thanks for doing this. Unfortunately I can't read the patch itself as I am also trying to improve JGit's blame code for $DAY_JOB, and JGit is BSD licensed. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] blame: large-scale performance rewrite
Shawn Pearce spea...@spearce.org writes: On Sat, Apr 26, 2014 at 12:48 AM, David Kastrup d...@gnu.org wrote: Shawn Pearce spea...@spearce.org writes: And JGit was already usually slower than git-core. Now it will be even slower! :-) If your statement about JGit is accurate, it should likely have beat Git for large use cases (where the performance improvements are most important) as O(n) beats O(n^2) in the long run. Agreed. In a few cases yes, JGit did beat git-core at blame running time. Unfortunately according to my profiling blame performance is still dominated by inflation and scanning of commit and tree objects to identify unmodified blobs and advance to the next scapegoat ancestor. Oh, the C version is most certainly significantly impacted by that after my patch. One can _significantly_ speed it up by increasing core.deltaBaseCacheLimit from its rather silly value of 16M. If you have a comparable control in JGit and if your benchmarking points to the unpacking, that's where I'd suggest tweaking first. Apart from the objective measurement of total time, the more subjective impression of interactive/incremental response (like in git gui blame) where the order of results will significantly differ (current git-blame --incremental focuses on getting blames resolved in first-lines-first manner, the proposed git-blame rather works on a newest-commits-first basis which might better match typical use cases) might be worth reporting. Seeing this fill during execution was the initial motivation I had for writing git gui blame. It does not look impressively better to me, actually. Probably because git gui blame is running git blame several times. I don't think anyone cares about the order it displays in. In fact ordering my timestamp may be more what the user wants anyway, as you suggest above. What the user wants anyway is that git gui blame notifies git blame of the currently displayed window area whenever that changes, and that git blame then _first_ deals with all chunks with a visible on-screen part. Thanks for doing this. Unfortunately I can't read the patch itself as I am also trying to improve JGit's blame code for $DAY_JOB, and JGit is BSD licensed. Shrug. The patch is functionally equivalent to the previous behavior, the arrangement of linear lists on underlying data structures is hardly copyrightable, and Java implements linear lists differently anyhow. Merging two sorted linear lists is a straightforward operation, splitting a linear list into several others also is. The really tricky/expensive part was realizing that as opposed to target line number ranges, source line number ranges may overlap and/or be duplicate when using -M or -C options. That really messed things up for a long time and was hard to debug. Once I figured out what was going wrong, recoding the respective stuff was straightforward. I doubt that there is much copyrightable material to transfer as I seem to remember that Java does not have anything like a pointer. So the main stuff, juggling with linear lists, would not likely transfer in a reasonably recognizable manner. -- David Kastrup -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] blame: large-scale performance rewrite
Shawn Pearce spea...@spearce.org writes: Thanks for doing this. Unfortunately I can't read the patch itself as I am also trying to improve JGit's blame code for $DAY_JOB, and JGit is BSD licensed. Actually, I'd have suggested asking $EMPLOYER to buy the rights for looking at the code, but as I wrote previously, I'd seriously doubt that he'd get his money's worth for use in a _Java_ implementation. The C code I proposed is good for files with many small changes: I'd suggest benchmarking your JGit code with some of them. If the JGit is still dominated by unpacking, your implementation should be fine. The current C version instead thrashes around digging through its own all-purpose single linear list. Here are two real-world test cases: git://git.savannah.gnu.org/emacs.git git blame [-M / -C] src/xdisp.c http://repo.or.cz/r/wortliste.git git blame [-M / -C] wortliste The latter one is _really_ taking a severe hit from the O(n^2) algorithms. If your benchmarks for that one still point mostly to the unpacking, your jgit blame should be fine regarding the stuff I reimplemented. -- David Kastrup -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] blame: large-scale performance rewrite
On Sat, Apr 26, 2014 at 9:50 AM, David Kastrup d...@gnu.org wrote: Shawn Pearce spea...@spearce.org writes: On Sat, Apr 26, 2014 at 12:48 AM, David Kastrup d...@gnu.org wrote: Shawn Pearce spea...@spearce.org writes: And JGit was already usually slower than git-core. Now it will be even slower! :-) If your statement about JGit is accurate, it should likely have beat Git for large use cases (where the performance improvements are most important) as O(n) beats O(n^2) in the long run. Agreed. In a few cases yes, JGit did beat git-core at blame running time. Unfortunately according to my profiling blame performance is still dominated by inflation and scanning of commit and tree objects to identify unmodified blobs and advance to the next scapegoat ancestor. Oh, the C version is most certainly significantly impacted by that after my patch. One can _significantly_ speed it up by increasing core.deltaBaseCacheLimit from its rather silly value of 16M. If you have a comparable control in JGit and if your benchmarking points to the unpacking, that's where I'd suggest tweaking first. Good point, we have that control but I always forget to play with it during benchmarking. Apart from the objective measurement of total time, the more subjective impression of interactive/incremental response (like in git gui blame) where the order of results will significantly differ (current git-blame --incremental focuses on getting blames resolved in first-lines-first manner, the proposed git-blame rather works on a newest-commits-first basis which might better match typical use cases) might be worth reporting. Seeing this fill during execution was the initial motivation I had for writing git gui blame. It does not look impressively better to me, actually. Probably because git gui blame is running git blame several times. IIRC git gui blame runs blame only twice. Once with the simple blame, and then again with -M -C or something like that. Its a nice display because you can see who moved code here, and then who originally wrote it. Unfortunately it has to be done as two passes as the blame implementation does not know how to compute both in one pass. The move/copy detection is also computationally more expensive per revision visited, so it really slows down the simple who put it here if the two passes were somehow run in a single iteration. I don't think anyone cares about the order it displays in. In fact ordering my timestamp may be more what the user wants anyway, as you suggest above. What the user wants anyway is that git gui blame notifies git blame of the currently displayed window area whenever that changes, and that git blame then _first_ deals with all chunks with a visible on-screen part. That is a really good idea. Unfortunately finding that part of the window in the blame data can still take a while. You may compute other parts of the file anyway while looking for this part, as you had to compare two revisions and scapegoat sections to find out that commit didn't touch the interesting region and keep going back through history. You may get lucky and be able to fill in a recent region quickly, but I somehow suspect giving priority to the visible region in the priority queue won't really help to reduce latency of the visible region for the user. The really tricky/expensive part was realizing that as opposed to target line number ranges, source line number ranges may overlap and/or be duplicate when using -M or -C options. That really messed things up for a long time and was hard to debug. Once I figured out what was going wrong, recoding the respective stuff was straightforward. Right, and JGit blame still is missing the -M and -C options, as I have not implemented those yet. I got basic blame and reverse blame working a few years ago and then stopped working on the code for a while. Now we have interest in improving the latency for $DAY_JOB, so I've been poking at the code again for the last week or so. But that -M and -C thing is still not implemented, and I know its going to be tricky to get right with the way the scapegoating is passed along. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] blame: large-scale performance rewrite
Shawn Pearce spea...@spearce.org writes: Right, and JGit blame still is missing the -M and -C options, as I have not implemented those yet. I got basic blame and reverse blame working a few years ago and then stopped working on the code for a while. Now we have interest in improving the latency for $DAY_JOB, so I've been poking at the code again for the last week or so. But that -M and -C thing is still not implemented, and I know its going to be tricky to get right with the way the scapegoating is passed along. Actually, for -M/-C it would be saner to rip open the whole xdiff blackbox which internally goes to quite some effort in order to produce the linear ordering of a diff, and then -M has to simulate dropping that linear ordering requirement by doing a host of parallel diffs for each chunk. -- David Kastrup -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] blame: large-scale performance rewrite
David Kastrup d...@gnu.org writes: http://repo.or.cz/r/wortliste.git git blame [-M / -C] wortliste The latter one is _really_ taking a severe hit from the O(n^2) algorithms. If your benchmarks for that one still point mostly to the unpacking, your jgit blame should be fine regarding the stuff I reimplemented. Here's some example: dak@lola:/usr/local/tmp/wortliste$ time git blame -n -s wortliste /tmp/wl1 real15m47.118s user14m39.928s sys 1m1.872s dak@lola:/usr/local/tmp/wortliste$ time ../git/git blame -n -s wortliste /tmp/wl2 real3m40.947s user2m40.296s sys 0m59.440s Note how the system time is almost the same. I have some patches which make quite a bit of difference with that (at best, saving about half of the system time), but I have not yet found the silver bullet where I'd be reasonably sure that temporary memory use with non-linear history stays strictly in nice bounds. -- David Kastrup -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] blame: large-scale performance rewrite
On Sat, Apr 26, 2014 at 10:30 AM, David Kastrup d...@gnu.org wrote: David Kastrup d...@gnu.org writes: http://repo.or.cz/r/wortliste.git git blame [-M / -C] wortliste The latter one is _really_ taking a severe hit from the O(n^2) algorithms. If your benchmarks for that one still point mostly to the unpacking, your jgit blame should be fine regarding the stuff I reimplemented. Here's some example: dak@lola:/usr/local/tmp/wortliste$ time git blame -n -s wortliste /tmp/wl1 real15m47.118s user14m39.928s sys 1m1.872s Hah, this is quite the torture test. git before your patch is taking 22m11s on my laptop to compute this. (This was with default options, I noticed you passed -s to suppress the author formatting.) dak@lola:/usr/local/tmp/wortliste$ time ../git/git blame -n -s wortliste /tmp/wl2 real3m40.947s user2m40.296s sys 0m59.440s Meanwhile JGit computed in 4m30s on the same hardware. So I guess we are fine. Its still not as fast as I want it to be. :-) -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] blame: large-scale performance rewrite
The previous implementation used a single sorted linear list of blame entries for organizing all partial or completed work. Every subtask had to scan the whole list, with most entries not being relevant to the task. The resulting run-time was quadratic to the number of separate chunks. This change gives every subtask its own data to work with. Subtasks are organized into struct origin chains hanging off particular commits. Commits are organized into a priority queue, processing them in commit date order in order to keep most of the work affecting a particular blob collated even in the presence of an extensive merge history. For large files with a diversified history, a speedup by a factor of 3 or more is not unusual. Signed-off-by: David Kastrup d...@gnu.org --- builtin/blame.c | 865 +--- 1 file changed, 567 insertions(+), 298 deletions(-) diff --git a/builtin/blame.c b/builtin/blame.c index 88cb799..224f0ff 100644 --- a/builtin/blame.c +++ b/builtin/blame.c @@ -1,7 +1,8 @@ /* * Blame * - * Copyright (c) 2006, Junio C Hamano + * Copyright (c) 2006, 2014 by its authors + * See COPYING for licensing conditions */ #include cache.h @@ -18,7 +19,9 @@ #include cache-tree.h #include string-list.h #include mailmap.h +#include mergesort.h #include parse-options.h +#include prio-queue.h #include utf8.h #include userdiff.h #include line-range.h @@ -83,11 +86,42 @@ static unsigned blame_copy_score; */ struct origin { int refcnt; + /* Record preceding blame record for this blob */ struct origin *previous; + /* origins are put in a list linked via `next' hanging off the +* corresponding commit's util field in order to make finding +* them fast. The presence in this chain does not count +* towards the origin's reference count. It is tempting to +* let it count as long as the commit is pending examination, +* but even under circumstances where the commit will be +* present multiple times in the priority queue of unexamined +* commits, processing the first instance will not leave any +* work requiring the origin data for the second instance. An +* interspersed commit changing that would have to be +* preexisting with a different ancestry and with the same +* commit date in order to wedge itself between two instances +* of the same commit in the priority queue _and_ produce +* blame entries relevant for it. While we don't want to let +* us get tripped up by this case, it certainly does not seem +* worth optimizing for. +*/ + struct origin *next; struct commit *commit; + /* `suspects' contains blame entries that may be attributed to +* this origin's commit or to parent commits. When a commit +* is being processed, all suspects will be moved, either by +* assigning them to an origin in a different commit, or by +* shipping them to the scoreboard's ent list because they +* cannot be attributed to a different commit. +*/ + struct blame_entry *suspects; mmfile_t file; unsigned char blob_sha1[20]; unsigned mode; + /* guilty gets set when shipping any suspects to the final +* blame list instead of other commits +*/ + char guilty; char path[FLEX_ARRAY]; }; @@ -176,10 +210,22 @@ static inline struct origin *origin_incref(struct origin *o) static void origin_decref(struct origin *o) { if (o --o-refcnt = 0) { + struct origin *p, *l = NULL; if (o-previous) origin_decref(o-previous); free(o-file.ptr); - free(o); + /* Should be present exactly once in commit chain */ + for (p = o-commit-util; p; l = p, p = p-next) { + if (p == o) { + if (l) + l-next = p-next; + else + o-commit-util = p-next; + free(o); + return; + } + } + die(internal error in blame::origin_decref); } } @@ -193,8 +239,12 @@ static void drop_origin_blob(struct origin *o) /* * Each group of lines is described by a blame_entry; it can be split - * as we pass blame to the parents. They form a linked list in the - * scoreboard structure, sorted by the target line number. + * as we pass blame to the parents. They are arranged in linked lists + * kept as `suspects' of some unprocessed origin, or entered (when the + * blame origin has been finalized) into the scoreboard structure. + * While the scoreboard structure is only sorted at the end of + * processing (according to final image line number), the
Re: [PATCH 1/2] blame: large-scale performance rewrite
On Fri, Apr 25, 2014 at 4:56 PM, David Kastrup d...@gnu.org wrote: The previous implementation used a single sorted linear list of blame entries for organizing all partial or completed work. Every subtask had to scan the whole list, with most entries not being relevant to the task. The resulting run-time was quadratic to the number of separate chunks. This change gives every subtask its own data to work with. Subtasks are organized into struct origin chains hanging off particular commits. Commits are organized into a priority queue, processing them in commit date order in order to keep most of the work affecting a particular blob collated even in the presence of an extensive merge history. Without reading the code, this sounds like how JGit runs blame. For large files with a diversified history, a speedup by a factor of 3 or more is not unusual. And JGit was already usually slower than git-core. Now it will be even slower! :-) -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html