Re: [PATCH 1/2] blame: large-scale performance rewrite

2014-04-27 Thread David Kastrup
Shawn Pearce spea...@spearce.org writes:

 On Sat, Apr 26, 2014 at 10:30 AM, David Kastrup d...@gnu.org wrote:
 David Kastrup d...@gnu.org writes:

 Here's some example:

 dak@lola:/usr/local/tmp/wortliste$ time git blame -n -s wortliste /tmp/wl1

 real15m47.118s
 user14m39.928s
 sys 1m1.872s

 Hah, this is quite the torture test. git before your patch is taking
 22m11s on my laptop to compute this. (This was with default options, I
 noticed you passed -s to suppress the author formatting.)

 dak@lola:/usr/local/tmp/wortliste$ time ../git/git blame -n -s wortliste 
 /tmp/wl2

 real3m40.947s
 user2m40.296s
 sys 0m59.440s

 Meanwhile JGit computed in 4m30s on the same hardware. So I guess we
 are fine.

At least the stuff I fixed with regard to performance would seem to be
done right in JGit to start with.

 Its still not as fast as I want it to be. :-)

Most of the diff data/CRC is computed over and over because of the
blackbox use of xdiff.  And then the delta-chain storage is packing
stuff based on CRCs as well (not sure whether it keeps them around for
unpacking).  So there is a lot that could likely be improved while
keeping the same basic algorithms, by cracking open the black boxes of
the xdiff engine and the delta-chain coding.

-- 
David Kastrup
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] blame: large-scale performance rewrite

2014-04-27 Thread Shawn Pearce
On Sat, Apr 26, 2014 at 2:39 PM, David Kastrup d...@gnu.org wrote:

 At least the stuff I fixed with regard to performance would seem to be
 done right in JGit to start with.

Hah! Its Java. We have to do things right, otherwise its too slow. :-)

 Its still not as fast as I want it to be. :-)

 Most of the diff data/CRC is computed over and over because of the
 blackbox use of xdiff.

Yes, I have been thinking about this all week.

JGit blame uses HistogramDiff by default instead of MyersDiff. The
first stage after we trim common header/trailer from both files is to
compute a hash of each line and store those hashes. Those hashes are
discarded as the blame algorithm moves to the next commit.

Clearly for a commit chain of A - B - C, the hashes computed at B
for the A-B compare can be reused for the B-C compare. This is not
the case in either git-core or JGit, because the diff algorithm is a
block box to the blame algorithm. I think this is what you mean by the
CRC being computed again.


For any given compare blame has a list of regions it is interested in
learning about from the diff algorithm. Anything outside of those
regions is useless noise that will be discarded. I have been pondering
pushing that region list down into the diff algorithm so it can avoid
executing on sections that are not relevant to the caller. At least
for HistogramDiff this makes some sense, the algorithm is recursively
applied after it finds a longest common subsequence. If one side of
the LCS is outside of the region of interest from blame, there is no
value in recursing on that portion.

If the blame region list covers a small enough portion, it may even
make sense to avoid the common header/trailer elimination
preprocessing step. Unfortunately that sounds hard as you could be
working with a file like a ChangeLog which grows on one of those
sides.


  And then the delta-chain storage is packing
 stuff based on CRCs as well (not sure whether it keeps them around for
 unpacking).

There are CRCs validated by libz during inflation, but these aren't
rechecked once the inflated bytes are cached in that silly undersized
16M delta base cache.

 So there is a lot that could likely be improved while
 keeping the same basic algorithms, by cracking open the black boxes of
 the xdiff engine and the delta-chain coding.

The delta chain coding has no relationship to the source file.
Currently even plain text files are delta chain coded on a byte basis,
not a line basis. Just matching up the delta coding against a source
text file to determine lines 0-N are common is costly, since you have
a byte range in the delta coding and you want a line range out the
end.

To make things more challenging, the delta chain coding can be against
completely different blobs. In a compare of A-B from the commit graph
being walked by blame there is no requirement the delta coding uses
this pairing, and it almost certainly never uses this direction (its
usually B-A, if its even this pair!).


Given your comments in the other patch, I understand why you probably
won't be working on blame more. But the above may help someone else
that has available time to continue.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] blame: large-scale performance rewrite

2014-04-26 Thread David Kastrup
Shawn Pearce spea...@spearce.org writes:

 On Fri, Apr 25, 2014 at 4:56 PM, David Kastrup d...@gnu.org wrote:
 The previous implementation used a single sorted linear list of blame
 entries for organizing all partial or completed work.  Every subtask had
 to scan the whole list, with most entries not being relevant to the
 task.  The resulting run-time was quadratic to the number of separate
 chunks.

 This change gives every subtask its own data to work with.  Subtasks are
 organized into struct origin chains hanging off particular commits.
 Commits are organized into a priority queue, processing them in commit
 date order in order to keep most of the work affecting a particular blob
 collated even in the presence of an extensive merge history.

 Without reading the code, this sounds like how JGit runs blame.

 For large files with a diversified history, a speedup by a factor of 3
 or more is not unusual.

 And JGit was already usually slower than git-core. Now it will be even
 slower! :-)

If your statement about JGit is accurate, it should likely have beat Git
for large use cases (where the performance improvements are most
important) as O(n) beats O(n^2) in the long run.

At any rate, I see that I ended up posting this patch series at the end
of the week again which makes for a somewhat lacklustre initial response
from those who code Git for a regular living.

Apropos: shaking the bugs regarding -M and -C options out of the code
had taken a large toll because -M can cause the same or overlapping line
regions to be responsible for different target regions and the original
code implementing the straightforward blame blew up on the overlap.
I spent a _lot_ of time tracking down that problem.

As I am lousy focusing on more than one task, and as I don't get a
regular paycheck anyway, this will have to remain my last contribution
to Git if I am not going to recoup my losses.

Patch 2 of this series tries giving the community of Git a serious
chance at picking that option (I mean, there are literally millions of
Git users around with a sizable number profiting) while not being
obnoxious about it.

My personal guess is that it will fail regarding both objectives.  But
then I've been surprised before by other free software communities when
trying to make those particular two ends meet.

At any rate, feedback about the performance of the patch from users
disappointed by regular git blame would be welcome.

Apart from the objective measurement of total time, the more
subjective impression of interactive/incremental response (like in git
gui blame) where the order of results will significantly differ (current
git-blame --incremental focuses on getting blames resolved in
first-lines-first manner, the proposed git-blame rather works on a
newest-commits-first basis which might better match typical use cases)
might be worth reporting.

-- 
David Kastrup
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] blame: large-scale performance rewrite

2014-04-26 Thread Shawn Pearce
On Sat, Apr 26, 2014 at 12:48 AM, David Kastrup d...@gnu.org wrote:
 Shawn Pearce spea...@spearce.org writes:

 On Fri, Apr 25, 2014 at 4:56 PM, David Kastrup d...@gnu.org wrote:
 The previous implementation used a single sorted linear list of blame
 entries for organizing all partial or completed work.  Every subtask had
 to scan the whole list, with most entries not being relevant to the
 task.  The resulting run-time was quadratic to the number of separate
 chunks.

 This change gives every subtask its own data to work with.  Subtasks are
 organized into struct origin chains hanging off particular commits.
 Commits are organized into a priority queue, processing them in commit
 date order in order to keep most of the work affecting a particular blob
 collated even in the presence of an extensive merge history.

 Without reading the code, this sounds like how JGit runs blame.

 For large files with a diversified history, a speedup by a factor of 3
 or more is not unusual.

 And JGit was already usually slower than git-core. Now it will be even
 slower! :-)

 If your statement about JGit is accurate, it should likely have beat Git
 for large use cases (where the performance improvements are most
 important) as O(n) beats O(n^2) in the long run.

Agreed.

In a few cases yes, JGit did beat git-core at blame running time.
Unfortunately according to my profiling blame performance is still
dominated by inflation and scanning of commit and tree objects to
identify unmodified blobs and advance to the next scapegoat ancestor.

Its entirely possible my blame implementation in JGit is still doing
something stupid. Or its possible JIT translated Java just takes
longer than natively compiled C. I am including JVM startup and JIT
time in the timing comparison with git-core.

 Apart from the objective measurement of total time, the more
 subjective impression of interactive/incremental response (like in git
 gui blame) where the order of results will significantly differ (current
 git-blame --incremental focuses on getting blames resolved in
 first-lines-first manner, the proposed git-blame rather works on a
 newest-commits-first basis which might better match typical use cases)
 might be worth reporting.

Seeing this fill during execution was the initial motivation I had for
writing git gui blame. I don't think anyone cares about the order it
displays in. In fact ordering my timestamp may be more what the user
wants anyway, as you suggest above.

Thanks for doing this. Unfortunately I can't read the patch itself as
I am also trying to improve JGit's blame code for $DAY_JOB, and JGit
is BSD licensed.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] blame: large-scale performance rewrite

2014-04-26 Thread David Kastrup
Shawn Pearce spea...@spearce.org writes:

 On Sat, Apr 26, 2014 at 12:48 AM, David Kastrup d...@gnu.org wrote:
 Shawn Pearce spea...@spearce.org writes:

 And JGit was already usually slower than git-core. Now it will be
 even slower! :-)

 If your statement about JGit is accurate, it should likely have beat
 Git for large use cases (where the performance improvements are most
 important) as O(n) beats O(n^2) in the long run.

 Agreed.

 In a few cases yes, JGit did beat git-core at blame running time.
 Unfortunately according to my profiling blame performance is still
 dominated by inflation and scanning of commit and tree objects to
 identify unmodified blobs and advance to the next scapegoat ancestor.

Oh, the C version is most certainly significantly impacted by that after
my patch.  One can _significantly_ speed it up by increasing
core.deltaBaseCacheLimit from its rather silly value of 16M.  If you
have a comparable control in JGit and if your benchmarking points to the
unpacking, that's where I'd suggest tweaking first.

 Apart from the objective measurement of total time, the more
 subjective impression of interactive/incremental response (like in
 git gui blame) where the order of results will significantly differ
 (current git-blame --incremental focuses on getting blames resolved
 in first-lines-first manner, the proposed git-blame rather works on a
 newest-commits-first basis which might better match typical use
 cases) might be worth reporting.

 Seeing this fill during execution was the initial motivation I had for
 writing git gui blame.

It does not look impressively better to me, actually.  Probably because
git gui blame is running git blame several times.

 I don't think anyone cares about the order it displays in. In fact
 ordering my timestamp may be more what the user wants anyway, as you
 suggest above.

What the user wants anyway is that git gui blame notifies git blame
of the currently displayed window area whenever that changes, and that
git blame then _first_ deals with all chunks with a visible on-screen
part.

 Thanks for doing this. Unfortunately I can't read the patch itself as
 I am also trying to improve JGit's blame code for $DAY_JOB, and JGit
 is BSD licensed.

Shrug.  The patch is functionally equivalent to the previous behavior,
the arrangement of linear lists on underlying data structures is hardly
copyrightable, and Java implements linear lists differently anyhow.
Merging two sorted linear lists is a straightforward operation,
splitting a linear list into several others also is.

The really tricky/expensive part was realizing that as opposed to target
line number ranges, source line number ranges may overlap and/or be
duplicate when using -M or -C options.  That really messed things up for
a long time and was hard to debug.  Once I figured out what was going
wrong, recoding the respective stuff was straightforward.

I doubt that there is much copyrightable material to transfer as I seem
to remember that Java does not have anything like a pointer.  So the
main stuff, juggling with linear lists, would not likely transfer in a
reasonably recognizable manner.

-- 
David Kastrup
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] blame: large-scale performance rewrite

2014-04-26 Thread David Kastrup
Shawn Pearce spea...@spearce.org writes:

 Thanks for doing this. Unfortunately I can't read the patch itself as
 I am also trying to improve JGit's blame code for $DAY_JOB, and JGit
 is BSD licensed.

Actually, I'd have suggested asking $EMPLOYER to buy the rights for
looking at the code, but as I wrote previously, I'd seriously doubt that
he'd get his money's worth for use in a _Java_ implementation.

The C code I proposed is good for files with many small changes: I'd
suggest benchmarking your JGit code with some of them.  If the JGit is
still dominated by unpacking, your implementation should be fine.  The
current C version instead thrashes around digging through its own
all-purpose single linear list.

Here are two real-world test cases:

git://git.savannah.gnu.org/emacs.git
git blame [-M / -C] src/xdisp.c

http://repo.or.cz/r/wortliste.git
git blame [-M / -C] wortliste

The latter one is _really_ taking a severe hit from the O(n^2)
algorithms.  If your benchmarks for that one still point mostly to the
unpacking, your jgit blame should be fine regarding the stuff
I reimplemented.

-- 
David Kastrup
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] blame: large-scale performance rewrite

2014-04-26 Thread Shawn Pearce
On Sat, Apr 26, 2014 at 9:50 AM, David Kastrup d...@gnu.org wrote:
 Shawn Pearce spea...@spearce.org writes:

 On Sat, Apr 26, 2014 at 12:48 AM, David Kastrup d...@gnu.org wrote:
 Shawn Pearce spea...@spearce.org writes:

 And JGit was already usually slower than git-core. Now it will be
 even slower! :-)

 If your statement about JGit is accurate, it should likely have beat
 Git for large use cases (where the performance improvements are most
 important) as O(n) beats O(n^2) in the long run.

 Agreed.

 In a few cases yes, JGit did beat git-core at blame running time.
 Unfortunately according to my profiling blame performance is still
 dominated by inflation and scanning of commit and tree objects to
 identify unmodified blobs and advance to the next scapegoat ancestor.

 Oh, the C version is most certainly significantly impacted by that after
 my patch.  One can _significantly_ speed it up by increasing
 core.deltaBaseCacheLimit from its rather silly value of 16M.  If you
 have a comparable control in JGit and if your benchmarking points to the
 unpacking, that's where I'd suggest tweaking first.

Good point, we have that control but I always forget to play with it
during benchmarking.

 Apart from the objective measurement of total time, the more
 subjective impression of interactive/incremental response (like in
 git gui blame) where the order of results will significantly differ
 (current git-blame --incremental focuses on getting blames resolved
 in first-lines-first manner, the proposed git-blame rather works on a
 newest-commits-first basis which might better match typical use
 cases) might be worth reporting.

 Seeing this fill during execution was the initial motivation I had for
 writing git gui blame.

 It does not look impressively better to me, actually.  Probably because
 git gui blame is running git blame several times.

IIRC git gui blame runs blame only twice. Once with the simple blame,
and then again with -M -C or something like that. Its a nice display
because you can see who moved code here, and then who originally wrote
it. Unfortunately it has to be done as two passes as the blame
implementation does not know how to compute both in one pass.

The move/copy detection is also computationally more expensive per
revision visited, so it really slows down the simple who put it here
if the two passes were somehow run in a single iteration.

 I don't think anyone cares about the order it displays in. In fact
 ordering my timestamp may be more what the user wants anyway, as you
 suggest above.

 What the user wants anyway is that git gui blame notifies git blame
 of the currently displayed window area whenever that changes, and that
 git blame then _first_ deals with all chunks with a visible on-screen
 part.

That is a really good idea. Unfortunately finding that part of the
window in the blame data can still take a while. You may compute other
parts of the file anyway while looking for this part, as you had to
compare two revisions and scapegoat sections to find out that commit
didn't touch the interesting region and keep going back through
history.

You may get lucky and be able to fill in a recent region quickly, but
I somehow suspect giving priority to the visible region in the
priority queue won't really help to reduce latency of the visible
region for the user.

 The really tricky/expensive part was realizing that as opposed to target
 line number ranges, source line number ranges may overlap and/or be
 duplicate when using -M or -C options.  That really messed things up for
 a long time and was hard to debug.  Once I figured out what was going
 wrong, recoding the respective stuff was straightforward.

Right, and JGit blame still is missing the -M and -C options, as I
have not implemented those yet. I got basic blame and reverse blame
working a few years ago and then stopped working on the code for a
while. Now we have interest in improving the latency for $DAY_JOB, so
I've been poking at the code again for the last week or so.

But that -M and -C thing is still not implemented, and I know its
going to be tricky to get right with the way the scapegoating is
passed along.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] blame: large-scale performance rewrite

2014-04-26 Thread David Kastrup
Shawn Pearce spea...@spearce.org writes:

 Right, and JGit blame still is missing the -M and -C options, as I
 have not implemented those yet. I got basic blame and reverse blame
 working a few years ago and then stopped working on the code for a
 while. Now we have interest in improving the latency for $DAY_JOB, so
 I've been poking at the code again for the last week or so.

 But that -M and -C thing is still not implemented, and I know its
 going to be tricky to get right with the way the scapegoating is
 passed along.

Actually, for -M/-C it would be saner to rip open the whole xdiff
blackbox which internally goes to quite some effort in order to produce
the linear ordering of a diff, and then -M has to simulate dropping that
linear ordering requirement by doing a host of parallel diffs for each
chunk.

-- 
David Kastrup
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] blame: large-scale performance rewrite

2014-04-26 Thread David Kastrup
David Kastrup d...@gnu.org writes:

 http://repo.or.cz/r/wortliste.git
 git blame [-M / -C] wortliste

 The latter one is _really_ taking a severe hit from the O(n^2)
 algorithms.  If your benchmarks for that one still point mostly to the
 unpacking, your jgit blame should be fine regarding the stuff
 I reimplemented.

Here's some example:

dak@lola:/usr/local/tmp/wortliste$ time git blame -n -s wortliste /tmp/wl1

real15m47.118s
user14m39.928s
sys 1m1.872s
dak@lola:/usr/local/tmp/wortliste$ time ../git/git blame -n -s wortliste 
/tmp/wl2

real3m40.947s
user2m40.296s
sys 0m59.440s

Note how the system time is almost the same.  I have some patches which
make quite a bit of difference with that (at best, saving about half of
the system time), but I have not yet found the silver bullet where I'd
be reasonably sure that temporary memory use with non-linear history
stays strictly in nice bounds.

-- 
David Kastrup
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] blame: large-scale performance rewrite

2014-04-26 Thread Shawn Pearce
On Sat, Apr 26, 2014 at 10:30 AM, David Kastrup d...@gnu.org wrote:
 David Kastrup d...@gnu.org writes:

 http://repo.or.cz/r/wortliste.git
 git blame [-M / -C] wortliste

 The latter one is _really_ taking a severe hit from the O(n^2)
 algorithms.  If your benchmarks for that one still point mostly to the
 unpacking, your jgit blame should be fine regarding the stuff
 I reimplemented.

 Here's some example:

 dak@lola:/usr/local/tmp/wortliste$ time git blame -n -s wortliste /tmp/wl1

 real15m47.118s
 user14m39.928s
 sys 1m1.872s

Hah, this is quite the torture test. git before your patch is taking
22m11s on my laptop to compute this. (This was with default options, I
noticed you passed -s to suppress the author formatting.)

 dak@lola:/usr/local/tmp/wortliste$ time ../git/git blame -n -s wortliste 
 /tmp/wl2

 real3m40.947s
 user2m40.296s
 sys 0m59.440s

Meanwhile JGit computed in 4m30s on the same hardware. So I guess we
are fine. Its still not as fast as I want it to be. :-)
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] blame: large-scale performance rewrite

2014-04-25 Thread David Kastrup
The previous implementation used a single sorted linear list of blame
entries for organizing all partial or completed work.  Every subtask had
to scan the whole list, with most entries not being relevant to the
task.  The resulting run-time was quadratic to the number of separate
chunks.

This change gives every subtask its own data to work with.  Subtasks are
organized into struct origin chains hanging off particular commits.
Commits are organized into a priority queue, processing them in commit
date order in order to keep most of the work affecting a particular blob
collated even in the presence of an extensive merge history.

For large files with a diversified history, a speedup by a factor of 3
or more is not unusual.

Signed-off-by: David Kastrup d...@gnu.org
---
 builtin/blame.c | 865 +---
 1 file changed, 567 insertions(+), 298 deletions(-)

diff --git a/builtin/blame.c b/builtin/blame.c
index 88cb799..224f0ff 100644
--- a/builtin/blame.c
+++ b/builtin/blame.c
@@ -1,7 +1,8 @@
 /*
  * Blame
  *
- * Copyright (c) 2006, Junio C Hamano
+ * Copyright (c) 2006, 2014 by its authors
+ * See COPYING for licensing conditions
  */
 
 #include cache.h
@@ -18,7 +19,9 @@
 #include cache-tree.h
 #include string-list.h
 #include mailmap.h
+#include mergesort.h
 #include parse-options.h
+#include prio-queue.h
 #include utf8.h
 #include userdiff.h
 #include line-range.h
@@ -83,11 +86,42 @@ static unsigned blame_copy_score;
  */
 struct origin {
int refcnt;
+   /* Record preceding blame record for this blob */
struct origin *previous;
+   /* origins are put in a list linked via `next' hanging off the
+* corresponding commit's util field in order to make finding
+* them fast.  The presence in this chain does not count
+* towards the origin's reference count.  It is tempting to
+* let it count as long as the commit is pending examination,
+* but even under circumstances where the commit will be
+* present multiple times in the priority queue of unexamined
+* commits, processing the first instance will not leave any
+* work requiring the origin data for the second instance.  An
+* interspersed commit changing that would have to be
+* preexisting with a different ancestry and with the same
+* commit date in order to wedge itself between two instances
+* of the same commit in the priority queue _and_ produce
+* blame entries relevant for it.  While we don't want to let
+* us get tripped up by this case, it certainly does not seem
+* worth optimizing for.
+*/
+   struct origin *next;
struct commit *commit;
+   /* `suspects' contains blame entries that may be attributed to
+* this origin's commit or to parent commits.  When a commit
+* is being processed, all suspects will be moved, either by
+* assigning them to an origin in a different commit, or by
+* shipping them to the scoreboard's ent list because they
+* cannot be attributed to a different commit.
+*/
+   struct blame_entry *suspects;
mmfile_t file;
unsigned char blob_sha1[20];
unsigned mode;
+   /* guilty gets set when shipping any suspects to the final
+* blame list instead of other commits
+*/
+   char guilty;
char path[FLEX_ARRAY];
 };
 
@@ -176,10 +210,22 @@ static inline struct origin *origin_incref(struct origin 
*o)
 static void origin_decref(struct origin *o)
 {
if (o  --o-refcnt = 0) {
+   struct origin *p, *l = NULL;
if (o-previous)
origin_decref(o-previous);
free(o-file.ptr);
-   free(o);
+   /* Should be present exactly once in commit chain */
+   for (p = o-commit-util; p; l = p, p = p-next) {
+   if (p == o) {
+   if (l)
+   l-next = p-next;
+   else
+   o-commit-util = p-next;
+   free(o);
+   return;
+   }
+   }
+   die(internal error in blame::origin_decref);
}
 }
 
@@ -193,8 +239,12 @@ static void drop_origin_blob(struct origin *o)
 
 /*
  * Each group of lines is described by a blame_entry; it can be split
- * as we pass blame to the parents.  They form a linked list in the
- * scoreboard structure, sorted by the target line number.
+ * as we pass blame to the parents.  They are arranged in linked lists
+ * kept as `suspects' of some unprocessed origin, or entered (when the
+ * blame origin has been finalized) into the scoreboard structure.
+ * While the scoreboard structure is only sorted at the end of
+ * processing (according to final image line number), the 

Re: [PATCH 1/2] blame: large-scale performance rewrite

2014-04-25 Thread Shawn Pearce
On Fri, Apr 25, 2014 at 4:56 PM, David Kastrup d...@gnu.org wrote:
 The previous implementation used a single sorted linear list of blame
 entries for organizing all partial or completed work.  Every subtask had
 to scan the whole list, with most entries not being relevant to the
 task.  The resulting run-time was quadratic to the number of separate
 chunks.

 This change gives every subtask its own data to work with.  Subtasks are
 organized into struct origin chains hanging off particular commits.
 Commits are organized into a priority queue, processing them in commit
 date order in order to keep most of the work affecting a particular blob
 collated even in the presence of an extensive merge history.

Without reading the code, this sounds like how JGit runs blame.

 For large files with a diversified history, a speedup by a factor of 3
 or more is not unusual.

And JGit was already usually slower than git-core. Now it will be even
slower! :-)
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html