Re: [PATCH v8 4/5] Implement line-history search (git log -L)

2013-03-01 Thread Thomas Rast
Junio C Hamano gits...@pobox.com writes:

 Junio C Hamano gits...@pobox.com writes:

 Overall, I like this better than the log --follow hack; as the
 revision traversal is done without any pathspec when being careful
 and slow (aka -M), you do not suffer from the just use a singleton
 pathspec globally regardless of what other history paths are being
 traversed limitation of log --follow.

 The patch series certainly is interesting.

 Having said that, I notice that careful and slow is just too slow
 to be usable even on a small tree like ours.  Try running

 $ git log -M -L:get_name:builtin/describe.c

 and see how long you have to wait until you hit the first line of
 output.

I'll dig some more.  It *should* be essentially the following times
taken together:

  $ time git log --raw -M --topo-order /dev/null

  real0m5.448s
  user0m4.599s
  sys 0m0.794s
  $ time git log -L:get_name:builtin/describe.c /dev/null

  real0m0.832s
  user0m0.796s
  sys 0m0.032s

  $ time git log -L:get_name:builtin-describe.c 81b50f3ce40^ /dev/null

  real0m0.489s
  user0m0.465s
  sys 0m0.022s

So I'm losing a factor of about 4 somewhere, which I can't explain right
now.

It could be improved further if --follow was fixed, because then the
first step should be much faster than diffing *all* the trees.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8 4/5] Implement line-history search (git log -L)

2013-03-01 Thread Thomas Rast
Thomas Rast tr...@student.ethz.ch writes:

 Junio C Hamano gits...@pobox.com writes:

 I notice that careful and slow is just too slow
 to be usable even on a small tree like ours.  Try running

 $ git log -M -L:get_name:builtin/describe.c

 and see how long you have to wait until you hit the first line of
 output.

 I'll dig some more.  It *should* be essentially the following times
 taken together:

   $ time git log --raw -M --topo-order /dev/null
   real0m5.448s
[...]
   $ time git log -L:get_name:builtin/describe.c /dev/null
   real0m0.832s
[...]
   $ time git log -L:get_name:builtin-describe.c 81b50f3ce40^ /dev/null
   real0m0.489s
[...]
 So I'm losing a factor of about 4 somewhere, which I can't explain right
 now.

It seems I still don't understand half of this code.

log -M --raw in the above somehow appears to use and optimize for
-M100%, whereas the log -L code is currently written for general args.

However, I couldn't pin down where this happens; I only know from call
graph profiling[1] that log -M --raw never goes through diffcore_std().
And indeed according to the same sort of profiling, log -M -L spends
most of its time within diffcore_std() unpacking blobs to find renames.

Even more confusingly, try_to_follow_renames() _does_ call into
diffcore_std, so there seems to be some merit to calling it after all.

With the hacky patch below,  I get something more reasonable:

  $ time ./git-log -M -L:get_name:builtin/describe.c  /dev/null

  real0m3.794s
  user0m3.734s
  sys 0m0.045s

That's compared to about 35s on my machine without the patch.  It still
calls diffcore_std(), but before that it discards all diff pairs except
those affecting the path(s) we're interested in and any deletions (so
that they can be used as rename sources).  After diffcore_std() it
discards all pairs that we're not interested in.

[1]  valgrind --tool=callgrind --trace-children=yes

-- 8 --
Subject: [PATCH] WIP: speed up log -L... -M

---
 line-log.c | 58 ++
 1 file changed, 54 insertions(+), 4 deletions(-)

diff --git a/line-log.c b/line-log.c
index a74bbaf..b03cc0b 100644
--- a/line-log.c
+++ b/line-log.c
@@ -997,7 +997,52 @@ static void move_diff_queue(struct diff_queue_struct *dst,
DIFF_QUEUE_CLEAR(src);
 }
 
-static void queue_diffs(struct diff_options *opt,
+static void filter_diffs_for_paths(struct line_log_data *range, int 
keep_deletions)
+{
+   int i;
+   struct diff_queue_struct outq;
+   DIFF_QUEUE_CLEAR(outq);
+
+   /* fprintf(stderr, -- filtering:\n); */
+
+   for (i = 0; i  diff_queued_diff.nr; i++) {
+   struct diff_filepair *p = diff_queued_diff.queue[i];
+   struct line_log_data *rg = NULL;
+   /* fprintf(stderr, %-38s\t%-38s\n, (p-one ? p-one-path : 
none), (p-two ? p-two-path : none)); */
+   if (!DIFF_FILE_VALID(p-two)) {
+   if (keep_deletions)
+   diff_q(outq, p);
+   else
+   diff_free_filepair(p);
+   continue;
+   }
+   for (rg = range; rg; rg = rg-next) {
+   if (!strcmp(rg-spec-path, p-two-path))
+   break;
+   }
+   if (rg)
+   diff_q(outq, p);
+   else
+   diff_free_filepair(p);
+   }
+   free(diff_queued_diff.queue);
+   diff_queued_diff = outq;
+}
+
+static inline int diff_might_be_rename(void)
+{
+   int i;
+   for (i = 0; i  diff_queued_diff.nr; i++)
+   if (!DIFF_FILE_VALID(diff_queued_diff.queue[i]-one)) {
+   /* fprintf(stderr, diff_might_be_rename found creation 
of: %s\n, */
+   /*  diff_queued_diff.queue[i]-two-path); */
+   return 1;
+   }
+   return 0;
+}
+
+static void queue_diffs(struct line_log_data *range,
+   struct diff_options *opt,
struct diff_queue_struct *queue,
struct commit *commit, struct commit *parent)
 {
@@ -1013,7 +1058,12 @@ static void queue_diffs(struct diff_options *opt,
 
DIFF_QUEUE_CLEAR(diff_queued_diff);
diff_tree(desc1, desc2, , opt);
-   diffcore_std(opt);
+   if (opt-detect_rename) {
+   filter_diffs_for_paths(range, 1);
+   if (diff_might_be_rename())
+   diffcore_std(opt);
+   filter_diffs_for_paths(range, 0);
+   }
move_diff_queue(queue, diff_queued_diff);
 
if (tree1)
@@ -1297,7 +1347,7 @@ static int process_ranges_ordinary_commit(struct rev_info 
*rev, struct commit *c
if (commit-parents)
parent = commit-parents-item;
 
-   queue_diffs(rev-diffopt, queue, commit, parent);
+   queue_diffs(range, rev-diffopt, queue, commit, 

Re: [PATCH v8 4/5] Implement line-history search (git log -L)

2013-02-28 Thread Junio C Hamano
Thomas Rast tr...@student.ethz.ch writes:

 @@ -138,6 +155,11 @@ Examples
   This makes sense only when following a strict policy of merging all
   topic branches when staying on a single integration branch.
  
 +git log -L '/int main/',/^}/:main.c::
 +
 + Shows how the function `main()` in the file 'main.c' evolved
 + over time.

Perhaps have ^ before that int?

 diff --git a/line-log.c b/line-log.c
 index b167b00..789b7c4 100644
 --- a/line-log.c
 +++ b/line-log.c
 @@ -1,6 +1,442 @@
  #include git-compat-util.h
 +#include cache.h
 +#include tag.h
 +#include blob.h
 +#include tree.h
 +#include diff.h
 +#include commit.h
 +#include decorate.h
 +#include revision.h
 +#include xdiff-interface.h
 +#include strbuf.h
 +#include log-tree.h
 +#include graph.h
  #include line-log.h
  
 +void range_set_grow (struct range_set *rs, size_t extra)

Lose the extra SP (everywhere in this file).

 +{
 + ALLOC_GROW(rs-ranges, rs-nr + extra, rs-alloc);
 +}
 +
 +/* Either initialization would be fine */
 +#define RANGE_SET_INIT {0}
 +
 +void range_set_init (struct range_set *rs, size_t prealloc)
 +{
 + rs-alloc = rs-nr = 0;
 + rs-ranges = NULL;
 + if (prealloc)
 + range_set_grow(rs, prealloc);
 +}
 +
 +void range_set_release (struct range_set *rs)
 +{
 + free(rs-ranges);
 + rs-alloc = rs-nr = 0;
 + rs-ranges = NULL;
 +}
 +
 +/* dst must be uninitialized! */
 +void range_set_copy (struct range_set *dst, struct range_set *src)
 +{
 + range_set_init(dst, src-nr);
 + memcpy(dst-ranges, src-ranges, src-nr*sizeof(struct range_set));
 + dst-nr = src-nr;
 +}
 +void range_set_move (struct range_set *dst, struct range_set *src)
 +{
 + range_set_release(dst);
 + dst-ranges = src-ranges;
 + dst-nr = src-nr;
 + dst-alloc = src-alloc;
 + src-ranges = NULL;
 + src-alloc = src-nr = 0;
 +}
 +
 +/* tack on a _new_ range _at the end_ */
 +void range_set_append (struct range_set *rs, long a, long b)
 +{
 + assert(a = b);
 + assert(rs-nr == 0 || rs-ranges[rs-nr-1].end = a);
 + range_set_grow(rs, 1);
 + rs-ranges[rs-nr].start = a;
 + rs-ranges[rs-nr].end = b;
 + rs-nr++;
 +}
 +
 +static int range_cmp (const void *_r, const void *_s)
 +{
 + const struct range *r = _r;
 + const struct range *s = _s;
 +
 + /* this could be simply 'return r.start-s.start', but for the types */
 + if (r-start == s-start)
 + return 0;
 + if (r-start  s-start)
 + return -1;
 + return 1;
 +}
 +
 +/*
 + * Helper: In-place pass of sorting and merging the ranges in the
 + * range set, to re-establish the invariants after another operation
 + *
 + * NEEDSWORK currently not needed
 + */
 +static void sort_and_merge_range_set (struct range_set *rs)
 +{
 + int i;
 + int o = 1; /* output cursor */
 +
 + qsort(rs-ranges, rs-nr, sizeof(struct range), range_cmp);
 +
 + for (i = 1; i  rs-nr; i++) {
 + if (rs-ranges[i].start = rs-ranges[o-1].end) {
 + rs-ranges[o-1].end = rs-ranges[i].end;
 + } else {
 + rs-ranges[o].start = rs-ranges[i].start;
 + rs-ranges[o].end = rs-ranges[i].end;
 + o++;
 + }
 + }
 + assert(o = rs-nr);
 + rs-nr = o;
 +}
 +
 +/*
 + * Union of range sets (i.e., sets of line numbers).  Used to merge
 + * them when searches meet at a common ancestor.
 + *
 + * This is also where the ranges are consolidated into canonical form:
 + * overlapping and adjacent ranges are merged, and empty ranges are
 + * removed.
 + */
 +static void range_set_union (struct range_set *out,
 +  struct range_set *a, struct range_set *b)
 +{
 + int i = 0, j = 0, o = 0;
 + struct range *ra = a-ranges;
 + struct range *rb = b-ranges;
 + /* cannot make an alias of out-ranges: it may change during grow */
 +
 + assert(out-nr == 0);
 + while (i  a-nr || j  b-nr) {
 + struct range *new;
 + if (i  a-nr  j  b-nr) {
 + if (ra[i].start  rb[j].start)
 + new = ra[i++];
 + else if (ra[i].start  rb[j].start)
 + new = rb[j++];
 + else if (ra[i].end  rb[j].end)
 + new = ra[i++];
 + else
 + new = rb[j++];
 + } else if (i  a-nr)  /* b exhausted */
 + new = ra[i++];
 + else   /* a exhausted */
 + new = rb[j++];
 + if (new-start == new-end)
 + ; /* empty range */
 + else if (!o || out-ranges[o-1].end  new-start) {
 + range_set_grow(out, 1);
 + out-ranges[o].start = new-start;
 + out-ranges[o].end = new-end;
 + o++;
 + } else if 

Re: [PATCH v8 4/5] Implement line-history search (git log -L)

2013-02-28 Thread Thomas Rast
Junio C Hamano gits...@pobox.com writes:

 +/*
 + * NEEDSWORK: manually building a diff here is not the Right
 + * Thing(tm).  log -L should be built into the diff pipeline.

 I am not sure about this design, and do not necessarily agree that
 wedging this to the diff pipeline is the right future direction.

 I have a feeling that log -L should actually be built around
 blame.  You let blame to hit the first parent to take the blame,
 and then turn around to show a single diff-tree between the child
 and the parent with whatever other diff pipeline gizmo the user can
 give you from the command line.  The blame also tells you what the
 interesting line range were at the first parent commit you found,
 so you can re-run the same thing with an updated range.

Hrm, now that you mention it, this is actually a brilliant idea.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8 4/5] Implement line-history search (git log -L)

2013-02-28 Thread Junio C Hamano
Thomas Rast tr...@student.ethz.ch writes:

 Junio C Hamano gits...@pobox.com writes:

 +/*
 + * NEEDSWORK: manually building a diff here is not the Right
 + * Thing(tm).  log -L should be built into the diff pipeline.

 I am not sure about this design, and do not necessarily agree that
 wedging this to the diff pipeline is the right future direction.

 I have a feeling that log -L should actually be built around
 blame.  You let blame to hit the first parent to take the blame,
 and then turn around to show a single diff-tree between the child
 and the parent with whatever other diff pipeline gizmo the user can
 give you from the command line.  The blame also tells you what the
 interesting line range were at the first parent commit you found,
 so you can re-run the same thing with an updated range.

 Hrm, now that you mention it, this is actually a brilliant idea.

I don't know.  That is just me handwaving without giving a serious
end-to-end thought.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8 4/5] Implement line-history search (git log -L)

2013-02-28 Thread Thomas Rast
Junio C Hamano gits...@pobox.com writes:

 Thomas Rast tr...@student.ethz.ch writes:

 Junio C Hamano gits...@pobox.com writes:

 +/*
 + * NEEDSWORK: manually building a diff here is not the Right
 + * Thing(tm).  log -L should be built into the diff pipeline.

 I am not sure about this design, and do not necessarily agree that
 wedging this to the diff pipeline is the right future direction.

 I have a feeling that log -L should actually be built around
 blame.  You let blame to hit the first parent to take the blame,
 and then turn around to show a single diff-tree between the child
 and the parent with whatever other diff pipeline gizmo the user can
 give you from the command line.  The blame also tells you what the
 interesting line range were at the first parent commit you found,
 so you can re-run the same thing with an updated range.

 Hrm, now that you mention it, this is actually a brilliant idea.

 I don't know.  That is just me handwaving without giving a serious
 end-to-end thought.

Having thought about it for some time, I think we need to figure out
if/what can be shared.  I can't shake off the feeling that *something*
should be common between blame and log -L, but I can't exactly say what
so far.

Your suggestion of looking at the first blame hit is almost there.  (And
in fact if it did work, it should be rather easy to prototype from blame
--incremental.)  However, it works only for additions, not removals.
Lines that were removed do not show up in blame at all, and since a
patch can also _only_ remove lines, blame would not find it even if we
adjust the blamed range at every found commit.

It may be possible to fix that by doing a reverse blame, but I suspect
that runs into yet more trouble when trying to reverse-blame from two
different sides of history.


Then there's a different issue if the order of code flips.  Suppose you
have

  A1
  A2
  B1
  B2

and later change to

  B1
  B2
  A1
  A2

Diffs can fundamentally not express this; they'll only see one side as
unchanged, and the other as completely new.  Ideally we'd be able to
track this case correctly -- as blame does? -- no matter which part is
within our tracked range.


Anyway, I believe this should be booked under future improvements.
I've had it in my own tree for ages and it already does the right thing
most of the time :-)

-- 
Thomas Rast
trast@{inf,student}.ethz.ch
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html