Re: [PATCH v2 18/19] tree-diff: rework diff_tree() to generate diffs for multiparent cases as well

2014-04-07 Thread Junio C Hamano
Kirill Smelkov k...@navytux.spb.ru writes:

 The following
 ...
 maybe looks a bit simpler, but calls tree_entry_pathcmp twice more times.

 Besides for important nparent=1 case we were not calling
 tree_entry_pathcmp at all and here we'll call it once, which would slow
 execution down a bit, as base_name_compare shows measurable enough in profile.
 To avoid that we'll need to add 'if (i==imin) continue' and this won't
 be so simple then. And for general nparent case, as I've said, we'll be
 calling tree_entry_pathcmp twice more times...

 Because of all that I'd suggest to go with my original version.

OK.

 ... After some break on the topic,
 with a fresh eye I see a lot of confusion goes from the notation I've
 chosen initially (because of how I was reasoning about it on paper, when
 it was in flux) - i.e. xi for x[imin] and also using i as looping
 variable. And also because xi was already used for x[imin] I've used
 another letter 'k' denoting all other x'es, which leads to confusion...


 I propose we do the following renaming to clarify things:

 A/a -  T/t (to match resulting tree t name in the code)
 X/x -  P/p (to match parents trees tp in the code)
 i   -  imin(so that i would be free for other tasks)

 then the above (with a prologue) would look like

  8 
  *   T P1   Pn
  *   - --
  *  |t|   |p1| |pn|
  *  |-|   |--| ... |--|  imin = argmin(p1...pn)
  *  | |   |  | |  |
  *  |-|   |--| |--|
  *  |.|   |. | |. |
  *   . ..
  *   . ..
  *
  * at any time there could be 3 cases:
  *
  *  1)  t  p[imin];
  *  2)  t  p[imin];
  *  3)  t = p[imin].
  *
  * Schematic deduction of what every case means, and what to do, follows:
  *
  * 1)  t  p[imin]  -  ∀j t ∉ Pj  -  +t ∈ D(T,Pj)  -  D += +t;  t↓
  *
  * 2)  t  p[imin]
  *
  * 2.1) ∃j: pj  p[imin]  -  -p[imin] ∉ D(T,Pj)  -  D += ø;  ∀ 
 pi=p[imin]  pi↓
  * 2.2) ∀i  pi = p[imin]  -  pi ∉ T  -  -pi ∈ D(T,Pi)  -  D += 
 -p[imin];  ∀i pi↓
  *
  * 3)  t = p[imin]
  *
  * 3.1) ∃j: pj  p[imin]  -  +t ∈ D(T,Pj)  -  only pi=p[imin] remains 
 to investigate
  * 3.2) pi = p[imin]  -  investigate δ(t,pi)
  *  |
  *  |
  *  v
  *
  * 3.1+3.2) looking at δ(t,pi) ∀i: pi=p[imin] - if all != ø  -
  *
  *   ⎧δ(t,pi)  - if pi=p[imin]
  *  -  D += ⎨
  *   ⎩+t - if pip[imin]
  *
  *
  * in any case t↓  ∀ pi=p[imin]  pi↓
 ...
 now xk is gone and i matches p[i] (= pi) etc so variable names correlate
 to algorithm description better.

 Does that maybe clarify things?

That sounds more consistent (modulo perhaps s/argmin/min/ at the
beginning?).

 P.S. Sorry for maybe some crept-in mistakes - I've tried to verify it
 thoroughly, but am too sleepy to be completely sure. On the other hand I
 think and hope the patch should be ok.

Thanks and do not be sorry for mistakes---we have the review
process exactly for catching them.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 18/19] tree-diff: rework diff_tree() to generate diffs for multiparent cases as well

2014-04-07 Thread Junio C Hamano
Kirill Smelkov k...@navytux.spb.ru writes:

  +  if (!DIFF_OPT_TST(opt, FIND_COPIES_HARDER)) {
  +  for (i = 0; i  nparent; ++i)
  +  if (tp[i].entry.mode  S_IFXMIN_NEQ)
  +  goto skip_emit_tp;
  +  }
  +
  +  p = emit_path(p, base, opt, nparent,
  +  /*t=*/NULL, tp, imin);
  +
  +  skip_emit_tp:
  +  /* ∀ xk=ximin  xk↓ */
  +  update_tp_entries(tp, nparent);
 
 There are parents whose path sort earlier than what is in 't'
 (i.e. they were lost in the result---we would want to show
 removal).  What makes us jump to the skip label?
 
 We are looking at path in 't', and some parents have paths that
 sort earlier than that path.  We will not go to skip label if
 any one of the parent's entry sorts after some other parent (or
 the parent in question has ran out its entries), which means we
 show the entry from the parents only when all the parents have
 that same path, which is missing from 't'.
 
 I am not sure if I am reading this correctly, though.
 
 For the two-way diff, the above degenerates to show all parent
 entries that come before the first entry in 't', which is correct.
 For the combined diff, the current intersect_paths() makes sure that
 each path appears in all the pair-wise diff between t and tp[],
 which again means that the above logic match the current behaviour.

 Yes, correct (modulo we *will* go to skip label if any one of the
 parent's entry sorts after some other parent). By definition of combined
 diff we show a path only if it shows in every diff D(T,Pi), and if 

 2.1) ∃j: pj  p[imin]  -  -p[imin] ∉ D(T,Pj)  -  D += ø;  ∀ 
 pi=p[imin]  pi↓

 some pj sorts after p[imin] that would mean that Pj does not have
 p[imin] and since t  p[imin] (which means T does not have p[imin]
 either) diff D(T,Pj) does not have p[imin]. And because of that we know
 the whole combined-diff will not have p[imin] as, by definition,
 combined diff is sets intersection and one of the sets does not have
 that path.

   ( In usual words p[imin] is not changed between Pj..T - it was
 e.g. removed in Pj~, so merging parents to T does not bring any new
 information wrt path p[imin] and that is why we do not want to show
 p[imin] in combined-diff output - no new change about that path )

 So nothing to append to the output, and update minimum tree entries,
 preparing for the next step.

That's all in line with the current and traditional definition of
combined diff.

This is a tangent that is outside the scope of this current topic,
but I wonder if you found it disturbing that we treat the result 't'
that has a path and the result 't' that does not have a path with
respect to a parent that does not have the path in a somewhat
assymmetric way.

With a merge M between commits A and B, where they all have the same
path with different contents, we obviously show that path in the
combined diff format.  A merge N that records exactly the same tree
as M that merges the same commits A and B plus another commit C that
does not have that path still shows the combined diff, with one
extra column to express everything in the result N has been added
with respect to C which did not have the path at all.

However, a merge O between the same commits A and B, where A and B
have a path and O loses it, shows the path in the combined format.
A merge P among the same A, B and an extra parent C that does not
have that path ceases to show it (this is the assymmetry).

It is a natural extension of Do not show the path when the result
matches one of the parent rule, and in this case the result P takes
contents, the path does not exist, from one parent C, so it is
internally consistent, and I originally designed it that way on
purpose, but somehow it feels a bit strange.

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 18/19] tree-diff: rework diff_tree() to generate diffs for multiparent cases as well

2014-04-07 Thread Kirill Smelkov
On Mon, Apr 07, 2014 at 10:29:46AM -0700, Junio C Hamano wrote:
 Kirill Smelkov k...@navytux.spb.ru writes:
 
  The following
  ...
  maybe looks a bit simpler, but calls tree_entry_pathcmp twice more times.
 
  Besides for important nparent=1 case we were not calling
  tree_entry_pathcmp at all and here we'll call it once, which would slow
  execution down a bit, as base_name_compare shows measurable enough in 
  profile.
  To avoid that we'll need to add 'if (i==imin) continue' and this won't
  be so simple then. And for general nparent case, as I've said, we'll be
  calling tree_entry_pathcmp twice more times...
 
  Because of all that I'd suggest to go with my original version.
 
 OK.

Thanks.

  ... After some break on the topic,
  with a fresh eye I see a lot of confusion goes from the notation I've
  chosen initially (because of how I was reasoning about it on paper, when
  it was in flux) - i.e. xi for x[imin] and also using i as looping
  variable. And also because xi was already used for x[imin] I've used
  another letter 'k' denoting all other x'es, which leads to confusion...
 
 
  I propose we do the following renaming to clarify things:
 
  A/a -  T/t (to match resulting tree t name in the code)
  X/x -  P/p (to match parents trees tp in the code)
  i   -  imin(so that i would be free for other tasks)
 
  then the above (with a prologue) would look like
 
   8 
   *   T P1   Pn
   *   - --
   *  |t|   |p1| |pn|
   *  |-|   |--| ... |--|  imin = argmin(p1...pn)
   *  | |   |  | |  |
   *  |-|   |--| |--|
   *  |.|   |. | |. |
   *   . ..
   *   . ..
   *
   * at any time there could be 3 cases:
   *
   *  1)  t  p[imin];
   *  2)  t  p[imin];
   *  3)  t = p[imin].
   *
   * Schematic deduction of what every case means, and what to do, follows:
   *
   * 1)  t  p[imin]  -  ∀j t ∉ Pj  -  +t ∈ D(T,Pj)  -  D += +t;  t↓
   *
   * 2)  t  p[imin]
   *
   * 2.1) ∃j: pj  p[imin]  -  -p[imin] ∉ D(T,Pj)  -  D += ø;  ∀ 
  pi=p[imin]  pi↓
   * 2.2) ∀i  pi = p[imin]  -  pi ∉ T  -  -pi ∈ D(T,Pi)  -  D += 
  -p[imin];  ∀i pi↓
   *
   * 3)  t = p[imin]
   *
   * 3.1) ∃j: pj  p[imin]  -  +t ∈ D(T,Pj)  -  only pi=p[imin] 
  remains to investigate
   * 3.2) pi = p[imin]  -  investigate δ(t,pi)
   *  |
   *  |
   *  v
   *
   * 3.1+3.2) looking at δ(t,pi) ∀i: pi=p[imin] - if all != ø  -
   *
   *   ⎧δ(t,pi)  - if pi=p[imin]
   *  -  D += ⎨
   *   ⎩+t - if pip[imin]
   *
   *
   * in any case t↓  ∀ pi=p[imin]  pi↓
  ...
  now xk is gone and i matches p[i] (= pi) etc so variable names correlate
  to algorithm description better.
 
  Does that maybe clarify things?
 
 That sounds more consistent (modulo perhaps s/argmin/min/ at the
 beginning?).

Thanks. argmin is there on purpose - min(p1...pn) is the minimal p, and
argmin(p1...pn) is imin such that p[imin] is minimal. As we are finding
the index of the minimal element we should use argmin.


  P.S. Sorry for maybe some crept-in mistakes - I've tried to verify it
  thoroughly, but am too sleepy to be completely sure. On the other hand I
  think and hope the patch should be ok.
 
 Thanks and do not be sorry for mistakes---we have the review
 process exactly for catching them.

Thanks, I appreciate that.


On Mon, Apr 07, 2014 at 11:07:47AM -0700, Junio C Hamano wrote:
 Kirill Smelkov k...@navytux.spb.ru writes:
 
   +if (!DIFF_OPT_TST(opt, FIND_COPIES_HARDER)) {
   +for (i = 0; i  nparent; ++i)
   +if (tp[i].entry.mode  
   S_IFXMIN_NEQ)
   +goto skip_emit_tp;
   +}
   +
   +p = emit_path(p, base, opt, nparent,
   +/*t=*/NULL, tp, imin);
   +
   +skip_emit_tp:
   +/* ∀ xk=ximin  xk↓ */
   +update_tp_entries(tp, nparent);
  
  There are parents whose path sort earlier than what is in 't'
  (i.e. they were lost in the result---we would want to show
  removal).  What makes us jump to the skip label?
  
  We are looking at path in 't', and some parents have paths that
  sort earlier than that path.  We will not go to skip label if
  any one of the parent's entry sorts after some other parent (or
  the parent in question has ran out its entries), which means we
  show the entry from the parents only when all the parents have
  that same path, which is missing from 't'.
  
  I am not sure if I am reading this correctly, though.
  
  For the two-way diff, the above degenerates to show all parent
  entries that come before the first entry in 't', which is correct.
  For the combined diff, the current 

Re: [PATCH v2 18/19] tree-diff: rework diff_tree() to generate diffs for multiparent cases as well

2014-04-06 Thread Kirill Smelkov
Junio,

First of all thanks a lot for reviewing this patch. I'll reply inline
with corrected version attached in the end.

On Fri, Apr 04, 2014 at 11:42:39AM -0700, Junio C Hamano wrote:
 Kirill Smelkov k...@navytux.spb.ru writes:
 
  +extern
  +struct combine_diff_path *diff_tree_paths(
 
 These two on the same line, please.

Ok

  +   struct combine_diff_path *p, const unsigned char *sha1,
  +   const unsigned char **parent_sha1, int nparent,
  +   struct strbuf *base, struct diff_options *opt);
   extern int diff_tree_sha1(const unsigned char *old, const unsigned char 
  *new,
const char *base, struct diff_options *opt);
  ...
  +/*
  + * convert path - opt-diff_*() callbacks
  + *
  + * emits diff to first parent only, and tells diff tree-walker that we are 
  done
  + * with p and it can be freed.
  + */
  +static int emit_diff_first_parent_only(struct diff_options *opt, struct 
  combine_diff_path *p)
   {
 
 Very straight-forward; good.

Thanks

  +static struct combine_diff_path *path_appendnew(struct combine_diff_path 
  *last,
  +   int nparent, const struct strbuf *base, const char *path, int pathlen,
  +   unsigned mode, const unsigned char *sha1)
  +{
  +   struct combine_diff_path *p;
  +   int len = base-len + pathlen;
  +   int alloclen = combine_diff_path_size(nparent, len);
  +
  +   /* if last-next is !NULL - it is a pre-allocated memory, we can reuse 
  */
  +   p = last-next;
  +   if (p  (alloclen  (intptr_t)p-next)) {
  +   free(p);
  +   p = NULL;
  +   }
  +
  +   if (!p) {
  +   p = xmalloc(alloclen);
  +
  +   /*
  +* until we go to it next round, .next holds how many bytes we
  +* allocated (for faster realloc - we don't need copying old 
  data).
  +*/
  +   p-next = (struct combine_diff_path *)(intptr_t)alloclen;
 
 This reuse of the .next field is somewhat yucky, but it is very
 localized inside a function that has a single callsite to this
 function, so let's let it pass.

I agree it is not pretty, but it was the best approach I could find
for avoiding memory re-allocation without introducing new fields into
`struct combine_diff_path`. And yes, the trick is localized, so let's
let it live.


  +static struct combine_diff_path *emit_path(struct combine_diff_path *p,
  +   struct strbuf *base, struct diff_options *opt, int nparent,
  +   struct tree_desc *t, struct tree_desc *tp,
  +   int imin)
   {
 
 Again, fairly straight-forward and good.

Thanks again.


  +/*
  + * generate paths for combined diff D(sha1,parents_sha1[])
  + ...
  +static struct combine_diff_path *ll_diff_tree_paths(
  +   struct combine_diff_path *p, const unsigned char *sha1,
  +   const unsigned char **parents_sha1, int nparent,
  +   struct strbuf *base, struct diff_options *opt)
  +{
  +   struct tree_desc t, *tp;
  +   void *ttree, **tptree;
  +   int i;
  +
  +   tp = xalloca(nparent * sizeof(tp[0]));
  +   tptree = xalloca(nparent * sizeof(tptree[0]));
  +
  +   /*
  +* load parents first, as they are probably already cached.
  +*
  +* ( log_tree_diff() parses commit-parent before calling here via
  +*   diff_tree_sha1(parent, commit) )
  +*/
  +   for (i = 0; i  nparent; ++i)
  +   tptree[i] = fill_tree_descriptor(tp[i], parents_sha1[i]);
  +   ttree = fill_tree_descriptor(t, sha1);
   
  /* Enable recursion indefinitely */
  opt-pathspec.recursive = DIFF_OPT_TST(opt, RECURSIVE);
   
  for (;;) {
  -   int cmp;
  +   int imin, cmp;
   
  if (diff_can_quit_early(opt))
  break;
  +
  if (opt-pathspec.nr) {
  -   skip_uninteresting(t1, base, opt);
  -   skip_uninteresting(t2, base, opt);
  +   skip_uninteresting(t, base, opt);
  +   for (i = 0; i  nparent; i++)
  +   skip_uninteresting(tp[i], base, opt);
  }
  -   if (!t1.size  !t2.size)
  -   break;
   
  -   cmp = tree_entry_pathcmp(t1, t2);
  +   /* comparing is finished when all trees are done */
  +   if (!t.size) {
  +   int done = 1;
  +   for (i = 0; i  nparent; ++i)
  +   if (tp[i].size) {
  +   done = 0;
  +   break;
  +   }
  +   if (done)
  +   break;
  +   }
  +
  +   /*
  +* lookup imin = argmin(x1...xn),
  +* mark entries whether they =tp[imin] along the way
  +*/
  +   imin = 0;
  +   tp[0].entry.mode = ~S_IFXMIN_NEQ;
  +
  +   for (i = 1; i  nparent; ++i) {
  +   cmp = tree_entry_pathcmp(tp[i], tp[imin]);
  +   if (cmp  0) {
  +   imin = i;
  +   

Re: [PATCH v2 18/19] tree-diff: rework diff_tree() to generate diffs for multiparent cases as well

2014-04-04 Thread Junio C Hamano
Kirill Smelkov k...@navytux.spb.ru writes:

 +extern
 +struct combine_diff_path *diff_tree_paths(

These two on the same line, please.

 + struct combine_diff_path *p, const unsigned char *sha1,
 + const unsigned char **parent_sha1, int nparent,
 + struct strbuf *base, struct diff_options *opt);
  extern int diff_tree_sha1(const unsigned char *old, const unsigned char *new,
 const char *base, struct diff_options *opt);
 ...
 +/*
 + * convert path - opt-diff_*() callbacks
 + *
 + * emits diff to first parent only, and tells diff tree-walker that we are 
 done
 + * with p and it can be freed.
 + */
 +static int emit_diff_first_parent_only(struct diff_options *opt, struct 
 combine_diff_path *p)
  {

Very straight-forward; good.

 +static struct combine_diff_path *path_appendnew(struct combine_diff_path 
 *last,
 + int nparent, const struct strbuf *base, const char *path, int pathlen,
 + unsigned mode, const unsigned char *sha1)
 +{
 + struct combine_diff_path *p;
 + int len = base-len + pathlen;
 + int alloclen = combine_diff_path_size(nparent, len);
 +
 + /* if last-next is !NULL - it is a pre-allocated memory, we can reuse 
 */
 + p = last-next;
 + if (p  (alloclen  (intptr_t)p-next)) {
 + free(p);
 + p = NULL;
 + }
 +
 + if (!p) {
 + p = xmalloc(alloclen);
 +
 + /*
 +  * until we go to it next round, .next holds how many bytes we
 +  * allocated (for faster realloc - we don't need copying old 
 data).
 +  */
 + p-next = (struct combine_diff_path *)(intptr_t)alloclen;

This reuse of the .next field is somewhat yucky, but it is very
localized inside a function that has a single callsite to this
function, so let's let it pass.

 +static struct combine_diff_path *emit_path(struct combine_diff_path *p,
 + struct strbuf *base, struct diff_options *opt, int nparent,
 + struct tree_desc *t, struct tree_desc *tp,
 + int imin)
  {

Again, fairly straight-forward and good.

 +/*
 + * generate paths for combined diff D(sha1,parents_sha1[])
 + ...
 +static struct combine_diff_path *ll_diff_tree_paths(
 + struct combine_diff_path *p, const unsigned char *sha1,
 + const unsigned char **parents_sha1, int nparent,
 + struct strbuf *base, struct diff_options *opt)
 +{
 + struct tree_desc t, *tp;
 + void *ttree, **tptree;
 + int i;
 +
 + tp = xalloca(nparent * sizeof(tp[0]));
 + tptree = xalloca(nparent * sizeof(tptree[0]));
 +
 + /*
 +  * load parents first, as they are probably already cached.
 +  *
 +  * ( log_tree_diff() parses commit-parent before calling here via
 +  *   diff_tree_sha1(parent, commit) )
 +  */
 + for (i = 0; i  nparent; ++i)
 + tptree[i] = fill_tree_descriptor(tp[i], parents_sha1[i]);
 + ttree = fill_tree_descriptor(t, sha1);
  
   /* Enable recursion indefinitely */
   opt-pathspec.recursive = DIFF_OPT_TST(opt, RECURSIVE);
  
   for (;;) {
 - int cmp;
 + int imin, cmp;
  
   if (diff_can_quit_early(opt))
   break;
 +
   if (opt-pathspec.nr) {
 - skip_uninteresting(t1, base, opt);
 - skip_uninteresting(t2, base, opt);
 + skip_uninteresting(t, base, opt);
 + for (i = 0; i  nparent; i++)
 + skip_uninteresting(tp[i], base, opt);
   }
 - if (!t1.size  !t2.size)
 - break;
  
 - cmp = tree_entry_pathcmp(t1, t2);
 + /* comparing is finished when all trees are done */
 + if (!t.size) {
 + int done = 1;
 + for (i = 0; i  nparent; ++i)
 + if (tp[i].size) {
 + done = 0;
 + break;
 + }
 + if (done)
 + break;
 + }
 +
 + /*
 +  * lookup imin = argmin(x1...xn),
 +  * mark entries whether they =tp[imin] along the way
 +  */
 + imin = 0;
 + tp[0].entry.mode = ~S_IFXMIN_NEQ;
 +
 + for (i = 1; i  nparent; ++i) {
 + cmp = tree_entry_pathcmp(tp[i], tp[imin]);
 + if (cmp  0) {
 + imin = i;
 + tp[i].entry.mode = ~S_IFXMIN_NEQ;
 + }
 + else if (cmp == 0) {
 + tp[i].entry.mode = ~S_IFXMIN_NEQ;
 + }
 + else {
 + tp[i].entry.mode |= S_IFXMIN_NEQ;
 + }
 + }
 +
 + /* fixup markings for entries before imin */
 + for (i 

Re: [PATCH v2 18/19] tree-diff: rework diff_tree() to generate diffs for multiparent cases as well

2014-03-27 Thread Kirill Smelkov
On Mon, Feb 24, 2014 at 08:21:50PM +0400, Kirill Smelkov wrote:
[...]
 not changed:
 
 - low-level helpers are still named with __ prefix as, imho, that is the 
 best
   convention to name such helpers, without sacrificing signal/noise ratio. All
   of them are now static though.

Please find attached corrected version of this patch with
__diff_tree_sha1() renamed to ll_diff_tree_sha1() and other identifiers
corrected similarly for consistency with Git codebase style.

Thanks,
Kirill

(please keep author email)
 8 
From: Kirill Smelkov k...@mns.spb.ru
Date: Mon, 24 Feb 2014 20:21:50 +0400
Subject: [PATCH v3] tree-diff: rework diff_tree() to generate diffs for 
multiparent cases as well

Previously diff_tree(), which is now named ll_diff_tree_sha1(), was
generating diff_filepair(s) for two trees t1 and t2, and that was
usually used for a commit as t1=HEAD~, and t2=HEAD - i.e. to see changes
a commit introduces.

In Git, however, we have fundamentally built flexibility in that a
commit can have many parents - 1 for a plain commit, 2 for a simple merge,
but also more than 2 for merging several heads at once.

For merges there is a so called combine-diff, which shows diff, a merge
introduces by itself, omitting changes done by any parent. That works
through first finding paths, that are different to all parents, and then
showing generalized diff, with separate columns for +/- for each parent.
The code lives in combine-diff.c .

There is an impedance mismatch, however, in that a commit could
generally have any number of parents, and that while diffing trees, we
divide cases for 2-tree diffs and more-than-2-tree diffs. I mean there
is no special casing for multiple parents commits in e.g.
revision-walker .

That impedance mismatch *hurts* *performance* *badly* for generating
combined diffs - in combine-diff: optimize combine_diff_path
sets intersection I've already removed some slowness from it, but from
the timings provided there, it could be seen, that combined diffs still
cost more than an order of magnitude more cpu time, compared to diff for
usual commits, and that would only be an optimistic estimate, if we take
into account that for e.g. linux.git there is only one merge for several
dozens of plain commits.

That slowness comes from the fact that currently, while generating
combined diff, a lot of time is spent computing diff(commit,commit^2)
just to only then intersect that huge diff to almost small set of files
from diff(commit,commit^1).

That's because at present, to compute combine-diff, for first finding
paths, that every parent touches, we use the following combine-diff
property/definition:

D(A,P1...Pn) = D(A,P1) ^ ... ^ D(A,Pn)  (w.r.t. paths)

where

D(A,P1...Pn) is combined diff between commit A, and parents Pi

and

D(A,Pi) is usual two-tree diff Pi..A

So if any of that D(A,Pi) is huge, tracting 1 n-parent combine-diff as n
1-parent diffs and intersecting results will be slow.

And usually, for linux.git and other topic-based workflows, that
D(A,P2) is huge, because, if merge-base of A and P2, is several dozens
of merges (from A, via first parent) below, that D(A,P2) will be diffing
sum of merges from several subsystems to 1 subsystem.

The solution is to avoid computing n 1-parent diffs, and to find
changed-to-all-parents paths via scanning A's and all Pi's trees
simultaneously, at each step comparing their entries, and based on that
comparison, populate paths result, and deduce we could *skip*
*recursing* into subdirectories, if at least for 1 parent, sha1 of that
dir tree is the same as in A. That would save us from doing significant
amount of needless work.

Such approach is very similar to what diff_tree() does, only there we
deal with scanning only 2 trees simultaneously, and for n+1 tree, the
logic is a bit more complex:

D(A,X1...Xn) calculation scheme
---

D(A,X1...Xn) = D(A,X1) ^ ... ^ D(A,Xn)   (regarding resulting paths set)

 D(A,Xj) - diff between A..Xj
 D(A,X1...Xn)- combined diff from A to parents X1,...,Xn

We start from all trees, which are sorted, and compare their entries in
lock-step:

  A X1   Xn
  - --
 |a|   |x1| |xn|
 |-|   |--| ... |--|  i = argmin(x1...xn)
 | |   |  | |  |
 |-|   |--| |--|
 |.|   |. | |. |
  . ..
  . ..

at any time there could be 3 cases:

 1)  a  xi;
 2)  a  xi;
 3)  a = xi.

Schematic deduction of what every case means, and what to do, follows:

1)  a  xi  -  ∀j a ∉ Xj  -  +a ∈ D(A,Xj)  -  D += +a;  a↓

2)  a  xi

2.1) ∃j: xj  xi  -  -xi ∉ D(A,Xj)  -  D += ø;  ∀ xk=xi  xk↓
2.2) ∀j  xj = xi  -  xj ∉ A  -  -xj ∈ D(A,Xj)  -  D += -xi;  ∀j xj↓

3)  a = xi

3.1) ∃j: xj  xi  -  +a ∈ D(A,Xj)  -  only xk=xi remains to investigate
3.2) xj = xi  -  investigate δ(a,xj)
 |
 |
 v

3.1+3.2) looking at δ(a,xk) ∀k: xk=xi - if all != ø  -

   

[PATCH v2 18/19] tree-diff: rework diff_tree() to generate diffs for multiparent cases as well

2014-02-24 Thread Kirill Smelkov
Previously diff_tree(), which is now named __diff_tree_sha1(), was
generating diff_filepair(s) for two trees t1 and t2, and that was
usually used for a commit as t1=HEAD~, and t2=HEAD - i.e. to see changes
a commit introduces.

In Git, however, we have fundamentally built flexibility in that a
commit can have many parents - 1 for a plain commit, 2 for a simple merge,
but also more than 2 for merging several heads at once.

For merges there is a so called combine-diff, which shows diff, a merge
introduces by itself, omitting changes done by any parent. That works
through first finding paths, that are different to all parents, and then
showing generalized diff, with separate columns for +/- for each parent.
The code lives in combine-diff.c .

There is an impedance mismatch, however, in that a commit could
generally have any number of parents, and that while diffing trees, we
divide cases for 2-tree diffs and more-than-2-tree diffs. I mean there
is no special casing for multiple parents commits in e.g.
revision-walker .

That impedance mismatch *hurts* *performance* *badly* for generating
combined diffs - in combine-diff: optimize combine_diff_path
sets intersection I've already removed some slowness from it, but from
the timings provided there, it could be seen, that combined diffs still
cost more than an order of magnitude more cpu time, compared to diff for
usual commits, and that would only be an optimistic estimate, if we take
into account that for e.g. linux.git there is only one merge for several
dozens of plain commits.

That slowness comes from the fact that currently, while generating
combined diff, a lot of time is spent computing diff(commit,commit^2)
just to only then intersect that huge diff to almost small set of files
from diff(commit,commit^1).

That's because at present, to compute combine-diff, for first finding
paths, that every parent touches, we use the following combine-diff
property/definition:

D(A,P1...Pn) = D(A,P1) ^ ... ^ D(A,Pn)  (w.r.t. paths)

where

D(A,P1...Pn) is combined diff between commit A, and parents Pi

and

D(A,Pi) is usual two-tree diff Pi..A

So if any of that D(A,Pi) is huge, tracting 1 n-parent combine-diff as n
1-parent diffs and intersecting results will be slow.

And usually, for linux.git and other topic-based workflows, that
D(A,P2) is huge, because, if merge-base of A and P2, is several dozens
of merges (from A, via first parent) below, that D(A,P2) will be diffing
sum of merges from several subsystems to 1 subsystem.

The solution is to avoid computing n 1-parent diffs, and to find
changed-to-all-parents paths via scanning A's and all Pi's trees
simultaneously, at each step comparing their entries, and based on that
comparison, populate paths result, and deduce we could *skip*
*recursing* into subdirectories, if at least for 1 parent, sha1 of that
dir tree is the same as in A. That would save us from doing significant
amount of needless work.

Such approach is very similar to what diff_tree() does, only there we
deal with scanning only 2 trees simultaneously, and for n+1 tree, the
logic is a bit more complex:

D(A,X1...Xn) calculation scheme
---

D(A,X1...Xn) = D(A,X1) ^ ... ^ D(A,Xn)   (regarding resulting paths set)

 D(A,Xj) - diff between A..Xj
 D(A,X1...Xn)- combined diff from A to parents X1,...,Xn

We start from all trees, which are sorted, and compare their entries in
lock-step:

  A X1   Xn
  - --
 |a|   |x1| |xn|
 |-|   |--| ... |--|  i = argmin(x1...xn)
 | |   |  | |  |
 |-|   |--| |--|
 |.|   |. | |. |
  . ..
  . ..

at any time there could be 3 cases:

 1)  a  xi;
 2)  a  xi;
 3)  a = xi.

Schematic deduction of what every case means, and what to do, follows:

1)  a  xi  -  ∀j a ∉ Xj  -  +a ∈ D(A,Xj)  -  D += +a;  a↓

2)  a  xi

2.1) ∃j: xj  xi  -  -xi ∉ D(A,Xj)  -  D += ø;  ∀ xk=xi  xk↓
2.2) ∀j  xj = xi  -  xj ∉ A  -  -xj ∈ D(A,Xj)  -  D += -xi;  ∀j xj↓

3)  a = xi

3.1) ∃j: xj  xi  -  +a ∈ D(A,Xj)  -  only xk=xi remains to investigate
3.2) xj = xi  -  investigate δ(a,xj)
 |
 |
 v

3.1+3.2) looking at δ(a,xk) ∀k: xk=xi - if all != ø  -

  ⎧δ(a,xk)  - if xk=xi
 -  D += ⎨
  ⎩+a - if xkxi

in any case a↓  ∀ xk=xi  xk↓

~

For comparison, here is how diff_tree() works:

D(A,B) calculation scheme
-

A B
- -
   |a|   |b|a  b   -  a ∉ B   -   D(A,B) +=  +aa↓
   |-|   |-|a  b   -  b ∉ A   -   D(A,B) +=  -bb↓
   | |   | |a = b   -  investigate δ(a,b)a↓ b↓
   |-|   |-|
   |.|   |.|
. .
. .



This patch generalizes diff tree-walker to work with arbitrary number of
parents as described above - i.e. now there is a resulting tree t, and
some parents trees tp[i] i=[0..nparent). The generalization builds on