Re: [PATCH 7/8] combine-diff: Fast changed-to-all-parents paths scanning

2014-02-06 Thread Kirill Smelkov
On Wed, Feb 05, 2014 at 02:58:36PM -0800, Junio C Hamano wrote:
 Kirill Smelkov k...@navytux.spb.ru writes:
 
  On Wed, Feb 05, 2014 at 11:42:41AM -0800, Junio C Hamano wrote:
  Kirill Smelkov k...@navytux.spb.ru writes:
  
   I agree object data should be immutable for good. The only thing I'm 
   talking
   about here is mode, which is parsed from a tree buffer and is stored in
   separate field:
  
  Ah, I do not see any problem in that case, then.
  
  Thanks.
 
  Thanks, that simplifies things for me.

Below is a patch which does it. Please apply, if it is ok.


 Surely.
 
 Be careful when traversing N-trees in parallel---you may have to
 watch out for the entry ordering rules that sorts the following
 paths in the order shown:
 
   a
   a-b
 a/b
 a_b
 
 Inside a single tree, you cannot have 'a' and 'a/b' at the same
 time, but one tree may have 'a' (without 'a/b') while another one
 may have 'a/b' (without 'a'), and walking them in parallel has
 historically been one source of funny bugs.

Thanks for the warning. I'm relying here on base_name_compare() and
ordering it imposes on entries and how it differentiates directories and
files, so let's hope this should be ok.

Actual reworking is still in flux though...

Thanks,
Kirill


 8 
From: Kirill Smelkov k...@mns.spb.ru
Date: Thu, 6 Feb 2014 15:36:31 +0400
Subject: [PATCH] Finally switch over tree descriptors to contain a pre-parsed 
entry

This continues 4651ece8 (Switch over tree descriptors to contain a
pre-parsed entry) and moves the only rest computational part

mode = canon_mode(mode)

from tree_entry_extract() to tree entry decode phase - to
decode_tree_entry().

The reason to do it, is that canon_mode() is at least 2 conditional
jumps for regular files, and that could be noticeable should canon_mode()
be invoked several times.

That does not matter for current Git codebase, where typical tree
traversal is

while (t-size) {
sha1 = tree_entry_extract(t, path, mode);
...
update_tree_entry(t);
}

i.e. we do t - sha1,path.mode extraction only once per entry. In such
cases, it does not matter performance-wise, where that mode
canonicalization is done - either once in tree_entry_extract(), or once
in decode_tree_entry() called by update_tree_entry() - it is
approximately the same.

But for future code, which could need to work with several tree_desc's
in parallel, it could be handy to operate on tree_desc descriptors, and
do extracts only when needed, or at all, access only relevant part of
it through structure fields directly.

And for such situations, having canon_mode() be done once in decode
phase is better - we won't need to pay the performance price of 2 extra
conditional jumps on every t-mode access.

So let's move mode canonicalization to decode_tree_entry(). That was the
final bit. Now after tree entry is decoded, it is fully ready and could
be accessed either directly via field, or through tree_entry_extract()
which this time got really totally trivial.

Signed-off-by: Kirill Smelkov k...@mns.spb.ru
---
 tree-walk.c | 2 +-
 tree-walk.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/tree-walk.c b/tree-walk.c
index 79dba1d..4dc86c7 100644
--- a/tree-walk.c
+++ b/tree-walk.c
@@ -37,7 +37,7 @@ static void decode_tree_entry(struct tree_desc *desc, const 
char *buf, unsigned
 
/* Initialize the descriptor entry */
desc-entry.path = path;
-   desc-entry.mode = mode;
+   desc-entry.mode = canon_mode(mode);
desc-entry.sha1 = (const unsigned char *)(path + len);
 }
 
diff --git a/tree-walk.h b/tree-walk.h
index ae04b64..ae7fb3a 100644
--- a/tree-walk.h
+++ b/tree-walk.h
@@ -16,7 +16,7 @@ struct tree_desc {
 static inline const unsigned char *tree_entry_extract(struct tree_desc *desc, 
const char **pathp, unsigned int *modep)
 {
*pathp = desc-entry.path;
-   *modep = canon_mode(desc-entry.mode);
+   *modep = desc-entry.mode;
return desc-entry.sha1;
 }
 
-- 
1.9.rc1.181.g641f458

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 7/8] combine-diff: Fast changed-to-all-parents paths scanning

2014-02-05 Thread Kirill Smelkov
On Tue, Feb 04, 2014 at 10:37:24AM -0800, Junio C Hamano wrote:
 Kirill Smelkov k...@mns.spb.ru writes:
 
  if we did not want to reinvent the whole tree walking thing, which
  would add risks for new bugs and burden to maintain this and the
  usual two-tree diff tree walking in sync.
 
  Junio, I understand it is not good to duplicate code and increase
  maintenance risks
  Instead I propose we switch to the new tree walker completely.
 
 I am not fundamentally opposed to the idea. We'll see what happens.

Thanks. Locally, with draft patches with generalized tree-walker, for
`git log --raw --no-abbrev --no-renames` I was able to get to as fast as
with old two-tree walker for navy.git and 1% faster for linux.git and
all tests pass.

Only, before I clean things up, I'd like to ask - would the following
patch be accepted

 8 ---
diff --git a/tree-walk.c b/tree-walk.c
index 79dba1d..4dc86c7 100644
--- a/tree-walk.c
+++ b/tree-walk.c
@@ -37,7 +37,7 @@ static void decode_tree_entry(struct tree_desc *desc, const 
char *buf, unsigned
 
/* Initialize the descriptor entry */
desc-entry.path = path;
-   desc-entry.mode = mode;
+   desc-entry.mode = canon_mode(mode);
desc-entry.sha1 = (const unsigned char *)(path + len);
 }
 
diff --git a/tree-walk.h b/tree-walk.h
index ae04b64..ae7fb3a 100644
--- a/tree-walk.h
+++ b/tree-walk.h
@@ -16,7 +16,7 @@ struct tree_desc {
 static inline const unsigned char *tree_entry_extract(struct tree_desc *desc, 
const char **pathp, unsigned int *modep)
 {
*pathp = desc-entry.path;
-   *modep = canon_mode(desc-entry.mode);
+   *modep = desc-entry.mode;
return desc-entry.sha1;
 }
 8 ---
 
?

The reason I ask is because it is more handy to work with tree_desc
structures, compared to splitting it to sha1/path/mode/pathlen, and
if canon_mode() is done only once, clients don't have to pay the
performance price of canon_mode on each tree_desc extract.

I understand there is fsck, which on invalid tree entry prints the mode,
but maybe somehow fsck should be special, and common interface be tuned
to usual clients?

Thanks,
Kirill
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 7/8] combine-diff: Fast changed-to-all-parents paths scanning

2014-02-05 Thread Junio C Hamano
Kirill Smelkov k...@mns.spb.ru writes:

 Only, before I clean things up, I'd like to ask - would the following
 patch be accepted

  8 ---
 diff --git a/tree-walk.c b/tree-walk.c
 index 79dba1d..4dc86c7 100644
 --- a/tree-walk.c
 +++ b/tree-walk.c
 @@ -37,7 +37,7 @@ static void decode_tree_entry(struct tree_desc *desc, const 
 char *buf, unsigned
  
   /* Initialize the descriptor entry */
   desc-entry.path = path;
 - desc-entry.mode = mode;
 + desc-entry.mode = canon_mode(mode);
   desc-entry.sha1 = (const unsigned char *)(path + len);
  }
  
 diff --git a/tree-walk.h b/tree-walk.h
 index ae04b64..ae7fb3a 100644
 --- a/tree-walk.h
 +++ b/tree-walk.h
 @@ -16,7 +16,7 @@ struct tree_desc {
  static inline const unsigned char *tree_entry_extract(struct tree_desc 
 *desc, const char **pathp, unsigned int *modep)
  {
   *pathp = desc-entry.path;
 - *modep = canon_mode(desc-entry.mode);
 + *modep = desc-entry.mode;
   return desc-entry.sha1;
  }
  8 ---
  
 ?

Doesn't desc point into and walks over the data we read from the
tree object directly?

We try to keep (tree|commit)-buffer intact and that is done pretty
deliberately.  While you are walking a tree or parsing a commit,
somebody else, perhaps called indirectly by a helper function you
call, may call lookup_object() for the same object, get the copy
that has already been read and start using it.  This kind of change
will introduce bugs that are hard to debug unless it is done very
carefully (e.g. starting from making tree.buffer into a const pointer
and propagating constness throughout the system), which might not be
worth the pain.


--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 7/8] combine-diff: Fast changed-to-all-parents paths scanning

2014-02-05 Thread Kirill Smelkov
On Wed, Feb 05, 2014 at 09:36:55AM -0800, Junio C Hamano wrote:
 Kirill Smelkov k...@mns.spb.ru writes:
 
  Only, before I clean things up, I'd like to ask - would the following
  patch be accepted
 
   8 ---
  diff --git a/tree-walk.c b/tree-walk.c
  index 79dba1d..4dc86c7 100644
  --- a/tree-walk.c
  +++ b/tree-walk.c
  @@ -37,7 +37,7 @@ static void decode_tree_entry(struct tree_desc *desc, 
  const char *buf, unsigned
   
  /* Initialize the descriptor entry */
  desc-entry.path = path;
  -   desc-entry.mode = mode;
  +   desc-entry.mode = canon_mode(mode);
  desc-entry.sha1 = (const unsigned char *)(path + len);
   }
   
  diff --git a/tree-walk.h b/tree-walk.h
  index ae04b64..ae7fb3a 100644
  --- a/tree-walk.h
  +++ b/tree-walk.h
  @@ -16,7 +16,7 @@ struct tree_desc {
   static inline const unsigned char *tree_entry_extract(struct tree_desc 
  *desc, const char **pathp, unsigned int *modep)
   {
  *pathp = desc-entry.path;
  -   *modep = canon_mode(desc-entry.mode);
  +   *modep = desc-entry.mode;
  return desc-entry.sha1;
   }
   8 ---
   
  ?
 
 Doesn't desc point into and walks over the data we read from the
 tree object directly?
 
 We try to keep (tree|commit)-buffer intact and that is done pretty
 deliberately.  While you are walking a tree or parsing a commit,
 somebody else, perhaps called indirectly by a helper function you
 call, may call lookup_object() for the same object, get the copy
 that has already been read and start using it.  This kind of change
 will introduce bugs that are hard to debug unless it is done very
 carefully (e.g. starting from making tree.buffer into a const pointer
 and propagating constness throughout the system), which might not be
 worth the pain.

I agree object data should be immutable for good. The only thing I'm talking
about here is mode, which is parsed from a tree buffer and is stored in
separate field:

 8  tree-walk.h
struct name_entry {
const unsigned char *sha1;
const char *path;
unsigned int mode;  ---
};

struct tree_desc {
const void *buffer;
struct name_entry entry;
unsigned int size;
};

 8  tree-walk.c
const char *get_mode(const char *str, unsigned int *modep)
{ ... } /* parses symbolic mode from tree buffer to uint */

void decode_tree_entry(struct tree_desc *desc, const char *buf, 
unsigned long size)
{
const char *path;
unsigned int mode, len;

...
path = get_mode(buf, mode);

/* Initialize the descriptor entry */
desc-entry.path = path;
desc-entry.mode = mode;---
desc-entry.sha1 = (const unsigned char *)(path + len);


so we are not talking about modifying object buffers here. All I'm
asking is about canonicalizing _already_ parsed and copied mode on the
fly.

Does that change anything?


Sorry, if I'm maybe misunderstanding something, and thanks,
Kirill
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 7/8] combine-diff: Fast changed-to-all-parents paths scanning

2014-02-05 Thread Junio C Hamano
Kirill Smelkov k...@navytux.spb.ru writes:

 I agree object data should be immutable for good. The only thing I'm talking
 about here is mode, which is parsed from a tree buffer and is stored in
 separate field:

Ah, I do not see any problem in that case, then.

Thanks.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 7/8] combine-diff: Fast changed-to-all-parents paths scanning

2014-02-05 Thread Kirill Smelkov
On Wed, Feb 05, 2014 at 11:42:41AM -0800, Junio C Hamano wrote:
 Kirill Smelkov k...@navytux.spb.ru writes:
 
  I agree object data should be immutable for good. The only thing I'm talking
  about here is mode, which is parsed from a tree buffer and is stored in
  separate field:
 
 Ah, I do not see any problem in that case, then.
 
 Thanks.

Thanks, that simplifies things for me.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 7/8] combine-diff: Fast changed-to-all-parents paths scanning

2014-02-05 Thread Junio C Hamano
Kirill Smelkov k...@navytux.spb.ru writes:

 On Wed, Feb 05, 2014 at 11:42:41AM -0800, Junio C Hamano wrote:
 Kirill Smelkov k...@navytux.spb.ru writes:
 
  I agree object data should be immutable for good. The only thing I'm 
  talking
  about here is mode, which is parsed from a tree buffer and is stored in
  separate field:
 
 Ah, I do not see any problem in that case, then.
 
 Thanks.

 Thanks, that simplifies things for me.

Surely.

Be careful when traversing N-trees in parallel---you may have to
watch out for the entry ordering rules that sorts the following
paths in the order shown:

a
a-b
a/b
a_b

Inside a single tree, you cannot have 'a' and 'a/b' at the same
time, but one tree may have 'a' (without 'a/b') while another one
may have 'a/b' (without 'a'), and walking them in parallel has
historically been one source of funny bugs.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 7/8] combine-diff: Fast changed-to-all-parents paths scanning

2014-02-04 Thread Kirill Smelkov
On Mon, Feb 03, 2014 at 03:39:02PM -0800, Junio C Hamano wrote:
 Junio C Hamano gits...@pobox.com writes:
 
  Kirill Smelkov k...@mns.spb.ru writes:
 
  As was recently shown (c839f1bd combine-diff: optimize
  combine_diff_path sets intersection), combine-diff runs very slowly. In
  that commit we optimized paths sets intersection, but that accounted
  only for ~ 25% of the slowness, and as my tracing showed, for linux.git
  v3.10..v3.11, for merges a lot of time is spent computing
  diff(commit,commit^2) just to only then intersect that huge diff to
  almost small set of files from diff(commit,commit^1).
 
  That's because at present, to compute combine-diff, for first finding
  paths, that every parent touches, we use the following combine-diff
  property/definition:
 
  D(A,P1...Pn) = D(A,P1) ^ ... ^ D(A,Pn)  (w.r.t. paths)
 
  where
 
  D(A,P1...Pn) is combined diff between commit A, and parents Pi
 
  and
 
  D(A,Pi) is usual two-tree diff Pi..A
 
  and A ^ B means what???
 
 Silly me; obviously it is the set intersection operator.
 
 We probably could instead use the current set of paths as literal
 pathspec to compute subsequent paths, i.e.
 
   D(A,Pi,PS) is two tree diff P1..A limited to paths PS
 
   D(A,P1...Pn) = D(A,P1,[]) ^
  D(A,P2,D(A,P1,[])) ^
...
  D(A,Pn,D(A,P1...Pn-1))
 
 if we did not want to reinvent the whole tree walking thing, which
 would add risks for new bugs and burden to maintain this and the
 usual two-tree diff tree walking in sync.

Junio, I understand it is not good to duplicate code and increase
maintenance risks. On the other hand, I don't quite like the approach
with keeping current paths - it could work and be faster compared to
what we had, but to me it is still suboptimal, because if the first diff
D(A,P1) is huge then oops. Also, to implement it rationally, without
delving into unneeded recursion, we would need to do the diffing without
recursion, intersect results, and then recurse into overlapping subtrees,
which means tree-walker rework anyway.


Instead I propose we switch to the new tree walker completely. With the
attached patch which draftly shows it (diff_tree is gone, the work is
done through diff_tree_combined_paths, and then the result is
exported to diff_filepairs) all the tests pass. Also, timings for

git log --raw --no-abbrev --no-renames

( without -c - it would be the same - we are not touching that code, it
  would only add, irrelevant-to-the-change constant time )

are

linux.git   v3.10..v3.11became 0.1% _faster_
navy.gitbecame 1.4% slower


That means, that the new tree-walker is correct and should be ok
performance-wise (I had not optimized it at all, yet).

What would you say?

Thanks,
Kirill

P.S. Thanks for commenting on other patches. I'll try to correct them
tomorrow, as I have no more time today and need to run.


 8 
From: Kirill Smelkov k...@mns.spb.ru
Date: Tue, 4 Feb 2014 20:11:16 +0400
Subject: [PATCH] X Show new tree-walker could be used instead of the old one

All tests pass. Timings for git log --raw --no-abrev --no-renames

linux.git   v3.10..v3.11became 0.1% _faster_
navy.gitbecame 1.4% slower
---
 diff.h  |  2 ++
 line-log.c  | 12 +++--
 revision.c  | 16 ++-
 tree-diff.c | 88 +++--
 4 files changed, 90 insertions(+), 28 deletions(-)

diff --git a/diff.h b/diff.h
index 5496560..0a9845a 100644
--- a/diff.h
+++ b/diff.h
@@ -189,8 +189,10 @@ const char *diff_line_prefix(struct diff_options *);
 
 extern const char mime_boundary_leader[];
 
+#if 0
 extern int diff_tree(struct tree_desc *t1, struct tree_desc *t2,
 const char *base, struct diff_options *opt);
+#endif
 extern int diff_tree_sha1(const unsigned char *old, const unsigned char *new,
  const char *base, struct diff_options *opt);
 extern int diff_root_tree_sha1(const unsigned char *new, const char *base,
diff --git a/line-log.c b/line-log.c
index 717638b..5739bbf 100644
--- a/line-log.c
+++ b/line-log.c
@@ -766,6 +766,7 @@ void line_log_init(struct rev_info *rev, const char 
*prefix, struct string_list
}
 }
 
+#if 0
 static void load_tree_desc(struct tree_desc *desc, void **tree,
   const unsigned char *sha1)
 {
@@ -775,6 +776,7 @@ static void load_tree_desc(struct tree_desc *desc, void 
**tree,
die(Unable to read tree (%s), sha1_to_hex(sha1));
init_tree_desc(desc, *tree, size);
 }
+#endif
 
 static int count_parents(struct commit *commit)
 {
@@ -843,17 +845,23 @@ static void queue_diffs(struct line_log_data *range,
struct commit *commit, struct commit *parent)
 {
void *tree1 = NULL, *tree2 = NULL;
-   struct tree_desc desc1, desc2;
+// struct tree_desc desc1, desc2;
 
assert(commit);

Re: [PATCH 7/8] combine-diff: Fast changed-to-all-parents paths scanning

2014-02-04 Thread Junio C Hamano
Kirill Smelkov k...@mns.spb.ru writes:

 if we did not want to reinvent the whole tree walking thing, which
 would add risks for new bugs and burden to maintain this and the
 usual two-tree diff tree walking in sync.

 Junio, I understand it is not good to duplicate code and increase
 maintenance risks
 Instead I propose we switch to the new tree walker completely.

I am not fundamentally opposed to the idea. We'll see what happens.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 7/8] combine-diff: Fast changed-to-all-parents paths scanning

2014-02-03 Thread Junio C Hamano
Kirill Smelkov k...@mns.spb.ru writes:

 As was recently shown (c839f1bd combine-diff: optimize
 combine_diff_path sets intersection), combine-diff runs very slowly. In
 that commit we optimized paths sets intersection, but that accounted
 only for ~ 25% of the slowness, and as my tracing showed, for linux.git
 v3.10..v3.11, for merges a lot of time is spent computing
 diff(commit,commit^2) just to only then intersect that huge diff to
 almost small set of files from diff(commit,commit^1).

 That's because at present, to compute combine-diff, for first finding
 paths, that every parent touches, we use the following combine-diff
 property/definition:

 D(A,P1...Pn) = D(A,P1) ^ ... ^ D(A,Pn)  (w.r.t. paths)

 where

 D(A,P1...Pn) is combined diff between commit A, and parents Pi

 and

 D(A,Pi) is usual two-tree diff Pi..A

and A ^ B means what???

I do like the approach of walking the tree entries and stop as
shallowly as possible without recursing.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 7/8] combine-diff: Fast changed-to-all-parents paths scanning

2014-02-03 Thread Junio C Hamano
Junio C Hamano gits...@pobox.com writes:

 Kirill Smelkov k...@mns.spb.ru writes:

 As was recently shown (c839f1bd combine-diff: optimize
 combine_diff_path sets intersection), combine-diff runs very slowly. In
 that commit we optimized paths sets intersection, but that accounted
 only for ~ 25% of the slowness, and as my tracing showed, for linux.git
 v3.10..v3.11, for merges a lot of time is spent computing
 diff(commit,commit^2) just to only then intersect that huge diff to
 almost small set of files from diff(commit,commit^1).

 That's because at present, to compute combine-diff, for first finding
 paths, that every parent touches, we use the following combine-diff
 property/definition:

 D(A,P1...Pn) = D(A,P1) ^ ... ^ D(A,Pn)  (w.r.t. paths)

 where

 D(A,P1...Pn) is combined diff between commit A, and parents Pi

 and

 D(A,Pi) is usual two-tree diff Pi..A

 and A ^ B means what???

Silly me; obviously it is the set intersection operator.

We probably could instead use the current set of paths as literal
pathspec to compute subsequent paths, i.e.

D(A,Pi,PS) is two tree diff P1..A limited to paths PS

D(A,P1...Pn) = D(A,P1,[]) ^
   D(A,P2,D(A,P1,[])) ^
   ...
   D(A,Pn,D(A,P1...Pn-1))

if we did not want to reinvent the whole tree walking thing, which
would add risks for new bugs and burden to maintain this and the
usual two-tree diff tree walking in sync.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 7/8] combine-diff: Fast changed-to-all-parents paths scanning

2014-02-03 Thread Junio C Hamano
Kirill Smelkov k...@mns.spb.ru writes:

 + parents_sha1 = xmalloc(nparent * sizeof(parents_sha1[0]));
 + for (i = 0; i  nparent; i++)
 + parents_sha1[i] = parents-sha1[i];
 +
 + /* fake list head, so worker can assume it is non-NULL */
 + struct combine_diff_path paths_head;

decl-after-statement; please fix.

 + paths_head.next = NULL;
 +
 + struct strbuf base;

likewise.

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html