Re: [WIP PATCH v2] diff.c: emit moved lines with a different color

2016-09-06 Thread Stefan Beller
On Tue, Sep 6, 2016 at 5:44 AM, Junio C Hamano  wrote:
> By the way, not running xdiff twice would also remove another worry
> I have about correctness, in that the approach depends on xdiff
> machinery to produce byte-for-byte identical result given the same
> pair of input.

As we use different parameters to the xdiff machinery (e.g. context = 1 line)
the output is not  byte-for-byte identical.

> The output may currently be reproducible, but that
> is an unnecessary and an expensive thing to rely on.

My original design was to not store the lines in the hashmap but
only pointers to them, such that the additional memory pressure was
assumed less than storing the whole output of the xdiff machinery.

That point is moot though in the current implementation, so it
would be better indeed if we run the xdiff machinery once and store
all its output and then operate on that, even from a memory perspective.

>
> You may be able to save tons of memory if you do not store the line
> contents duplicated.  The first pass callback can tell the line
> numbers in preimage and postimage [*1*], so your record for a
> removed line could be a pair 
> with whatever hash value you need to throw it into the hash bucket.

Yeah I guess I'll go that way in the next patch then.

>
> I know we use a hash function and a line comparison function that
> are aware of -b/-w comparison in xdiff machinery, but I didn't see
> you use them in your hashtable.  Can you match moved lines when
> operating under various whitespace-change-ignoring modes?

Not yet.

Thanks,
Stefan

>
> Thanks.
>
>
> [Footnote]
>
> *1* You can learn all sort of things from emit_callback structure;
> if you need to pass more data from the caller of xdi_diff_outf()
> to the callback, you can even add new fields to it.
>


Re: [WIP PATCH v2] diff.c: emit moved lines with a different color

2016-09-06 Thread Junio C Hamano
Stefan Beller  writes:

> This new coloring is linear to the size of the patch, i.e. O(number of
> added/removed lines) in memory and for computational efforts I'd
> think it is O(n log n) as inserting into the hashmap is an amortized
> log n.

In addition to that O(n log n) overhead for book-keeping, you are
doing at least twice the amount of work compared to the original, if
you are still running the same xdiff twice to implement the two pass
approach.  That is why I thought this "twice as slow, at least"
needs to be off by default at least in the beginning.

Also there is the additional memory pressure coming from the fact
that the first pass will need to keep all the removed and added
lines in-core for all filepairs.  If you keep the entire diff output
in-core from the first pass, I do not think it would be that much
more memory overhead compared to what you are already doing, so the
cost of running the same xdiff twice is relatively easy to reduce, I
would imagine?  Instead of running the second xdi_diff_outf(), you
can drive your callback function out of what has been queued in the
first pass yourself.  But that is the next step "optimization" once
we got the basics right.

By the way, not running xdiff twice would also remove another worry
I have about correctness, in that the approach depends on xdiff
machinery to produce byte-for-byte identical result given the same
pair of input.  The output may currently be reproducible, but that
is an unnecessary and an expensive thing to rely on.

You may be able to save tons of memory if you do not store the line
contents duplicated.  The first pass callback can tell the line
numbers in preimage and postimage [*1*], so your record for a
removed line could be a pair 
with whatever hash value you need to throw it into the hash bucket.

I know we use a hash function and a line comparison function that
are aware of -b/-w comparison in xdiff machinery, but I didn't see
you use them in your hashtable.  Can you match moved lines when
operating under various whitespace-change-ignoring modes?

Thanks.


[Footnote]

*1* You can learn all sort of things from emit_callback structure;
if you need to pass more data from the caller of xdi_diff_outf()
to the callback, you can even add new fields to it.



Re: [WIP PATCH v2] diff.c: emit moved lines with a different color

2016-09-05 Thread Stefan Beller
On Mon, Sep 5, 2016 at 6:09 PM, Jacob Keller  wrote:
> On Mon, Sep 5, 2016 at 11:57 AM, Junio C Hamano  wrote:
>> Stefan Beller  writes:
>>
>>> diff --git a/Documentation/config.txt b/Documentation/config.txt
>>> index 0bcb679..f4f51c2 100644
>>> --- a/Documentation/config.txt
>>> +++ b/Documentation/config.txt
>>> @@ -980,8 +980,9 @@ color.diff.::
>>>   of `context` (context text - `plain` is a historical synonym),
>>>   `meta` (metainformation), `frag`
>>>   (hunk header), 'func' (function in hunk header), `old` (removed 
>>> lines),
>>> - `new` (added lines), `commit` (commit headers), or `whitespace`
>>> - (highlighting whitespace errors).
>>> + `new` (added lines), `commit` (commit headers), `whitespace`
>>> + (highlighting whitespace errors), `moved-old` (removed lines that
>>> + reappear), `moved-new` (added lines that were removed elsewhere).
>>
>> Could we have a config to disable this rather costly new feature,
>> too?
>
> That seems entirely reasonable, though we *do* have a configuration
> for disabling color altogether.. is there any numbers on how much more
> this costs to compute?

This new coloring is linear to the size of the patch, i.e. O(number of
added/removed lines) in memory and for computational efforts I'd
think it is O(n log n) as inserting into the hashmap is an amortized
log n.

>
>>> +static struct hashmap *duplicates_added;
>>> +static struct hashmap *duplicates_removed;
>>> +static int hash_previous_line_added;
>>> +static int hash_previous_line_removed;
>>
>> I think these should be added as new fields to diff_options
>> structure.
>
> Agreed, those seem like good choices for diff_options.

yup.


Re: [WIP PATCH v2] diff.c: emit moved lines with a different color

2016-09-05 Thread Stefan Beller
On Mon, Sep 5, 2016 at 11:57 AM, Junio C Hamano  wrote:
>> + `new` (added lines), `commit` (commit headers), `whitespace`
>> + (highlighting whitespace errors), `moved-old` (removed lines that
>> + reappear), `moved-new` (added lines that were removed elsewhere).
>
> Could we have a config to disable this rather costly new feature,
> too?

And by config option you mean both a command line parameter
`--color=yes-but-no-move-detection` as well as a diff.color config option?

As it is currently `--color=` with when={always, never,auto},
I don't think we want to add it as another parameter there, so maybe

Re: [WIP PATCH v2] diff.c: emit moved lines with a different color

2016-09-05 Thread Jacob Keller
On Mon, Sep 5, 2016 at 11:57 AM, Junio C Hamano  wrote:
> Stefan Beller  writes:
>
>> diff --git a/Documentation/config.txt b/Documentation/config.txt
>> index 0bcb679..f4f51c2 100644
>> --- a/Documentation/config.txt
>> +++ b/Documentation/config.txt
>> @@ -980,8 +980,9 @@ color.diff.::
>>   of `context` (context text - `plain` is a historical synonym),
>>   `meta` (metainformation), `frag`
>>   (hunk header), 'func' (function in hunk header), `old` (removed lines),
>> - `new` (added lines), `commit` (commit headers), or `whitespace`
>> - (highlighting whitespace errors).
>> + `new` (added lines), `commit` (commit headers), `whitespace`
>> + (highlighting whitespace errors), `moved-old` (removed lines that
>> + reappear), `moved-new` (added lines that were removed elsewhere).
>
> Could we have a config to disable this rather costly new feature,
> too?

That seems entirely reasonable, though we *do* have a configuration
for disabling color altogether.. is there any numbers on how much more
this costs to compute?

>> +static struct hashmap *duplicates_added;
>> +static struct hashmap *duplicates_removed;
>> +static int hash_previous_line_added;
>> +static int hash_previous_line_removed;
>
> I think these should be added as new fields to diff_options
> structure.

Agreed, those seem like good choices for diff_options.

Thanks,
Jake


Re: [WIP PATCH v2] diff.c: emit moved lines with a different color

2016-09-05 Thread Junio C Hamano
Stefan Beller  writes:

> diff --git a/Documentation/config.txt b/Documentation/config.txt
> index 0bcb679..f4f51c2 100644
> --- a/Documentation/config.txt
> +++ b/Documentation/config.txt
> @@ -980,8 +980,9 @@ color.diff.::
>   of `context` (context text - `plain` is a historical synonym),
>   `meta` (metainformation), `frag`
>   (hunk header), 'func' (function in hunk header), `old` (removed lines),
> - `new` (added lines), `commit` (commit headers), or `whitespace`
> - (highlighting whitespace errors).
> + `new` (added lines), `commit` (commit headers), `whitespace`
> + (highlighting whitespace errors), `moved-old` (removed lines that
> + reappear), `moved-new` (added lines that were removed elsewhere).

Could we have a config to disable this rather costly new feature,
too?

Also the first and the third level configuration names (the 
is at the third level) used by the core-git do not use dashed-words
format.  Please adhere to the current convention.

> diff --git a/diff.c b/diff.c
> index 534c12e..d37cb4f 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -18,6 +18,7 @@
>  #include "ll-merge.h"
>  #include "string-list.h"
>  #include "argv-array.h"
> +#include "git-compat-util.h"
>  
>  #ifdef NO_FAST_WORKING_DIRECTORY
>  #define FAST_WORKING_DIRECTORY 0
> @@ -42,6 +43,11 @@ static int diff_dirstat_permille_default = 30;
>  static struct diff_options default_diff_options;
>  static long diff_algorithm;
>  
> +static struct hashmap *duplicates_added;
> +static struct hashmap *duplicates_removed;
> +static int hash_previous_line_added;
> +static int hash_previous_line_removed;

I think these should be added as new fields to diff_options
structure.


[WIP PATCH v2] diff.c: emit moved lines with a different color

2016-09-04 Thread Stefan Beller
From: Stefan Beller 

When we color the diff, we'll mark moved lines with a different color.

This is achieved by doing a two passes over the diff. The first pass
will inspect each line of the diff and store the removed lines and the
added lines in its own hash map.
The second pass will check for each added line if that is found in the
set of removed lines. If so it will color the added line differently as
with the new `moved-new` color mode.
For each removed line we check the set of added lines and if found emit
that with the new color `moved-old`.

When detecting the moved lines, we cannot just rely on a line being equal,
but we need to take the context into account to detect when the moves were
reordered as we do not want to color moved but per-mutated lines.

This patch was motivated by e.g. reviewing 3b0c4200 ("apply: move
libified code from builtin/apply.c to apply.{c,h}", 2016-08-08)

Signed-off-by: Stefan Beller 
---

  This adds code only in the diff_flush part, just as you suggested Junio.
  
  I think the design is sound now, as it produces good highlights across files;
  though now the code still sucks.
   * leaking memory; 
   * we run the code to detect moves more often than we need, e.g.
 when we configured an external diff, we don't need to run it.
 Or when we do not use colors at all.
   * I need to think about if we need everything doubled for both add and
   removal; maybe we can use one variable for both cases.


  Jacob wrote:
  
  > In the example, the first and last lines of duplicate copies don't get
  > colored differently, and that threw  me off. I feel like that was
  > maybe not intentional? If it was, can you explain why?

  We need the context to detect permutations, improved version:
  
  http://i.imgur.com/MnaSZ1D.png


  Jakub wrote:
  
  > P.S. BTW. does this work with word-diff?
  
  No (not yet )
  
  Thanks,
  Stefan

 Documentation/config.txt   |   5 +-
 contrib/completion/git-completion.bash |   2 +
 diff.c | 200 -
 diff.h |   5 +-
 4 files changed, 205 insertions(+), 7 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 0bcb679..f4f51c2 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -980,8 +980,9 @@ color.diff.::
of `context` (context text - `plain` is a historical synonym),
`meta` (metainformation), `frag`
(hunk header), 'func' (function in hunk header), `old` (removed lines),
-   `new` (added lines), `commit` (commit headers), or `whitespace`
-   (highlighting whitespace errors).
+   `new` (added lines), `commit` (commit headers), `whitespace`
+   (highlighting whitespace errors), `moved-old` (removed lines that
+   reappear), `moved-new` (added lines that were removed elsewhere).
 
 color.decorate.::
Use customized color for 'git log --decorate' output.  `` is one
diff --git a/contrib/completion/git-completion.bash 
b/contrib/completion/git-completion.bash
index 9c8f738..b558d12 100644
--- a/contrib/completion/git-completion.bash
+++ b/contrib/completion/git-completion.bash
@@ -2115,6 +2115,8 @@ _git_config ()
color.diff.old
color.diff.plain
color.diff.whitespace
+   color.diff.moved-old
+   color.diff.moved-new
color.grep
color.grep.context
color.grep.filename
diff --git a/diff.c b/diff.c
index 534c12e..d37cb4f 100644
--- a/diff.c
+++ b/diff.c
@@ -18,6 +18,7 @@
 #include "ll-merge.h"
 #include "string-list.h"
 #include "argv-array.h"
+#include "git-compat-util.h"
 
 #ifdef NO_FAST_WORKING_DIRECTORY
 #define FAST_WORKING_DIRECTORY 0
@@ -42,6 +43,11 @@ static int diff_dirstat_permille_default = 30;
 static struct diff_options default_diff_options;
 static long diff_algorithm;
 
+static struct hashmap *duplicates_added;
+static struct hashmap *duplicates_removed;
+static int hash_previous_line_added;
+static int hash_previous_line_removed;
+
 static char diff_colors[][COLOR_MAXLEN] = {
GIT_COLOR_RESET,
GIT_COLOR_NORMAL,   /* CONTEXT */
@@ -52,6 +58,8 @@ static char diff_colors[][COLOR_MAXLEN] = {
GIT_COLOR_YELLOW,   /* COMMIT */
GIT_COLOR_BG_RED,   /* WHITESPACE */
GIT_COLOR_NORMAL,   /* FUNCINFO */
+   GIT_COLOR_BLUE, /* NEW MOVED */
+   GIT_COLOR_MAGENTA,  /* OLD MOVED */
 };
 
 static int parse_diff_color_slot(const char *var)
@@ -72,6 +80,10 @@ static int parse_diff_color_slot(const char *var)
return DIFF_WHITESPACE;
if (!strcasecmp(var, "func"))
return DIFF_FUNCINFO;
+   if (!strcasecmp(var, "moved-old"))
+   return DIFF_FILE_MOVED_OLD;
+   if (!strcasecmp(var, "moved-new"))
+   return DIFF_FILE_MOVED_NEW;
return -1;
 }