Re: [WIP PATCH v2] diff.c: emit moved lines with a different color
On Tue, Sep 6, 2016 at 5:44 AM, Junio C Hamanowrote: > By the way, not running xdiff twice would also remove another worry > I have about correctness, in that the approach depends on xdiff > machinery to produce byte-for-byte identical result given the same > pair of input. As we use different parameters to the xdiff machinery (e.g. context = 1 line) the output is not byte-for-byte identical. > The output may currently be reproducible, but that > is an unnecessary and an expensive thing to rely on. My original design was to not store the lines in the hashmap but only pointers to them, such that the additional memory pressure was assumed less than storing the whole output of the xdiff machinery. That point is moot though in the current implementation, so it would be better indeed if we run the xdiff machinery once and store all its output and then operate on that, even from a memory perspective. > > You may be able to save tons of memory if you do not store the line > contents duplicated. The first pass callback can tell the line > numbers in preimage and postimage [*1*], so your record for a > removed line could be a pair > with whatever hash value you need to throw it into the hash bucket. Yeah I guess I'll go that way in the next patch then. > > I know we use a hash function and a line comparison function that > are aware of -b/-w comparison in xdiff machinery, but I didn't see > you use them in your hashtable. Can you match moved lines when > operating under various whitespace-change-ignoring modes? Not yet. Thanks, Stefan > > Thanks. > > > [Footnote] > > *1* You can learn all sort of things from emit_callback structure; > if you need to pass more data from the caller of xdi_diff_outf() > to the callback, you can even add new fields to it. >
Re: [WIP PATCH v2] diff.c: emit moved lines with a different color
Stefan Bellerwrites: > This new coloring is linear to the size of the patch, i.e. O(number of > added/removed lines) in memory and for computational efforts I'd > think it is O(n log n) as inserting into the hashmap is an amortized > log n. In addition to that O(n log n) overhead for book-keeping, you are doing at least twice the amount of work compared to the original, if you are still running the same xdiff twice to implement the two pass approach. That is why I thought this "twice as slow, at least" needs to be off by default at least in the beginning. Also there is the additional memory pressure coming from the fact that the first pass will need to keep all the removed and added lines in-core for all filepairs. If you keep the entire diff output in-core from the first pass, I do not think it would be that much more memory overhead compared to what you are already doing, so the cost of running the same xdiff twice is relatively easy to reduce, I would imagine? Instead of running the second xdi_diff_outf(), you can drive your callback function out of what has been queued in the first pass yourself. But that is the next step "optimization" once we got the basics right. By the way, not running xdiff twice would also remove another worry I have about correctness, in that the approach depends on xdiff machinery to produce byte-for-byte identical result given the same pair of input. The output may currently be reproducible, but that is an unnecessary and an expensive thing to rely on. You may be able to save tons of memory if you do not store the line contents duplicated. The first pass callback can tell the line numbers in preimage and postimage [*1*], so your record for a removed line could be a pair with whatever hash value you need to throw it into the hash bucket. I know we use a hash function and a line comparison function that are aware of -b/-w comparison in xdiff machinery, but I didn't see you use them in your hashtable. Can you match moved lines when operating under various whitespace-change-ignoring modes? Thanks. [Footnote] *1* You can learn all sort of things from emit_callback structure; if you need to pass more data from the caller of xdi_diff_outf() to the callback, you can even add new fields to it.
Re: [WIP PATCH v2] diff.c: emit moved lines with a different color
On Mon, Sep 5, 2016 at 6:09 PM, Jacob Kellerwrote: > On Mon, Sep 5, 2016 at 11:57 AM, Junio C Hamano wrote: >> Stefan Beller writes: >> >>> diff --git a/Documentation/config.txt b/Documentation/config.txt >>> index 0bcb679..f4f51c2 100644 >>> --- a/Documentation/config.txt >>> +++ b/Documentation/config.txt >>> @@ -980,8 +980,9 @@ color.diff.:: >>> of `context` (context text - `plain` is a historical synonym), >>> `meta` (metainformation), `frag` >>> (hunk header), 'func' (function in hunk header), `old` (removed >>> lines), >>> - `new` (added lines), `commit` (commit headers), or `whitespace` >>> - (highlighting whitespace errors). >>> + `new` (added lines), `commit` (commit headers), `whitespace` >>> + (highlighting whitespace errors), `moved-old` (removed lines that >>> + reappear), `moved-new` (added lines that were removed elsewhere). >> >> Could we have a config to disable this rather costly new feature, >> too? > > That seems entirely reasonable, though we *do* have a configuration > for disabling color altogether.. is there any numbers on how much more > this costs to compute? This new coloring is linear to the size of the patch, i.e. O(number of added/removed lines) in memory and for computational efforts I'd think it is O(n log n) as inserting into the hashmap is an amortized log n. > >>> +static struct hashmap *duplicates_added; >>> +static struct hashmap *duplicates_removed; >>> +static int hash_previous_line_added; >>> +static int hash_previous_line_removed; >> >> I think these should be added as new fields to diff_options >> structure. > > Agreed, those seem like good choices for diff_options. yup.
Re: [WIP PATCH v2] diff.c: emit moved lines with a different color
On Mon, Sep 5, 2016 at 11:57 AM, Junio C Hamanowrote: >> + `new` (added lines), `commit` (commit headers), `whitespace` >> + (highlighting whitespace errors), `moved-old` (removed lines that >> + reappear), `moved-new` (added lines that were removed elsewhere). > > Could we have a config to disable this rather costly new feature, > too? And by config option you mean both a command line parameter `--color=yes-but-no-move-detection` as well as a diff.color config option? As it is currently `--color=` with when={always, never,auto}, I don't think we want to add it as another parameter there, so maybe
Re: [WIP PATCH v2] diff.c: emit moved lines with a different color
On Mon, Sep 5, 2016 at 11:57 AM, Junio C Hamanowrote: > Stefan Beller writes: > >> diff --git a/Documentation/config.txt b/Documentation/config.txt >> index 0bcb679..f4f51c2 100644 >> --- a/Documentation/config.txt >> +++ b/Documentation/config.txt >> @@ -980,8 +980,9 @@ color.diff.:: >> of `context` (context text - `plain` is a historical synonym), >> `meta` (metainformation), `frag` >> (hunk header), 'func' (function in hunk header), `old` (removed lines), >> - `new` (added lines), `commit` (commit headers), or `whitespace` >> - (highlighting whitespace errors). >> + `new` (added lines), `commit` (commit headers), `whitespace` >> + (highlighting whitespace errors), `moved-old` (removed lines that >> + reappear), `moved-new` (added lines that were removed elsewhere). > > Could we have a config to disable this rather costly new feature, > too? That seems entirely reasonable, though we *do* have a configuration for disabling color altogether.. is there any numbers on how much more this costs to compute? >> +static struct hashmap *duplicates_added; >> +static struct hashmap *duplicates_removed; >> +static int hash_previous_line_added; >> +static int hash_previous_line_removed; > > I think these should be added as new fields to diff_options > structure. Agreed, those seem like good choices for diff_options. Thanks, Jake
Re: [WIP PATCH v2] diff.c: emit moved lines with a different color
Stefan Bellerwrites: > diff --git a/Documentation/config.txt b/Documentation/config.txt > index 0bcb679..f4f51c2 100644 > --- a/Documentation/config.txt > +++ b/Documentation/config.txt > @@ -980,8 +980,9 @@ color.diff.:: > of `context` (context text - `plain` is a historical synonym), > `meta` (metainformation), `frag` > (hunk header), 'func' (function in hunk header), `old` (removed lines), > - `new` (added lines), `commit` (commit headers), or `whitespace` > - (highlighting whitespace errors). > + `new` (added lines), `commit` (commit headers), `whitespace` > + (highlighting whitespace errors), `moved-old` (removed lines that > + reappear), `moved-new` (added lines that were removed elsewhere). Could we have a config to disable this rather costly new feature, too? Also the first and the third level configuration names (the is at the third level) used by the core-git do not use dashed-words format. Please adhere to the current convention. > diff --git a/diff.c b/diff.c > index 534c12e..d37cb4f 100644 > --- a/diff.c > +++ b/diff.c > @@ -18,6 +18,7 @@ > #include "ll-merge.h" > #include "string-list.h" > #include "argv-array.h" > +#include "git-compat-util.h" > > #ifdef NO_FAST_WORKING_DIRECTORY > #define FAST_WORKING_DIRECTORY 0 > @@ -42,6 +43,11 @@ static int diff_dirstat_permille_default = 30; > static struct diff_options default_diff_options; > static long diff_algorithm; > > +static struct hashmap *duplicates_added; > +static struct hashmap *duplicates_removed; > +static int hash_previous_line_added; > +static int hash_previous_line_removed; I think these should be added as new fields to diff_options structure.
[WIP PATCH v2] diff.c: emit moved lines with a different color
From: Stefan BellerWhen we color the diff, we'll mark moved lines with a different color. This is achieved by doing a two passes over the diff. The first pass will inspect each line of the diff and store the removed lines and the added lines in its own hash map. The second pass will check for each added line if that is found in the set of removed lines. If so it will color the added line differently as with the new `moved-new` color mode. For each removed line we check the set of added lines and if found emit that with the new color `moved-old`. When detecting the moved lines, we cannot just rely on a line being equal, but we need to take the context into account to detect when the moves were reordered as we do not want to color moved but per-mutated lines. This patch was motivated by e.g. reviewing 3b0c4200 ("apply: move libified code from builtin/apply.c to apply.{c,h}", 2016-08-08) Signed-off-by: Stefan Beller --- This adds code only in the diff_flush part, just as you suggested Junio. I think the design is sound now, as it produces good highlights across files; though now the code still sucks. * leaking memory; * we run the code to detect moves more often than we need, e.g. when we configured an external diff, we don't need to run it. Or when we do not use colors at all. * I need to think about if we need everything doubled for both add and removal; maybe we can use one variable for both cases. Jacob wrote: > In the example, the first and last lines of duplicate copies don't get > colored differently, and that threw me off. I feel like that was > maybe not intentional? If it was, can you explain why? We need the context to detect permutations, improved version: http://i.imgur.com/MnaSZ1D.png Jakub wrote: > P.S. BTW. does this work with word-diff? No (not yet ) Thanks, Stefan Documentation/config.txt | 5 +- contrib/completion/git-completion.bash | 2 + diff.c | 200 - diff.h | 5 +- 4 files changed, 205 insertions(+), 7 deletions(-) diff --git a/Documentation/config.txt b/Documentation/config.txt index 0bcb679..f4f51c2 100644 --- a/Documentation/config.txt +++ b/Documentation/config.txt @@ -980,8 +980,9 @@ color.diff.:: of `context` (context text - `plain` is a historical synonym), `meta` (metainformation), `frag` (hunk header), 'func' (function in hunk header), `old` (removed lines), - `new` (added lines), `commit` (commit headers), or `whitespace` - (highlighting whitespace errors). + `new` (added lines), `commit` (commit headers), `whitespace` + (highlighting whitespace errors), `moved-old` (removed lines that + reappear), `moved-new` (added lines that were removed elsewhere). color.decorate.:: Use customized color for 'git log --decorate' output. `` is one diff --git a/contrib/completion/git-completion.bash b/contrib/completion/git-completion.bash index 9c8f738..b558d12 100644 --- a/contrib/completion/git-completion.bash +++ b/contrib/completion/git-completion.bash @@ -2115,6 +2115,8 @@ _git_config () color.diff.old color.diff.plain color.diff.whitespace + color.diff.moved-old + color.diff.moved-new color.grep color.grep.context color.grep.filename diff --git a/diff.c b/diff.c index 534c12e..d37cb4f 100644 --- a/diff.c +++ b/diff.c @@ -18,6 +18,7 @@ #include "ll-merge.h" #include "string-list.h" #include "argv-array.h" +#include "git-compat-util.h" #ifdef NO_FAST_WORKING_DIRECTORY #define FAST_WORKING_DIRECTORY 0 @@ -42,6 +43,11 @@ static int diff_dirstat_permille_default = 30; static struct diff_options default_diff_options; static long diff_algorithm; +static struct hashmap *duplicates_added; +static struct hashmap *duplicates_removed; +static int hash_previous_line_added; +static int hash_previous_line_removed; + static char diff_colors[][COLOR_MAXLEN] = { GIT_COLOR_RESET, GIT_COLOR_NORMAL, /* CONTEXT */ @@ -52,6 +58,8 @@ static char diff_colors[][COLOR_MAXLEN] = { GIT_COLOR_YELLOW, /* COMMIT */ GIT_COLOR_BG_RED, /* WHITESPACE */ GIT_COLOR_NORMAL, /* FUNCINFO */ + GIT_COLOR_BLUE, /* NEW MOVED */ + GIT_COLOR_MAGENTA, /* OLD MOVED */ }; static int parse_diff_color_slot(const char *var) @@ -72,6 +80,10 @@ static int parse_diff_color_slot(const char *var) return DIFF_WHITESPACE; if (!strcasecmp(var, "func")) return DIFF_FUNCINFO; + if (!strcasecmp(var, "moved-old")) + return DIFF_FILE_MOVED_OLD; + if (!strcasecmp(var, "moved-new")) + return DIFF_FILE_MOVED_NEW; return -1; }