Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs

2016-08-15 Thread Junio C Hamano
Stefan Beller  writes:

> Maybe we can enable Michaels heuristic with the same
> config/command line flag, i.e. "the flag changes its algorithm"?

I think that is a very sensible proposal.  After all, the name
diff.compactionHeuristic only tells us what part of the diff process
the heuristic is used, and does not say anything about what the
heuristics does.  It is neutral between "take a blank as a hint" vs
"take indentation leveles as a hint".
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs

2016-08-15 Thread Stefan Beller
> Is there a case where the compaction heuristic produces a better result
> than this indent heuristic? AFAICT, you have not found one, and I'd be
> surprised if there is one, because this _seems_ like a superset
> generally. I suppose there is always the possibility that the empirical
> knobs behave badly in some particular case that the compaction heuristic
> just happens to get right, but it should be quite rare.

This is how I understand it as well. I would not mind to remove the
blank-line-suggested-split heuristic.

Maybe we can enable Michaels heuristic with the same
config/command line flag, i.e. "the flag changes its algorithm"?
Then people, who read the prior announcement don't have to
do anything, while we keep it as an experimental feature for the
next release and auto-on it after that?

We could also be a bit more aggressive and auto-on the new
heuristic with the old heuristic removed and we only have an
(undocumented) emergency-off knob for one or two releases?

Thanks,
Stefan
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs

2016-08-14 Thread Jacob Keller
On Sat, Aug 13, 2016 at 8:59 AM, Junio C Hamano  wrote:
> Jeff King  writes:
>
>> So assuming everything I just said isn't complete bollocks, I think we
>> can move to a future where nobody uses the compaction heuristic. And
>> there are three ways to deal with that:
>>
>>   1. The knob and feature stay. It might be useful for somebody who
>>  wants to experiment in the future.
>>
>>   2. The knob and feature go away completely. It was an experiment, but
>>  now we have something more useful.
>>
>>   3. The feature goes away, but the knob stays as noop, or maybe as an
>>  alias for the indent heuristic, just because we did ship a version
>>  that accepts "--compaction-heuristic", and maybe somebody somewhere
>>  put it in a script?
>>
>> I think I'd be in favor of (2).
>
> I am all for (2) [*1*]
>

I also am in favor of (2). I understand the reasoning for maintaining
compatibility, but this was a known experimental feature that was
unlikely used by many people. Even if it was, these are the very sorts
of people who should be aware that the experimental feature is going
away. It reduces code complexity if it just goes away, and I believe
the new heuristic is much better (Thank you Michael!)

As for a knob on the new feature, I think it can become the default
with a way to disable the feature via command line. I'm not really
sure it needs a config option at all.

> This and the previous "take a blank line as a hint" are both
> heuristics.  As long as the resulting code does not tax runtime
> performance visibly and improves the resulting output 99% of the
> time, there is no reason to leave end-users a knob.  "Among 9 hunks
> in this patch that touch hello.c, 7 are made much more readable but
> 2 are worse" cannot even be helped with a command line option.
>

Yea I agree. It might be worth having it disabled via the stable patch
IDs (? I don't know if we guarantee this?) but otherwise I don't see
it being important either way. I would vote for a way to disable it
via command line just because we *are* changing behavior here. But I
don't think it needs to be a config option at all.

>
> [Footnote]
>
> *1* I am also strongly against (3), if only to teach people a
> lesson ;-).
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs

2016-08-13 Thread Junio C Hamano
Jeff King  writes:

> So assuming everything I just said isn't complete bollocks, I think we
> can move to a future where nobody uses the compaction heuristic. And
> there are three ways to deal with that:
>
>   1. The knob and feature stay. It might be useful for somebody who
>  wants to experiment in the future.
>
>   2. The knob and feature go away completely. It was an experiment, but
>  now we have something more useful.
>
>   3. The feature goes away, but the knob stays as noop, or maybe as an
>  alias for the indent heuristic, just because we did ship a version
>  that accepts "--compaction-heuristic", and maybe somebody somewhere
>  put it in a script?
>
> I think I'd be in favor of (2).

I am all for (2) [*1*]

This and the previous "take a blank line as a hint" are both
heuristics.  As long as the resulting code does not tax runtime
performance visibly and improves the resulting output 99% of the
time, there is no reason to leave end-users a knob.  "Among 9 hunks
in this patch that touch hello.c, 7 are made much more readable but
2 are worse" cannot even be helped with a command line option.


[Footnote]

*1* I am also strongly against (3), if only to teach people a
lesson ;-).
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs

2016-08-13 Thread Jeff King
On Sat, Aug 13, 2016 at 01:25:05AM +0200, Michael Haggerty wrote:

> > These two flags are mutually exclusive in the xdiff code, so we should
> > probably handle that here.
> > 
> > TBH, I do not care that much what:
> > 
> >   [diff]
> >   compactionHeuristic = true
> >   indentHeuristic = true
> > 
> > does. But right now:
> > 
> >   git config diff.compactionHeuristic true
> >   git show --indent-heuristic
> > 
> > still prefers the compaction heuristic, which I think is objectively
> > wrong.
> 
> I wasn't worrying about that yet, given that these two features are both
> still experimental. I also have a strong inkling that at most one of
> them needs to be made permanent. I propose that I repair the semantics
> in the simplest way possible for now while we decide on the long-term
> plan, which might conceivably be:
> 
> * keep both options permanently
> * keep only one option permanently
> * choose one heuristic and use it always (i.e., make it part of the new
> standard one-and-only diff algorithm)
> * discard both heuristics (I hope not!)
> 
> After we've decided on that, *then* let's decide on a suitable UI and
> implement it before we declare either feature non-experimental.

Is there a case where the compaction heuristic produces a better result
than this indent heuristic? AFAICT, you have not found one, and I'd be
surprised if there is one, because this _seems_ like a superset
generally. I suppose there is always the possibility that the empirical
knobs behave badly in some particular case that the compaction heuristic
just happens to get right, but it should be quite rare.

So assuming everything I just said isn't complete bollocks, I think we
can move to a future where nobody uses the compaction heuristic. And
there are three ways to deal with that:

  1. The knob and feature stay. It might be useful for somebody who
 wants to experiment in the future.

  2. The knob and feature go away completely. It was an experiment, but
 now we have something more useful.

  3. The feature goes away, but the knob stays as noop, or maybe as an
 alias for the indent heuristic, just because we did ship a version
 that accepts "--compaction-heuristic", and maybe somebody somewhere
 put it in a script?

I think I'd be in favor of (2). It doesn't seem likely enough for people
to experiment with to merit a run-time knob; they can always patch and
build if they want to do so. And (3) just seems like a pain for
something that was only shipped in one version and was kind of
experimental, and was unlikely to end up in scripts (much more likely is
that people set the config, but that's easier to ignore). But it does
violate our usual backwards-compatibility rules.

So if we assume that indent is useful and compaction goes away, the only
questions are "does indent it become the default" and "if so, does it
still get a knob". I'd say "yes" to both. Making the new behavior the
default was what we planned to do with compaction until we saw that it
regressed some cases. But as a new feature, it's nice for users to be
able to easily disable it to see if it's causing a problem (or to see
what a big improvement it is!).

I think we could get by with just a command-line option for that
purpose, rather than a config option; that saves a lot of effort in
having porcelains manually propagate the config option when they call a
plumbing diff-tree.

I guess the only users that leaves out are ones who really want stable
backwards-compatible diff. I guess "patch --stable" is one such user
(but that one we could handle internally). But let's say you had a code
review system that attached comments to lines of a diff. You might want
to disable the feature entirely to avoid invalidating old comments.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs

2016-08-12 Thread Michael Haggerty
On 08/04/2016 06:55 PM, Stefan Beller wrote:
> [...]
> I have just reread the scoring function and I think you could pull out the
> `score=indent` assignment (it is always assigned except for indent <0)
> 
> if (indent == -1)
>score = 0;
> else
>score = indent;
> ... lots of bonus computation below, which in its current 
> implementation
> have lots of "score = indent;" lines as well.

Yes. An earlier version of the heuristic used different indent values in
different situations, but that's gone away so the code can be made
simpler now. I'll make the change.

Thanks,
Michael

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs

2016-08-12 Thread Michael Haggerty
On 08/04/2016 09:52 PM, Junio C Hamano wrote:
> Michael Haggerty  writes:
>> +#define START_OF_FILE_BONUS 9
>> +#define END_OF_FILE_BONUS 46
>> +#define TOTAL_BLANK_WEIGHT 4
>> +#define PRE_BLANK_WEIGHT 16
>> +#define RELATIVE_INDENT_BONUS -1
>> +#define RELATIVE_INDENT_HAS_BLANK_BONUS 15
>> +#define RELATIVE_OUTDENT_BONUS -19
>> +#define RELATIVE_OUTDENT_HAS_BLANK_BONUS 2
> 
> When I read up to here, I thought "Heh, isn't the opposite of INdent
> DEdent?" and then saw this:
> 
>> +#define RELATIVE_DEDENT_BONUS -63
>> +#define RELATIVE_DEDENT_HAS_BLANK_BONUS 50
> 
> It turns out that you mean by OUTdent a line that indents further
> (if I am reading the code correctly).  Is that obvious to everybody?

I'll comment it better.

>> +/* Bonuses based on the location of blank lines: */
>> +bonus += TOTAL_BLANK_WEIGHT * total_blanks;
>> +bonus += PRE_BLANK_WEIGHT * m->pre_blank;
> 
> This and ...
> 
>> +} else if (indent > m->pre_indent) {
>> +/*
>> + * The line is indented more than its predecessor. Score it 
>> based
>> + * on the larger indent:
>> + */
>> +score = indent;
>> +bonus += RELATIVE_INDENT_BONUS;
>> +bonus += RELATIVE_INDENT_HAS_BLANK_BONUS * any_blanks;
>> +} else if (indent < m->pre_indent) {
> 
> ... this seems to be indented correctly even after getting quoted,
> which in turn means most of the lines in the added code share
> indent-with-non-tab badness.

The code was copy-pasted from a Python prototype then converted to C :-)

I'll fix the whitespace.

Thanks,
Michael

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs

2016-08-12 Thread Michael Haggerty
On 08/04/2016 09:56 AM, Jeff King wrote:
> On Thu, Aug 04, 2016 at 12:00:36AM +0200, Michael Haggerty wrote:
> 
>> This table shows the number of diff slider groups that were positioned
>> differently than the human-generated values, for various repositories.
>> "default" is the default "git diff" algorithm. "compaction" is Git 2.9.0
>> with the `--compaction-heuristic` option "indent" is an earlier,
> 
> s/option/&./

Thanks, will change.

>>  static int diff_detect_rename_default;
>> +static int diff_indent_heuristic; /* experimental */
>>  static int diff_compaction_heuristic; /* experimental */
> 
> These two flags are mutually exclusive in the xdiff code, so we should
> probably handle that here.
> 
> TBH, I do not care that much what:
> 
>   [diff]
>   compactionHeuristic = true
>   indentHeuristic = true
> 
> does. But right now:
> 
>   git config diff.compactionHeuristic true
>   git show --indent-heuristic
> 
> still prefers the compaction heuristic, which I think is objectively
> wrong.

I wasn't worrying about that yet, given that these two features are both
still experimental. I also have a strong inkling that at most one of
them needs to be made permanent. I propose that I repair the semantics
in the simplest way possible for now while we decide on the long-term
plan, which might conceivably be:

* keep both options permanently
* keep only one option permanently
* choose one heuristic and use it always (i.e., make it part of the new
standard one-and-only diff algorithm)
* discard both heuristics (I hope not!)

After we've decided on that, *then* let's decide on a suitable UI and
implement it before we declare either feature non-experimental.

> [...]
> Speaking of absurd amounts of work, I was curious if there was a
> noticeable performance penalty for using this heuristic [...]

I included some performance numbers in my response to Junio [1].

>> +#define START_OF_FILE_BONUS 9
>> +#define END_OF_FILE_BONUS 46
>> +#define TOTAL_BLANK_WEIGHT 4
>> +#define PRE_BLANK_WEIGHT 16
>> +#define RELATIVE_INDENT_BONUS -1
>> +#define RELATIVE_INDENT_HAS_BLANK_BONUS 15
>> +#define RELATIVE_OUTDENT_BONUS -19
>> +#define RELATIVE_OUTDENT_HAS_BLANK_BONUS 2
>> +#define RELATIVE_DEDENT_BONUS -63
>> +#define RELATIVE_DEDENT_HAS_BLANK_BONUS 50
> 
> I see there is a comment below here mentioning that these are empirical
> voodoo, but it might be worth one at the top (or just moving these below
> the comment) because the comment looks like it's just associated with
> the function (and these are sufficiently bizarre that anybody reading is
> going to double-take on them).

Good idea.

>> +return 10 * score - bonus;
> 
> I don't mind this not "10" not being a #define constant, but after
> reading the exchange between you and Stefan, I think it would be nice to
> describe what it is in a comment. The rest of the function is commented
> so nicely that this one left me thinking "huh?" upon seeing the "10".

Done. Thanks for your review.

Michael

[1]
http://public-inbox.org/git/5fe0edbc-3659-058f-3328-639d1343f...@alum.mit.edu/

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs

2016-08-10 Thread Junio C Hamano
Michael Haggerty  writes:

>> After
>> all, somebody in this file is already scanning each and every line
>> to see where it ends to split the input into records, so perhaps a
>> "right" (if the "theoretical correctness" of the return value from
>> this function mattered, which you wave-away below) optimization
>> could be to precompute it while the lines are broken into records
>> and store it in the "rec" structure?
>
> That would certainly be possible, and would help in cases where there
> are a lot of lines with lots of leading whitespace. You could get nearly
> the same benefit by recording a single bit in struct rec, indicating
> whether the line is blank or not.
>
> But it wouldn't help the worst case described above, where each call to
> `git_indent()` is already very cheap. And I didn't think it was worth
> allocating the extra memory to optimize this heuristic

True.

> I don't know. It seems like a pretty contrived justification for what is
> basically, "your input is too weird for us. We're not going to break our
> necks trying to give you the best possible slider positioning."

True again.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs

2016-08-10 Thread Michael Haggerty
On 08/04/2016 02:04 AM, Stefan Beller wrote:
> On Wed, Aug 3, 2016 at 4:30 PM, Michael Haggerty  wrote:
>> Stefan Beller wrote:
>>> [...]
>>> Rather the 10 describes the ratio of "advanced magic" to pure indentation
>>> based scoring in my understanding.
>>
>> No, it's basically just a number against which the other constants are
>> compared. E.g., if another bonus wants to balance out against exactly
>> one space of indentation, its constant needs to be 10. If it wants to
>> balance out against exactly 5 spaces, its constant needs to be 50. Etc.
> 
> So another interpretation is that the 10 gives the resolution for all other
> constants, i.e. if we keep 10, then we can only give weights in 1/10 of
> "one indent". But the "ideal" weight may not be a multiple of 1/10,
> so we approximate them to the nearest multiple of 1/10.
> 
> If we were to use 1000 here, we could have a higher accuracy of the
> other constants, but probably we do not care about the 3rd decimal place
> for these because they are created heuristically from a corpus that may
> not warrant a precision of constants with a 3rd decimal place.

Not only that. Since all of the inputs to the heuristic are integers,
the outputs are discontinuous and can take only certain discrete values.
And the outputs are only ever compared against one another. So a small
adjustment to the output will only make a difference if it causes the
value to become larger/smaller than another of the possible output
values. So too high a resolution makes no sense at all.

That being said, I didn't actually check in any systematic way whether
the resolution of 10:1 is high enough in practice. But I can say that
the overall performance of the heuristic (in terms of number of errors
counted) remained constant over a relatively large parameter range, so I
think the resolution is sufficient.

Feel free to play with the parameters and/or other heuristics. The tools
and raw data are all published in [1]. Let me know if you need help
getting started.

Michael

[1] https://github.com/mhagger/diff-slider-tools

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs

2016-08-10 Thread Michael Haggerty
On 08/04/2016 09:39 PM, Junio C Hamano wrote:
> Michael Haggerty  writes:
> 
 +   }
 +   /*
 +* We have reached the end of the line without finding any 
 non-space
 +* characters; i.e., the whole line consists of trailing spaces, 
 which we
 +* are not interested in.
 +*/
 +   return -1;
> 
> Not related to Jacob's review, but "the whole line consists of
> trailing spaces" made me read it twice; while it is technically
> correct, "the whole line consists of spaces", or even "this is a
> blank line", would read a lot more easily, at least for me.

Hehe, yes, ETOOMANYWORDS.

>> I was implicitly assuming that such lines would have text somewhere
>> after those 200 spaces (or 25 TABs or whatever). But you're right, the
>> line could consist only of whitespace. Unfortunately, the only way to
>> distinguish these two cases is to read the rest of the line, which is
>> exactly what we *don't* want to do.
> 
> Hmm, why is it exactly what we don't want to do?  Is it a
> performance concern?  In other words, is it because this function is
> called many times to measure the same line multiple times?

Yes, performance is the reason, and especially the desire to avoid
unreasonable runtimes for pathological cases. Thanks for asking, though,
because it's worthwhile to look into this more rigorously.

Suppose there is a slider that can be shifted to any of `numshift`
positions. Then we have to call `measure_split()` `2*numshift` times
(once for the beginning and once for the end of each candidate slider
position).

Suppose there are `numblanks` blank lines in the neighborhood of the
slider. Each time we call `measure_split()`, we count the number of
blank lines before and after the proposed split position. So we end up
calling `get_indent()` `2*numshift*numblanks` times.

Finally, suppose that the blank lines each contain `numws` whitespace
characters. Then each call to `get_indent()` has to do `O(numws)` work.

So altogether, if there were no limits, then the amount of work to
position a slider would scale like

O(numshift * numblanks * numws)

However, the total number of characters in the file might only be

O(numblanks * numws)

So without limits, the amount of work to position sliders could scale by
numshift times the size of the file.

The worst case is pretty easy to achieve. Just create a file consisting
of a million or so LF characters, then add another LF to it. The diff
would be a slider with

numshift = 100
numblanks = 100
numws = 1

so the heuristic would take O(N^2) in the size of the file.

Effectively the limits cap the effective `numblanks` at `2*MAX_BLANKS`
(which is 2*20) and the effective `numws` at `MAX_INDENT` (which is
200), meaning that the maximum effort scales like

numshift * 2*20 * 200

which is still a big number but not absurd. Empirically, for the case
described above, `git diff` takes 104 ms and `git diff
--indent-heuristic` takes 490 ms. I think that's not prohibitive for a
pathological case.

Meanwhile, Myers's algorithm scales like O(ND), where N is the number of
lines and D is the edit distance, so I suppose that it is already
possible to find diffs that are intractable to compute.

> After
> all, somebody in this file is already scanning each and every line
> to see where it ends to split the input into records, so perhaps a
> "right" (if the "theoretical correctness" of the return value from
> this function mattered, which you wave-away below) optimization
> could be to precompute it while the lines are broken into records
> and store it in the "rec" structure?

That would certainly be possible, and would help in cases where there
are a lot of lines with lots of leading whitespace. You could get nearly
the same benefit by recording a single bit in struct rec, indicating
whether the line is blank or not.

But it wouldn't help the worst case described above, where each call to
`git_indent()` is already very cheap. And I didn't think it was worth
allocating the extra memory to optimize this heuristic

* which isn't used all that often in the first place,
* which (for normal inputs) doesn't take a significant amount of time, and
* when the optimization doesn't improve the worst-case scenario (and
thus any DoS potential)

I think the only way to ensure O(size_of_file) runtime in the above case
would be to record, along with each line, how many blank lines
immediately precede and succeed it. You could achieve something like
O(size_of_file lg(size_of_file)) by storing, e.g., the total number of
nonblank lines that precede each line and doing a binary search to find
the nearest non-blank line.

>> But I think it doesn't matter anyway. Such "text" will likely never be
>> read by a human, so it's not a big deal if the slider position is not
>> picked perfectly. And remember, this whole saga is just to improve the
>> aesthetics of the diff. The diff is *correct* 

Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs

2016-08-04 Thread Junio C Hamano
Michael Haggerty  writes:

I agree with Peff about "comment on the voodoo upfront".

> +#define START_OF_FILE_BONUS 9
> +#define END_OF_FILE_BONUS 46
> +#define TOTAL_BLANK_WEIGHT 4
> +#define PRE_BLANK_WEIGHT 16
> +#define RELATIVE_INDENT_BONUS -1
> +#define RELATIVE_INDENT_HAS_BLANK_BONUS 15
> +#define RELATIVE_OUTDENT_BONUS -19
> +#define RELATIVE_OUTDENT_HAS_BLANK_BONUS 2

When I read up to here, I thought "Heh, isn't the opposite of INdent
DEdent?" and then saw this:

> +#define RELATIVE_DEDENT_BONUS -63
> +#define RELATIVE_DEDENT_HAS_BLANK_BONUS 50

It turns out that you mean by OUTdent a line that indents further
(if I am reading the code correctly).  Is that obvious to everybody?

> + /* Bonuses based on the location of blank lines: */
> +bonus += TOTAL_BLANK_WEIGHT * total_blanks;
> + bonus += PRE_BLANK_WEIGHT * m->pre_blank;

This and ...

> +} else if (indent > m->pre_indent) {
> + /*
> +  * The line is indented more than its predecessor. Score it 
> based
> +  * on the larger indent:
> +  */
> + score = indent;
> + bonus += RELATIVE_INDENT_BONUS;
> + bonus += RELATIVE_INDENT_HAS_BLANK_BONUS * any_blanks;
> + } else if (indent < m->pre_indent) {

... this seems to be indented correctly even after getting quoted,
which in turn means most of the lines in the added code share
indent-with-non-tab badness.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs

2016-08-04 Thread Junio C Hamano
Stefan Beller  writes:

> I have just reread the scoring function and I think you could pull out the
> `score=indent` assignment (it is always assigned except for indent <0)
>
> if (indent == -1)
>score = 0;
> else
>score = indent;
> ... lots of bonus computation below, which in its current 
> implementation
> have lots of "score = indent;" lines as well.

Yup.  If each part in this if/else if/... cascade independently sets
complete information (i.e. both "bonus" and "score") necessary for
the final result, then I do not mind the same "score = indent" in
many of them (these case happen to get the same score), but that is
not what we have in this code (i.e. "bonus" has a shared component
that is not affected by thie if/else if/ cascade), so setting score
to indent upfront and reset it to 0 only on a blank line would make
sense.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs

2016-08-04 Thread Junio C Hamano
Michael Haggerty  writes:

>>> +   }
>>> +   /*
>>> +* We have reached the end of the line without finding any non-space
>>> +* characters; i.e., the whole line consists of trailing spaces, 
>>> which we
>>> +* are not interested in.
>>> +*/
>>> +   return -1;

Not related to Jacob's review, but "the whole line consists of
trailing spaces" made me read it twice; while it is technically
correct, "the whole line consists of spaces", or even "this is a
blank line", would read a lot more easily, at least for me.

> I was implicitly assuming that such lines would have text somewhere
> after those 200 spaces (or 25 TABs or whatever). But you're right, the
> line could consist only of whitespace. Unfortunately, the only way to
> distinguish these two cases is to read the rest of the line, which is
> exactly what we *don't* want to do.

Hmm, why is it exactly what we don't want to do?  Is it a
performance concern?  In other words, is it because this function is
called many times to measure the same line multiple times?  After
all, somebody in this file is already scanning each and every line
to see where it ends to split the input into records, so perhaps a
"right" (if the "theoretical correctness" of the return value from
this function mattered, which you wave-away below) optimization
could be to precompute it while the lines are broken into records
and store it in the "rec" structure?

> But I think it doesn't matter anyway. Such "text" will likely never be
> read by a human, so it's not a big deal if the slider position is not
> picked perfectly. And remember, this whole saga is just to improve the
> aesthetics of the diff. The diff is *correct* (e.g., in the sense of
> applicable) regardless of where we position the sliders.

A better argument may be "if the user is truly reading a diff output
for such an unusual "text", it is likely that she has a very wide
display and/or running less -S, and treating such an overindented line
as if it were a blank line would give a result that is more consistent
to what appears on her display", perhaps?
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs

2016-08-04 Thread Stefan Beller
On Thu, Aug 4, 2016 at 12:56 AM, Jeff King  wrote:
> On Thu, Aug 04, 2016 at 12:00:36AM +0200, Michael Haggerty wrote:
>
>> This table shows the number of diff slider groups that were positioned
>> differently than the human-generated values, for various repositories.
>> "default" is the default "git diff" algorithm. "compaction" is Git 2.9.0
>> with the `--compaction-heuristic` option "indent" is an earlier,
>
> s/option/&./
>
>>  static int diff_detect_rename_default;
>> +static int diff_indent_heuristic; /* experimental */
>>  static int diff_compaction_heuristic; /* experimental */
>
> These two flags are mutually exclusive in the xdiff code, so we should
> probably handle that here.
>
> TBH, I do not care that much what:
>
>   [diff]
>   compactionHeuristic = true
>   indentHeuristic = true
>
> does. But right now:
>
>   git config diff.compactionHeuristic true
>   git show --indent-heuristic
>
> still prefers the compaction heuristic, which I think is objectively
> wrong.
>
> So perhaps we need a single variable:
>
>   enum {
> DIFF_HEURISTIC_COMPACTION,
> DIFF_HEURISTIC_INDENT
>   } diff_heuristic;
>
> and set it in last-one-wins fashion (it would be nice if the config and
> command line options were shaped the same way so it's clear to the user
> that they are exclusive, but we may have to keep --compaction-heuristic
> around for compatibility, as an alias for --diff-heuristic=compaction).
>
>> diff --git a/git-add--interactive.perl b/git-add--interactive.perl
>> index 642cce1..ee3d812 100755
>> --- a/git-add--interactive.perl
>> +++ b/git-add--interactive.perl
>> @@ -45,6 +45,7 @@ my ($diff_new_color) =
>>  my $normal_color = $repo->get_color("", "reset");
>>
>>  my $diff_algorithm = $repo->config('diff.algorithm');
>> +my $diff_indent_heuristic = $repo->config_bool('diff.indentheuristic');
>>  my $diff_compaction_heuristic = 
>> $repo->config_bool('diff.compactionheuristic');
>
> Nice touch.
>
> Unfortunately the mutual-exclusivity handling will probably bleed over
> to here, too.
>
>> +/*
>> + * If a line is indented more than this, get_indent() just returns this 
>> value.
>> + * This avoids having to do absurd amounts of work for data that are not
>> + * human-readable text, and also ensures that the output of get_indent fits 
>> within
>> + * an int.
>> + */
>> +#define MAX_INDENT 200
>
> Speaking of absurd amounts of work, I was curious if there was a
> noticeable performance penalty for using this heuristic (just because
> it's a lot more complicated than the others). I couldn't detect any
> differences running "git log -p --no-merges -3000" on git.git with no
> heuristic, compaction, and indent. There may be other repositories that
> behave more pathologically (it looks like having 20 blank lines at the
> end of each hunk?), but I'd guess in most cases this will always be
> drowned out in the noise of doing the actual diff.
>
>> +#define START_OF_FILE_BONUS 9
>> +#define END_OF_FILE_BONUS 46
>> +#define TOTAL_BLANK_WEIGHT 4
>> +#define PRE_BLANK_WEIGHT 16
>> +#define RELATIVE_INDENT_BONUS -1
>> +#define RELATIVE_INDENT_HAS_BLANK_BONUS 15
>> +#define RELATIVE_OUTDENT_BONUS -19
>> +#define RELATIVE_OUTDENT_HAS_BLANK_BONUS 2
>> +#define RELATIVE_DEDENT_BONUS -63
>> +#define RELATIVE_DEDENT_HAS_BLANK_BONUS 50
>
> I see there is a comment below here mentioning that these are empirical
> voodoo, but it might be worth one at the top (or just moving these below
> the comment) because the comment looks like it's just associated with
> the function (and these are sufficiently bizarre that anybody reading is
> going to double-take on them).
>
>> +return 10 * score - bonus;
>
> I don't mind this not "10" not being a #define constant, but after
> reading the exchange between you and Stefan, I think it would be nice to
> describe what it is in a comment. The rest of the function is commented
> so nicely that this one left me thinking "huh?" upon seeing the "10".

After a night of sleep I agree with Peffs statement here, it's not about the
#define, it's about the comment. (which the #define would have given in a
short cryptic way in angry capital letters).

I have just reread the scoring function and I think you could pull out the
`score=indent` assignment (it is always assigned except for indent <0)

if (indent == -1)
   score = 0;
else
   score = indent;
... lots of bonus computation below, which in its current implementation
have lots of "score = indent;" lines as well.

Thanks,
Stefan
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs

2016-08-04 Thread Jeff King
On Thu, Aug 04, 2016 at 12:00:36AM +0200, Michael Haggerty wrote:

> This table shows the number of diff slider groups that were positioned
> differently than the human-generated values, for various repositories.
> "default" is the default "git diff" algorithm. "compaction" is Git 2.9.0
> with the `--compaction-heuristic` option "indent" is an earlier,

s/option/&./

>  static int diff_detect_rename_default;
> +static int diff_indent_heuristic; /* experimental */
>  static int diff_compaction_heuristic; /* experimental */

These two flags are mutually exclusive in the xdiff code, so we should
probably handle that here.

TBH, I do not care that much what:

  [diff]
  compactionHeuristic = true
  indentHeuristic = true

does. But right now:

  git config diff.compactionHeuristic true
  git show --indent-heuristic

still prefers the compaction heuristic, which I think is objectively
wrong.

So perhaps we need a single variable:

  enum {
DIFF_HEURISTIC_COMPACTION,
DIFF_HEURISTIC_INDENT
  } diff_heuristic;

and set it in last-one-wins fashion (it would be nice if the config and
command line options were shaped the same way so it's clear to the user
that they are exclusive, but we may have to keep --compaction-heuristic
around for compatibility, as an alias for --diff-heuristic=compaction).

> diff --git a/git-add--interactive.perl b/git-add--interactive.perl
> index 642cce1..ee3d812 100755
> --- a/git-add--interactive.perl
> +++ b/git-add--interactive.perl
> @@ -45,6 +45,7 @@ my ($diff_new_color) =
>  my $normal_color = $repo->get_color("", "reset");
>  
>  my $diff_algorithm = $repo->config('diff.algorithm');
> +my $diff_indent_heuristic = $repo->config_bool('diff.indentheuristic');
>  my $diff_compaction_heuristic = 
> $repo->config_bool('diff.compactionheuristic');

Nice touch.

Unfortunately the mutual-exclusivity handling will probably bleed over
to here, too.

> +/*
> + * If a line is indented more than this, get_indent() just returns this 
> value.
> + * This avoids having to do absurd amounts of work for data that are not
> + * human-readable text, and also ensures that the output of get_indent fits 
> within
> + * an int.
> + */
> +#define MAX_INDENT 200

Speaking of absurd amounts of work, I was curious if there was a
noticeable performance penalty for using this heuristic (just because
it's a lot more complicated than the others). I couldn't detect any
differences running "git log -p --no-merges -3000" on git.git with no
heuristic, compaction, and indent. There may be other repositories that
behave more pathologically (it looks like having 20 blank lines at the
end of each hunk?), but I'd guess in most cases this will always be
drowned out in the noise of doing the actual diff.

> +#define START_OF_FILE_BONUS 9
> +#define END_OF_FILE_BONUS 46
> +#define TOTAL_BLANK_WEIGHT 4
> +#define PRE_BLANK_WEIGHT 16
> +#define RELATIVE_INDENT_BONUS -1
> +#define RELATIVE_INDENT_HAS_BLANK_BONUS 15
> +#define RELATIVE_OUTDENT_BONUS -19
> +#define RELATIVE_OUTDENT_HAS_BLANK_BONUS 2
> +#define RELATIVE_DEDENT_BONUS -63
> +#define RELATIVE_DEDENT_HAS_BLANK_BONUS 50

I see there is a comment below here mentioning that these are empirical
voodoo, but it might be worth one at the top (or just moving these below
the comment) because the comment looks like it's just associated with
the function (and these are sufficiently bizarre that anybody reading is
going to double-take on them).

> +return 10 * score - bonus;

I don't mind this not "10" not being a #define constant, but after
reading the exchange between you and Stefan, I think it would be nice to
describe what it is in a comment. The rest of the function is commented
so nicely that this one left me thinking "huh?" upon seeing the "10".

The rest looks sane to me, though I am not sure I have absorbed all the
implications. IMHO the most interesting thing is the actual results,
though.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs

2016-08-03 Thread Jacob Keller
On Wed, Aug 3, 2016 at 3:36 PM, Michael Haggerty  wrote:
> On 08/04/2016 12:29 AM, Jacob Keller wrote:
>> On Wed, Aug 3, 2016 at 3:00 PM, Michael Haggerty  
>> wrote:
>> It seems odd to be that a line with "199" spaces and nothing else will
>> return "-1" but a line with 200 spaces and nothing else will return
>> 200..? Would it be safe to just return -1 in both cases (if a line is
>> all spaces or starts with more than 200 spaces just return -1)?
>>
>>> +}
>>> +
>
> Thanks for your feedback.
>
> I was implicitly assuming that such lines would have text somewhere
> after those 200 spaces (or 25 TABs or whatever). But you're right, the
> line could consist only of whitespace. Unfortunately, the only way to
> distinguish these two cases is to read the rest of the line, which is
> exactly what we *don't* want to do.
>
> But I think it doesn't matter anyway. Such "text" will likely never be
> read by a human, so it's not a big deal if the slider position is not
> picked perfectly. And remember, this whole saga is just to improve the
> aesthetics of the diff. The diff is *correct* (e.g., in the sense of
> applicable) regardless of where we position the sliders.
>
> Michael
>

I think in this case treating it as "all whitespace" is more natural
than treating it as "200 characters with something following it"
because the only thing we've found so far is all white space.

Either way it's not really a big deal here.

Thanks,
Jake
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs

2016-08-03 Thread Stefan Beller
On Wed, Aug 3, 2016 at 4:30 PM, Michael Haggerty  wrote:

> This is an important point for the optimization, but less so for the
> implementation of the heuristic here. I was dynamically optimizing the
> ten other variables, and everything that goes into the bonus includes
> one of those factors. If I had also let this factor of 10 vary, then the
> net behavior of the algorithm would be completely unchanged if I would,
> say, double all eleven parameters. This is bad for optimization, because
> (1) it increases the search space unnecessarily, and (2) it means that
> whole lines in the parameter space give identical behavior, making the
> algorithm waste time searching along those lines for a minimum.
>
>> So another way to write it
>> would be
>>
>> score - bonus/10;
>>
>> assuming the values of score and bonus are large enough.
>
> Score is the number of columns that some line is indented, so it can be
> 0 or 1 or any other positive integer. It is not "large enough", which is
> why the "10" can't be "1".

Right, I should have made it more clear that it was a hypothetical rewrite,
just to point out we are looking at only one of score or bonus.

>> but if I look at the boni definitions ranging from -63..50 they are scaled up
>> so much that the bonus may become larger than '1' unit of 'score', i.e.
>> it is not an epsilon any more. Or to put it another way:
>> If we were to s/10/100/ the results would be worse.
>
> If you would change s/10/100/ and simultaneously multiply the other
> constants by 10, the end results would be unchanged.

Right, so maybe a good name would be CONSTANT_SCALE_OF_ONE_INDENT
as it has the meaning that a bonus of 10 is equivalent of "one indent"
in the weighting.

Speaking of which, do we want to "over-optimize" to make that constant a
power of 2 as that is a supposedly faster multiplication?
(Just asking, feel free to reject; as I can imagine the numbers itself are
already magic, so why scale them with 42?^H^H^H 16?)

>
>> Rather the 10 describes the ratio of "advanced magic" to pure indentation
>> based scoring in my understanding.
>
> No, it's basically just a number against which the other constants are
> compared. E.g., if another bonus wants to balance out against exactly
> one space of indentation, its constant needs to be 10. If it wants to
> balance out against exactly 5 spaces, its constant needs to be 50. Etc.

So another interpretation is that the 10 gives the resolution for all other
constants, i.e. if we keep 10, then we can only give weights in 1/10 of
"one indent". But the "ideal" weight may not be a multiple of 1/10,
so we approximate them to the nearest multiple of 1/10.

If we were to use 1000 here, we could have a higher accuracy of the
other constants, but probably we do not care about the 3rd decimal place
for these because they are created heuristically from a corpus that may
not warrant a precision of constants with a 3rd decimal place.


Stefan
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs

2016-08-03 Thread Michael Haggerty
On 08/04/2016 12:51 AM, Stefan Beller wrote:
> On Wed, Aug 3, 2016 at 3:41 PM, Michael Haggerty  wrote:
>> On 08/04/2016 12:30 AM, Stefan Beller wrote:
>>> On Wed, Aug 3, 2016 at 3:00 PM, Michael Haggerty  
>>> wrote:
>>>
 +return 10 * score - bonus;
>>>
>>> Would it make sense to define-ify the 10 as well
>>> as this is the only hardcoded number in the
>>> scoring function?
>>
>> I started answering this question by explaining why it is not important
>> to *optimize* the number 10 (namely because scores are only ever
>> compared against other scores, so an overall scaling factor makes no
>> difference). The factor 10 only has to be large enough to provide enough
>> dynamic range for the other (adjustable) parameters.
> 
> But it only scales the score, not the bonus.

Yes, that's how it provides the overall scale of the score. If it
multiplied both values, then it would be completely pointless.

This is an important point for the optimization, but less so for the
implementation of the heuristic here. I was dynamically optimizing the
ten other variables, and everything that goes into the bonus includes
one of those factors. If I had also let this factor of 10 vary, then the
net behavior of the algorithm would be completely unchanged if I would,
say, double all eleven parameters. This is bad for optimization, because
(1) it increases the search space unnecessarily, and (2) it means that
whole lines in the parameter space give identical behavior, making the
algorithm waste time searching along those lines for a minimum.

> So another way to write it
> would be
> 
> score - bonus/10;
> 
> assuming the values of score and bonus are large enough.

Score is the number of columns that some line is indented, so it can be
0 or 1 or any other positive integer. It is not "large enough", which is
why the "10" can't be "1".

If the calculations were done in floating point, then the factor could
be "1", because then the other factors could be made less than one.

> In some prior conversation you said you take the indent and add an epsilon
> for some special conditions to make one indent better than the other.
> 
> Epsilons are usually very small compared to the rest of the equation,

I should have mentioned that this heuristic is quite a bit more advanced
than my original proposal to use "indent" plus some "epsilon" factors.
The old discussion about epsilons is not relevant here except maybe as
an inspiration and starting point for this version.

> but if I look at the boni definitions ranging from -63..50 they are scaled up
> so much that the bonus may become larger than '1' unit of 'score', i.e.
> it is not an epsilon any more. Or to put it another way:
> If we were to s/10/100/ the results would be worse.

If you would change s/10/100/ and simultaneously multiply the other
constants by 10, the end results would be unchanged.

> Rather the 10 describes the ratio of "advanced magic" to pure indentation
> based scoring in my understanding.

No, it's basically just a number against which the other constants are
compared. E.g., if another bonus wants to balance out against exactly
one space of indentation, its constant needs to be 10. If it wants to
balance out against exactly 5 spaces, its constant needs to be 50. Etc.

Michael

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs

2016-08-03 Thread Stefan Beller
On Wed, Aug 3, 2016 at 3:41 PM, Michael Haggerty  wrote:
> On 08/04/2016 12:30 AM, Stefan Beller wrote:
>> On Wed, Aug 3, 2016 at 3:00 PM, Michael Haggerty  
>> wrote:
>>
>>> +return 10 * score - bonus;
>>
>> Would it make sense to define-ify the 10 as well
>> as this is the only hardcoded number in the
>> scoring function?
>
> I started answering this question by explaining why it is not important
> to *optimize* the number 10 (namely because scores are only ever
> compared against other scores, so an overall scaling factor makes no
> difference). The factor 10 only has to be large enough to provide enough
> dynamic range for the other (adjustable) parameters.

But it only scales the score, not the bonus. So another way to write it
would be

score - bonus/10;

assuming the values of score and bonus are large enough.

In some prior conversation you said you take the indent and add an epsilon
for some special conditions to make one indent better than the other.

Epsilons are usually very small compared to the rest of the equation,
but if I look at the boni definitions ranging from -63..50 they are scaled up
so much that the bonus may become larger than '1' unit of 'score', i.e.
it is not an epsilon any more. Or to put it another way:
If we were to s/10/100/ the results would be worse.

Rather the 10 describes the ratio of "advanced magic" to pure indentation
based scoring in my understanding.

>
> But I think you are asking a simpler question: should we give this
> constant a name rather than hardcoding it? I don't see a strong reason
> for or against, so I'll give it a name in the next version, as you suggest.

Thanks,
Stefan
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs

2016-08-03 Thread Michael Haggerty
On 08/04/2016 12:30 AM, Stefan Beller wrote:
> On Wed, Aug 3, 2016 at 3:00 PM, Michael Haggerty  wrote:
> 
>> +return 10 * score - bonus;
> 
> Would it make sense to define-ify the 10 as well
> as this is the only hardcoded number in the
> scoring function?

I started answering this question by explaining why it is not important
to *optimize* the number 10 (namely because scores are only ever
compared against other scores, so an overall scaling factor makes no
difference). The factor 10 only has to be large enough to provide enough
dynamic range for the other (adjustable) parameters.

But I think you are asking a simpler question: should we give this
constant a name rather than hardcoding it? I don't see a strong reason
for or against, so I'll give it a name in the next version, as you suggest.

Michael

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs

2016-08-03 Thread Michael Haggerty
On 08/04/2016 12:29 AM, Jacob Keller wrote:
> On Wed, Aug 3, 2016 at 3:00 PM, Michael Haggerty  wrote:
>> +/*
>> + * If a line is indented more than this, get_indent() just returns this 
>> value.
>> + * This avoids having to do absurd amounts of work for data that are not
>> + * human-readable text, and also ensures that the output of get_indent fits 
>> within
>> + * an int.
>> + */
>> +#define MAX_INDENT 200
>> +
>> +/*
>> + * Return the amount of indentation of the specified line, treating TAB as 8
>> + * columns. Return -1 if line is empty or contains only whitespace. Clamp 
>> the
>> + * output value at MAX_INDENT.
>> + */
>> +static int get_indent(xrecord_t *rec)
>> +{
>> +   long i;
>> +   int ret = 0;
>> +
>> +   for (i = 0; i < rec->size; i++) {
>> +   char c = rec->ptr[i];
>> +
>> +   if (!XDL_ISSPACE(c))
>> +   return ret;
>> +   else if (c == ' ')
>> +   ret += 1;
>> +   else if (c == '\t')
>> +   ret += 8 - ret % 8;
>> +   /* ignore other whitespace characters */
>> +
>> +   if (ret >= MAX_INDENT)
>> +   return MAX_INDENT;
> 
> Should we return -1 here?
> 
>> +   }
>> +   /*
>> +* We have reached the end of the line without finding any non-space
>> +* characters; i.e., the whole line consists of trailing spaces, 
>> which we
>> +* are not interested in.
>> +*/
>> +   return -1;
> 
> It seems odd to be that a line with "199" spaces and nothing else will
> return "-1" but a line with 200 spaces and nothing else will return
> 200..? Would it be safe to just return -1 in both cases (if a line is
> all spaces or starts with more than 200 spaces just return -1)?
> 
>> +}
>> +

Thanks for your feedback.

I was implicitly assuming that such lines would have text somewhere
after those 200 spaces (or 25 TABs or whatever). But you're right, the
line could consist only of whitespace. Unfortunately, the only way to
distinguish these two cases is to read the rest of the line, which is
exactly what we *don't* want to do.

But I think it doesn't matter anyway. Such "text" will likely never be
read by a human, so it's not a big deal if the slider position is not
picked perfectly. And remember, this whole saga is just to improve the
aesthetics of the diff. The diff is *correct* (e.g., in the sense of
applicable) regardless of where we position the sliders.

Michael

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs

2016-08-03 Thread Stefan Beller
On Wed, Aug 3, 2016 at 3:00 PM, Michael Haggerty  wrote:

> +return 10 * score - bonus;

Would it make sense to define-ify the 10 as well
as this is the only hardcoded number in the
scoring function?
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs

2016-08-03 Thread Jacob Keller
On Wed, Aug 3, 2016 at 3:00 PM, Michael Haggerty  wrote:
> +/*
> + * If a line is indented more than this, get_indent() just returns this 
> value.
> + * This avoids having to do absurd amounts of work for data that are not
> + * human-readable text, and also ensures that the output of get_indent fits 
> within
> + * an int.
> + */
> +#define MAX_INDENT 200
> +
> +/*
> + * Return the amount of indentation of the specified line, treating TAB as 8
> + * columns. Return -1 if line is empty or contains only whitespace. Clamp the
> + * output value at MAX_INDENT.
> + */
> +static int get_indent(xrecord_t *rec)
> +{
> +   long i;
> +   int ret = 0;
> +
> +   for (i = 0; i < rec->size; i++) {
> +   char c = rec->ptr[i];
> +
> +   if (!XDL_ISSPACE(c))
> +   return ret;
> +   else if (c == ' ')
> +   ret += 1;
> +   else if (c == '\t')
> +   ret += 8 - ret % 8;
> +   /* ignore other whitespace characters */
> +
> +   if (ret >= MAX_INDENT)
> +   return MAX_INDENT;

Should we return -1 here?

> +   }
> +   /*
> +* We have reached the end of the line without finding any non-space
> +* characters; i.e., the whole line consists of trailing spaces, 
> which we
> +* are not interested in.
> +*/
> +   return -1;

It seems odd to be that a line with "199" spaces and nothing else will
return "-1" but a line with 200 spaces and nothing else will return
200..? Would it be safe to just return -1 in both cases (if a line is
all spaces or starts with more than 200 spaces just return -1)?

> +}
> +
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 8/8] diff: improve positioning of add/delete blocks in diffs

2016-08-03 Thread Michael Haggerty
Some groups of added/deleted lines in diffs can be slid up or down,
because lines at the edges of the group are not unique. Picking good
shifts for such groups is not a matter of correctness but definitely has
a big effect on aesthetics. For example, consider the following two
diffs. The first is what standard Git emits:

--- a/9c572b21dd090a1e5c5bb397053bf8043ffe7fb4:git-send-email.perl
+++ b/6dcfa306f2b67b733a7eb2d7ded1bc9987809edb:git-send-email.perl
@@ -231,6 +231,9 @@ if (!defined $initial_reply_to && $prompting) {
 }

 if (!$smtp_server) {
+   $smtp_server = $repo->config('sendemail.smtpserver');
+}
+if (!$smtp_server) {
foreach (qw( /usr/sbin/sendmail /usr/lib/sendmail )) {
if (-x $_) {
$smtp_server = $_;

The following diff is equivalent, but is obviously preferable from an
aesthetic point of view:

--- a/9c572b21dd090a1e5c5bb397053bf8043ffe7fb4:git-send-email.perl
+++ b/6dcfa306f2b67b733a7eb2d7ded1bc9987809edb:git-send-email.perl
@@ -230,6 +230,9 @@ if (!defined $initial_reply_to && $prompting) {
$initial_reply_to =~ s/(^\s+|\s+$)//g;
 }

+if (!$smtp_server) {
+   $smtp_server = $repo->config('sendemail.smtpserver');
+}
 if (!$smtp_server) {
foreach (qw( /usr/sbin/sendmail /usr/lib/sendmail )) {
if (-x $_) {

This patch teaches Git to pick better positions for such "diff sliders".

The original Git code basically always shifted such "sliders" as far
down in the file as possible. The only exception is when the slider can
be aligned with a group of the lines in the other file, in which case
Git favors one add+delete block over one add and a slightly offset
delete block. This naive algorithm often yields ugly diffs.

Commit d634d61ed6 improved the situation somewhat by preferring to
position add/delete groups to make their last line a blank line, when
that is possible. This heuristic does more good than harm, but can only
help if there are blank lines in the right places. It still leaves a lot
of ugly diffs.

This commit implements a new and much better heuristic for picking
optimal "slider" positions using the following approach: First observe
that each hypothetical positioning of a diff slider introduces two
splits: one between the context lines preceding the group and the first
added/deleted line, and the other between the last added/deleted line
and the first line of context following it. It tries to find the
positioning that creates the least bad splits.

Splits are evaluated based only on the presence and locations of nearby
blank lines, and the indentation of lines near the split. Basically, it
prefers to introduce splits between lines that are indented less and
adjacent to blank lines. In more detail:

1. It measures the following characteristics of a proposed splitting
   position:

   * the number of blank lines above the proposed split
   * whether the line directly after the split is blank
   * the number of blank lines following that line
   * the indentation of the nearest non-blank line above the split
   * the indentation of the line directly below the split
   * the indentation of the nearest non-blank line after that line

2. It combines these attributes using a bunch of empirically-optimized
   weighting factors to estimate a score of the "badness" of splitting
   the text at that position.

3. It defines the score for a positioning of a diff slider to be the sum
   of the scores for the splits at the top and bottom of the slider.

4. It computes scores like this for all possible positions of the diff
   slider, and selects the position with the smallest "badness" score.

The weighting factors were found by collecting a corpus of code samples
in various programming languages, deciding "by eye" the best positioning
for about 2700 diff sliders, then optimizing the weights against this
corpus to get the best agreement with the manually-determined values.
(One caveat is that the same corpus was used for both optimization and
testing.)

The resulting numbers of non-optimal diff groups were as follows:

| repository   | default | compaction | indent | optimized |
|  | --- | -- | -- | - |
| ant  | 225 |102 |  7 | 5 |
| bugzilla | 208 | 81 | 14 | 8 |
| couchdb  |  44 | 24 | 13 | 4 |
| docker   | 180 |160 | 29 | 7 |
| git  | 446 | 59 | 27 | 4 |
| ipython  | 358 |163 | 61 |11 |
| junit| 146 | 67 |  5 | 1 |
| nodejs   | 489 | 78 | 12 | 2 |
| phpmyadmin   | 330 | 49 |  1 | 0 |
| test-more|  15 |  2 |  2 | 0