Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
Stefan Bellerwrites: > Maybe we can enable Michaels heuristic with the same > config/command line flag, i.e. "the flag changes its algorithm"? I think that is a very sensible proposal. After all, the name diff.compactionHeuristic only tells us what part of the diff process the heuristic is used, and does not say anything about what the heuristics does. It is neutral between "take a blank as a hint" vs "take indentation leveles as a hint". -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
> Is there a case where the compaction heuristic produces a better result > than this indent heuristic? AFAICT, you have not found one, and I'd be > surprised if there is one, because this _seems_ like a superset > generally. I suppose there is always the possibility that the empirical > knobs behave badly in some particular case that the compaction heuristic > just happens to get right, but it should be quite rare. This is how I understand it as well. I would not mind to remove the blank-line-suggested-split heuristic. Maybe we can enable Michaels heuristic with the same config/command line flag, i.e. "the flag changes its algorithm"? Then people, who read the prior announcement don't have to do anything, while we keep it as an experimental feature for the next release and auto-on it after that? We could also be a bit more aggressive and auto-on the new heuristic with the old heuristic removed and we only have an (undocumented) emergency-off knob for one or two releases? Thanks, Stefan -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
On Sat, Aug 13, 2016 at 8:59 AM, Junio C Hamanowrote: > Jeff King writes: > >> So assuming everything I just said isn't complete bollocks, I think we >> can move to a future where nobody uses the compaction heuristic. And >> there are three ways to deal with that: >> >> 1. The knob and feature stay. It might be useful for somebody who >> wants to experiment in the future. >> >> 2. The knob and feature go away completely. It was an experiment, but >> now we have something more useful. >> >> 3. The feature goes away, but the knob stays as noop, or maybe as an >> alias for the indent heuristic, just because we did ship a version >> that accepts "--compaction-heuristic", and maybe somebody somewhere >> put it in a script? >> >> I think I'd be in favor of (2). > > I am all for (2) [*1*] > I also am in favor of (2). I understand the reasoning for maintaining compatibility, but this was a known experimental feature that was unlikely used by many people. Even if it was, these are the very sorts of people who should be aware that the experimental feature is going away. It reduces code complexity if it just goes away, and I believe the new heuristic is much better (Thank you Michael!) As for a knob on the new feature, I think it can become the default with a way to disable the feature via command line. I'm not really sure it needs a config option at all. > This and the previous "take a blank line as a hint" are both > heuristics. As long as the resulting code does not tax runtime > performance visibly and improves the resulting output 99% of the > time, there is no reason to leave end-users a knob. "Among 9 hunks > in this patch that touch hello.c, 7 are made much more readable but > 2 are worse" cannot even be helped with a command line option. > Yea I agree. It might be worth having it disabled via the stable patch IDs (? I don't know if we guarantee this?) but otherwise I don't see it being important either way. I would vote for a way to disable it via command line just because we *are* changing behavior here. But I don't think it needs to be a config option at all. > > [Footnote] > > *1* I am also strongly against (3), if only to teach people a > lesson ;-). -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
Jeff Kingwrites: > So assuming everything I just said isn't complete bollocks, I think we > can move to a future where nobody uses the compaction heuristic. And > there are three ways to deal with that: > > 1. The knob and feature stay. It might be useful for somebody who > wants to experiment in the future. > > 2. The knob and feature go away completely. It was an experiment, but > now we have something more useful. > > 3. The feature goes away, but the knob stays as noop, or maybe as an > alias for the indent heuristic, just because we did ship a version > that accepts "--compaction-heuristic", and maybe somebody somewhere > put it in a script? > > I think I'd be in favor of (2). I am all for (2) [*1*] This and the previous "take a blank line as a hint" are both heuristics. As long as the resulting code does not tax runtime performance visibly and improves the resulting output 99% of the time, there is no reason to leave end-users a knob. "Among 9 hunks in this patch that touch hello.c, 7 are made much more readable but 2 are worse" cannot even be helped with a command line option. [Footnote] *1* I am also strongly against (3), if only to teach people a lesson ;-). -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
On Sat, Aug 13, 2016 at 01:25:05AM +0200, Michael Haggerty wrote: > > These two flags are mutually exclusive in the xdiff code, so we should > > probably handle that here. > > > > TBH, I do not care that much what: > > > > [diff] > > compactionHeuristic = true > > indentHeuristic = true > > > > does. But right now: > > > > git config diff.compactionHeuristic true > > git show --indent-heuristic > > > > still prefers the compaction heuristic, which I think is objectively > > wrong. > > I wasn't worrying about that yet, given that these two features are both > still experimental. I also have a strong inkling that at most one of > them needs to be made permanent. I propose that I repair the semantics > in the simplest way possible for now while we decide on the long-term > plan, which might conceivably be: > > * keep both options permanently > * keep only one option permanently > * choose one heuristic and use it always (i.e., make it part of the new > standard one-and-only diff algorithm) > * discard both heuristics (I hope not!) > > After we've decided on that, *then* let's decide on a suitable UI and > implement it before we declare either feature non-experimental. Is there a case where the compaction heuristic produces a better result than this indent heuristic? AFAICT, you have not found one, and I'd be surprised if there is one, because this _seems_ like a superset generally. I suppose there is always the possibility that the empirical knobs behave badly in some particular case that the compaction heuristic just happens to get right, but it should be quite rare. So assuming everything I just said isn't complete bollocks, I think we can move to a future where nobody uses the compaction heuristic. And there are three ways to deal with that: 1. The knob and feature stay. It might be useful for somebody who wants to experiment in the future. 2. The knob and feature go away completely. It was an experiment, but now we have something more useful. 3. The feature goes away, but the knob stays as noop, or maybe as an alias for the indent heuristic, just because we did ship a version that accepts "--compaction-heuristic", and maybe somebody somewhere put it in a script? I think I'd be in favor of (2). It doesn't seem likely enough for people to experiment with to merit a run-time knob; they can always patch and build if they want to do so. And (3) just seems like a pain for something that was only shipped in one version and was kind of experimental, and was unlikely to end up in scripts (much more likely is that people set the config, but that's easier to ignore). But it does violate our usual backwards-compatibility rules. So if we assume that indent is useful and compaction goes away, the only questions are "does indent it become the default" and "if so, does it still get a knob". I'd say "yes" to both. Making the new behavior the default was what we planned to do with compaction until we saw that it regressed some cases. But as a new feature, it's nice for users to be able to easily disable it to see if it's causing a problem (or to see what a big improvement it is!). I think we could get by with just a command-line option for that purpose, rather than a config option; that saves a lot of effort in having porcelains manually propagate the config option when they call a plumbing diff-tree. I guess the only users that leaves out are ones who really want stable backwards-compatible diff. I guess "patch --stable" is one such user (but that one we could handle internally). But let's say you had a code review system that attached comments to lines of a diff. You might want to disable the feature entirely to avoid invalidating old comments. -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
On 08/04/2016 06:55 PM, Stefan Beller wrote: > [...] > I have just reread the scoring function and I think you could pull out the > `score=indent` assignment (it is always assigned except for indent <0) > > if (indent == -1) >score = 0; > else >score = indent; > ... lots of bonus computation below, which in its current > implementation > have lots of "score = indent;" lines as well. Yes. An earlier version of the heuristic used different indent values in different situations, but that's gone away so the code can be made simpler now. I'll make the change. Thanks, Michael -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
On 08/04/2016 09:52 PM, Junio C Hamano wrote: > Michael Haggertywrites: >> +#define START_OF_FILE_BONUS 9 >> +#define END_OF_FILE_BONUS 46 >> +#define TOTAL_BLANK_WEIGHT 4 >> +#define PRE_BLANK_WEIGHT 16 >> +#define RELATIVE_INDENT_BONUS -1 >> +#define RELATIVE_INDENT_HAS_BLANK_BONUS 15 >> +#define RELATIVE_OUTDENT_BONUS -19 >> +#define RELATIVE_OUTDENT_HAS_BLANK_BONUS 2 > > When I read up to here, I thought "Heh, isn't the opposite of INdent > DEdent?" and then saw this: > >> +#define RELATIVE_DEDENT_BONUS -63 >> +#define RELATIVE_DEDENT_HAS_BLANK_BONUS 50 > > It turns out that you mean by OUTdent a line that indents further > (if I am reading the code correctly). Is that obvious to everybody? I'll comment it better. >> +/* Bonuses based on the location of blank lines: */ >> +bonus += TOTAL_BLANK_WEIGHT * total_blanks; >> +bonus += PRE_BLANK_WEIGHT * m->pre_blank; > > This and ... > >> +} else if (indent > m->pre_indent) { >> +/* >> + * The line is indented more than its predecessor. Score it >> based >> + * on the larger indent: >> + */ >> +score = indent; >> +bonus += RELATIVE_INDENT_BONUS; >> +bonus += RELATIVE_INDENT_HAS_BLANK_BONUS * any_blanks; >> +} else if (indent < m->pre_indent) { > > ... this seems to be indented correctly even after getting quoted, > which in turn means most of the lines in the added code share > indent-with-non-tab badness. The code was copy-pasted from a Python prototype then converted to C :-) I'll fix the whitespace. Thanks, Michael -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
On 08/04/2016 09:56 AM, Jeff King wrote: > On Thu, Aug 04, 2016 at 12:00:36AM +0200, Michael Haggerty wrote: > >> This table shows the number of diff slider groups that were positioned >> differently than the human-generated values, for various repositories. >> "default" is the default "git diff" algorithm. "compaction" is Git 2.9.0 >> with the `--compaction-heuristic` option "indent" is an earlier, > > s/option/&./ Thanks, will change. >> static int diff_detect_rename_default; >> +static int diff_indent_heuristic; /* experimental */ >> static int diff_compaction_heuristic; /* experimental */ > > These two flags are mutually exclusive in the xdiff code, so we should > probably handle that here. > > TBH, I do not care that much what: > > [diff] > compactionHeuristic = true > indentHeuristic = true > > does. But right now: > > git config diff.compactionHeuristic true > git show --indent-heuristic > > still prefers the compaction heuristic, which I think is objectively > wrong. I wasn't worrying about that yet, given that these two features are both still experimental. I also have a strong inkling that at most one of them needs to be made permanent. I propose that I repair the semantics in the simplest way possible for now while we decide on the long-term plan, which might conceivably be: * keep both options permanently * keep only one option permanently * choose one heuristic and use it always (i.e., make it part of the new standard one-and-only diff algorithm) * discard both heuristics (I hope not!) After we've decided on that, *then* let's decide on a suitable UI and implement it before we declare either feature non-experimental. > [...] > Speaking of absurd amounts of work, I was curious if there was a > noticeable performance penalty for using this heuristic [...] I included some performance numbers in my response to Junio [1]. >> +#define START_OF_FILE_BONUS 9 >> +#define END_OF_FILE_BONUS 46 >> +#define TOTAL_BLANK_WEIGHT 4 >> +#define PRE_BLANK_WEIGHT 16 >> +#define RELATIVE_INDENT_BONUS -1 >> +#define RELATIVE_INDENT_HAS_BLANK_BONUS 15 >> +#define RELATIVE_OUTDENT_BONUS -19 >> +#define RELATIVE_OUTDENT_HAS_BLANK_BONUS 2 >> +#define RELATIVE_DEDENT_BONUS -63 >> +#define RELATIVE_DEDENT_HAS_BLANK_BONUS 50 > > I see there is a comment below here mentioning that these are empirical > voodoo, but it might be worth one at the top (or just moving these below > the comment) because the comment looks like it's just associated with > the function (and these are sufficiently bizarre that anybody reading is > going to double-take on them). Good idea. >> +return 10 * score - bonus; > > I don't mind this not "10" not being a #define constant, but after > reading the exchange between you and Stefan, I think it would be nice to > describe what it is in a comment. The rest of the function is commented > so nicely that this one left me thinking "huh?" upon seeing the "10". Done. Thanks for your review. Michael [1] http://public-inbox.org/git/5fe0edbc-3659-058f-3328-639d1343f...@alum.mit.edu/ -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
Michael Haggertywrites: >> After >> all, somebody in this file is already scanning each and every line >> to see where it ends to split the input into records, so perhaps a >> "right" (if the "theoretical correctness" of the return value from >> this function mattered, which you wave-away below) optimization >> could be to precompute it while the lines are broken into records >> and store it in the "rec" structure? > > That would certainly be possible, and would help in cases where there > are a lot of lines with lots of leading whitespace. You could get nearly > the same benefit by recording a single bit in struct rec, indicating > whether the line is blank or not. > > But it wouldn't help the worst case described above, where each call to > `git_indent()` is already very cheap. And I didn't think it was worth > allocating the extra memory to optimize this heuristic True. > I don't know. It seems like a pretty contrived justification for what is > basically, "your input is too weird for us. We're not going to break our > necks trying to give you the best possible slider positioning." True again. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
On 08/04/2016 02:04 AM, Stefan Beller wrote: > On Wed, Aug 3, 2016 at 4:30 PM, Michael Haggertywrote: >> Stefan Beller wrote: >>> [...] >>> Rather the 10 describes the ratio of "advanced magic" to pure indentation >>> based scoring in my understanding. >> >> No, it's basically just a number against which the other constants are >> compared. E.g., if another bonus wants to balance out against exactly >> one space of indentation, its constant needs to be 10. If it wants to >> balance out against exactly 5 spaces, its constant needs to be 50. Etc. > > So another interpretation is that the 10 gives the resolution for all other > constants, i.e. if we keep 10, then we can only give weights in 1/10 of > "one indent". But the "ideal" weight may not be a multiple of 1/10, > so we approximate them to the nearest multiple of 1/10. > > If we were to use 1000 here, we could have a higher accuracy of the > other constants, but probably we do not care about the 3rd decimal place > for these because they are created heuristically from a corpus that may > not warrant a precision of constants with a 3rd decimal place. Not only that. Since all of the inputs to the heuristic are integers, the outputs are discontinuous and can take only certain discrete values. And the outputs are only ever compared against one another. So a small adjustment to the output will only make a difference if it causes the value to become larger/smaller than another of the possible output values. So too high a resolution makes no sense at all. That being said, I didn't actually check in any systematic way whether the resolution of 10:1 is high enough in practice. But I can say that the overall performance of the heuristic (in terms of number of errors counted) remained constant over a relatively large parameter range, so I think the resolution is sufficient. Feel free to play with the parameters and/or other heuristics. The tools and raw data are all published in [1]. Let me know if you need help getting started. Michael [1] https://github.com/mhagger/diff-slider-tools -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
On 08/04/2016 09:39 PM, Junio C Hamano wrote: > Michael Haggertywrites: > + } + /* +* We have reached the end of the line without finding any non-space +* characters; i.e., the whole line consists of trailing spaces, which we +* are not interested in. +*/ + return -1; > > Not related to Jacob's review, but "the whole line consists of > trailing spaces" made me read it twice; while it is technically > correct, "the whole line consists of spaces", or even "this is a > blank line", would read a lot more easily, at least for me. Hehe, yes, ETOOMANYWORDS. >> I was implicitly assuming that such lines would have text somewhere >> after those 200 spaces (or 25 TABs or whatever). But you're right, the >> line could consist only of whitespace. Unfortunately, the only way to >> distinguish these two cases is to read the rest of the line, which is >> exactly what we *don't* want to do. > > Hmm, why is it exactly what we don't want to do? Is it a > performance concern? In other words, is it because this function is > called many times to measure the same line multiple times? Yes, performance is the reason, and especially the desire to avoid unreasonable runtimes for pathological cases. Thanks for asking, though, because it's worthwhile to look into this more rigorously. Suppose there is a slider that can be shifted to any of `numshift` positions. Then we have to call `measure_split()` `2*numshift` times (once for the beginning and once for the end of each candidate slider position). Suppose there are `numblanks` blank lines in the neighborhood of the slider. Each time we call `measure_split()`, we count the number of blank lines before and after the proposed split position. So we end up calling `get_indent()` `2*numshift*numblanks` times. Finally, suppose that the blank lines each contain `numws` whitespace characters. Then each call to `get_indent()` has to do `O(numws)` work. So altogether, if there were no limits, then the amount of work to position a slider would scale like O(numshift * numblanks * numws) However, the total number of characters in the file might only be O(numblanks * numws) So without limits, the amount of work to position sliders could scale by numshift times the size of the file. The worst case is pretty easy to achieve. Just create a file consisting of a million or so LF characters, then add another LF to it. The diff would be a slider with numshift = 100 numblanks = 100 numws = 1 so the heuristic would take O(N^2) in the size of the file. Effectively the limits cap the effective `numblanks` at `2*MAX_BLANKS` (which is 2*20) and the effective `numws` at `MAX_INDENT` (which is 200), meaning that the maximum effort scales like numshift * 2*20 * 200 which is still a big number but not absurd. Empirically, for the case described above, `git diff` takes 104 ms and `git diff --indent-heuristic` takes 490 ms. I think that's not prohibitive for a pathological case. Meanwhile, Myers's algorithm scales like O(ND), where N is the number of lines and D is the edit distance, so I suppose that it is already possible to find diffs that are intractable to compute. > After > all, somebody in this file is already scanning each and every line > to see where it ends to split the input into records, so perhaps a > "right" (if the "theoretical correctness" of the return value from > this function mattered, which you wave-away below) optimization > could be to precompute it while the lines are broken into records > and store it in the "rec" structure? That would certainly be possible, and would help in cases where there are a lot of lines with lots of leading whitespace. You could get nearly the same benefit by recording a single bit in struct rec, indicating whether the line is blank or not. But it wouldn't help the worst case described above, where each call to `git_indent()` is already very cheap. And I didn't think it was worth allocating the extra memory to optimize this heuristic * which isn't used all that often in the first place, * which (for normal inputs) doesn't take a significant amount of time, and * when the optimization doesn't improve the worst-case scenario (and thus any DoS potential) I think the only way to ensure O(size_of_file) runtime in the above case would be to record, along with each line, how many blank lines immediately precede and succeed it. You could achieve something like O(size_of_file lg(size_of_file)) by storing, e.g., the total number of nonblank lines that precede each line and doing a binary search to find the nearest non-blank line. >> But I think it doesn't matter anyway. Such "text" will likely never be >> read by a human, so it's not a big deal if the slider position is not >> picked perfectly. And remember, this whole saga is just to improve the >> aesthetics of the diff. The diff is *correct*
Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
Michael Haggertywrites: I agree with Peff about "comment on the voodoo upfront". > +#define START_OF_FILE_BONUS 9 > +#define END_OF_FILE_BONUS 46 > +#define TOTAL_BLANK_WEIGHT 4 > +#define PRE_BLANK_WEIGHT 16 > +#define RELATIVE_INDENT_BONUS -1 > +#define RELATIVE_INDENT_HAS_BLANK_BONUS 15 > +#define RELATIVE_OUTDENT_BONUS -19 > +#define RELATIVE_OUTDENT_HAS_BLANK_BONUS 2 When I read up to here, I thought "Heh, isn't the opposite of INdent DEdent?" and then saw this: > +#define RELATIVE_DEDENT_BONUS -63 > +#define RELATIVE_DEDENT_HAS_BLANK_BONUS 50 It turns out that you mean by OUTdent a line that indents further (if I am reading the code correctly). Is that obvious to everybody? > + /* Bonuses based on the location of blank lines: */ > +bonus += TOTAL_BLANK_WEIGHT * total_blanks; > + bonus += PRE_BLANK_WEIGHT * m->pre_blank; This and ... > +} else if (indent > m->pre_indent) { > + /* > + * The line is indented more than its predecessor. Score it > based > + * on the larger indent: > + */ > + score = indent; > + bonus += RELATIVE_INDENT_BONUS; > + bonus += RELATIVE_INDENT_HAS_BLANK_BONUS * any_blanks; > + } else if (indent < m->pre_indent) { ... this seems to be indented correctly even after getting quoted, which in turn means most of the lines in the added code share indent-with-non-tab badness. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
Stefan Bellerwrites: > I have just reread the scoring function and I think you could pull out the > `score=indent` assignment (it is always assigned except for indent <0) > > if (indent == -1) >score = 0; > else >score = indent; > ... lots of bonus computation below, which in its current > implementation > have lots of "score = indent;" lines as well. Yup. If each part in this if/else if/... cascade independently sets complete information (i.e. both "bonus" and "score") necessary for the final result, then I do not mind the same "score = indent" in many of them (these case happen to get the same score), but that is not what we have in this code (i.e. "bonus" has a shared component that is not affected by thie if/else if/ cascade), so setting score to indent upfront and reset it to 0 only on a blank line would make sense. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
Michael Haggertywrites: >>> + } >>> + /* >>> +* We have reached the end of the line without finding any non-space >>> +* characters; i.e., the whole line consists of trailing spaces, >>> which we >>> +* are not interested in. >>> +*/ >>> + return -1; Not related to Jacob's review, but "the whole line consists of trailing spaces" made me read it twice; while it is technically correct, "the whole line consists of spaces", or even "this is a blank line", would read a lot more easily, at least for me. > I was implicitly assuming that such lines would have text somewhere > after those 200 spaces (or 25 TABs or whatever). But you're right, the > line could consist only of whitespace. Unfortunately, the only way to > distinguish these two cases is to read the rest of the line, which is > exactly what we *don't* want to do. Hmm, why is it exactly what we don't want to do? Is it a performance concern? In other words, is it because this function is called many times to measure the same line multiple times? After all, somebody in this file is already scanning each and every line to see where it ends to split the input into records, so perhaps a "right" (if the "theoretical correctness" of the return value from this function mattered, which you wave-away below) optimization could be to precompute it while the lines are broken into records and store it in the "rec" structure? > But I think it doesn't matter anyway. Such "text" will likely never be > read by a human, so it's not a big deal if the slider position is not > picked perfectly. And remember, this whole saga is just to improve the > aesthetics of the diff. The diff is *correct* (e.g., in the sense of > applicable) regardless of where we position the sliders. A better argument may be "if the user is truly reading a diff output for such an unusual "text", it is likely that she has a very wide display and/or running less -S, and treating such an overindented line as if it were a blank line would give a result that is more consistent to what appears on her display", perhaps? -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
On Thu, Aug 4, 2016 at 12:56 AM, Jeff Kingwrote: > On Thu, Aug 04, 2016 at 12:00:36AM +0200, Michael Haggerty wrote: > >> This table shows the number of diff slider groups that were positioned >> differently than the human-generated values, for various repositories. >> "default" is the default "git diff" algorithm. "compaction" is Git 2.9.0 >> with the `--compaction-heuristic` option "indent" is an earlier, > > s/option/&./ > >> static int diff_detect_rename_default; >> +static int diff_indent_heuristic; /* experimental */ >> static int diff_compaction_heuristic; /* experimental */ > > These two flags are mutually exclusive in the xdiff code, so we should > probably handle that here. > > TBH, I do not care that much what: > > [diff] > compactionHeuristic = true > indentHeuristic = true > > does. But right now: > > git config diff.compactionHeuristic true > git show --indent-heuristic > > still prefers the compaction heuristic, which I think is objectively > wrong. > > So perhaps we need a single variable: > > enum { > DIFF_HEURISTIC_COMPACTION, > DIFF_HEURISTIC_INDENT > } diff_heuristic; > > and set it in last-one-wins fashion (it would be nice if the config and > command line options were shaped the same way so it's clear to the user > that they are exclusive, but we may have to keep --compaction-heuristic > around for compatibility, as an alias for --diff-heuristic=compaction). > >> diff --git a/git-add--interactive.perl b/git-add--interactive.perl >> index 642cce1..ee3d812 100755 >> --- a/git-add--interactive.perl >> +++ b/git-add--interactive.perl >> @@ -45,6 +45,7 @@ my ($diff_new_color) = >> my $normal_color = $repo->get_color("", "reset"); >> >> my $diff_algorithm = $repo->config('diff.algorithm'); >> +my $diff_indent_heuristic = $repo->config_bool('diff.indentheuristic'); >> my $diff_compaction_heuristic = >> $repo->config_bool('diff.compactionheuristic'); > > Nice touch. > > Unfortunately the mutual-exclusivity handling will probably bleed over > to here, too. > >> +/* >> + * If a line is indented more than this, get_indent() just returns this >> value. >> + * This avoids having to do absurd amounts of work for data that are not >> + * human-readable text, and also ensures that the output of get_indent fits >> within >> + * an int. >> + */ >> +#define MAX_INDENT 200 > > Speaking of absurd amounts of work, I was curious if there was a > noticeable performance penalty for using this heuristic (just because > it's a lot more complicated than the others). I couldn't detect any > differences running "git log -p --no-merges -3000" on git.git with no > heuristic, compaction, and indent. There may be other repositories that > behave more pathologically (it looks like having 20 blank lines at the > end of each hunk?), but I'd guess in most cases this will always be > drowned out in the noise of doing the actual diff. > >> +#define START_OF_FILE_BONUS 9 >> +#define END_OF_FILE_BONUS 46 >> +#define TOTAL_BLANK_WEIGHT 4 >> +#define PRE_BLANK_WEIGHT 16 >> +#define RELATIVE_INDENT_BONUS -1 >> +#define RELATIVE_INDENT_HAS_BLANK_BONUS 15 >> +#define RELATIVE_OUTDENT_BONUS -19 >> +#define RELATIVE_OUTDENT_HAS_BLANK_BONUS 2 >> +#define RELATIVE_DEDENT_BONUS -63 >> +#define RELATIVE_DEDENT_HAS_BLANK_BONUS 50 > > I see there is a comment below here mentioning that these are empirical > voodoo, but it might be worth one at the top (or just moving these below > the comment) because the comment looks like it's just associated with > the function (and these are sufficiently bizarre that anybody reading is > going to double-take on them). > >> +return 10 * score - bonus; > > I don't mind this not "10" not being a #define constant, but after > reading the exchange between you and Stefan, I think it would be nice to > describe what it is in a comment. The rest of the function is commented > so nicely that this one left me thinking "huh?" upon seeing the "10". After a night of sleep I agree with Peffs statement here, it's not about the #define, it's about the comment. (which the #define would have given in a short cryptic way in angry capital letters). I have just reread the scoring function and I think you could pull out the `score=indent` assignment (it is always assigned except for indent <0) if (indent == -1) score = 0; else score = indent; ... lots of bonus computation below, which in its current implementation have lots of "score = indent;" lines as well. Thanks, Stefan -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
On Thu, Aug 04, 2016 at 12:00:36AM +0200, Michael Haggerty wrote: > This table shows the number of diff slider groups that were positioned > differently than the human-generated values, for various repositories. > "default" is the default "git diff" algorithm. "compaction" is Git 2.9.0 > with the `--compaction-heuristic` option "indent" is an earlier, s/option/&./ > static int diff_detect_rename_default; > +static int diff_indent_heuristic; /* experimental */ > static int diff_compaction_heuristic; /* experimental */ These two flags are mutually exclusive in the xdiff code, so we should probably handle that here. TBH, I do not care that much what: [diff] compactionHeuristic = true indentHeuristic = true does. But right now: git config diff.compactionHeuristic true git show --indent-heuristic still prefers the compaction heuristic, which I think is objectively wrong. So perhaps we need a single variable: enum { DIFF_HEURISTIC_COMPACTION, DIFF_HEURISTIC_INDENT } diff_heuristic; and set it in last-one-wins fashion (it would be nice if the config and command line options were shaped the same way so it's clear to the user that they are exclusive, but we may have to keep --compaction-heuristic around for compatibility, as an alias for --diff-heuristic=compaction). > diff --git a/git-add--interactive.perl b/git-add--interactive.perl > index 642cce1..ee3d812 100755 > --- a/git-add--interactive.perl > +++ b/git-add--interactive.perl > @@ -45,6 +45,7 @@ my ($diff_new_color) = > my $normal_color = $repo->get_color("", "reset"); > > my $diff_algorithm = $repo->config('diff.algorithm'); > +my $diff_indent_heuristic = $repo->config_bool('diff.indentheuristic'); > my $diff_compaction_heuristic = > $repo->config_bool('diff.compactionheuristic'); Nice touch. Unfortunately the mutual-exclusivity handling will probably bleed over to here, too. > +/* > + * If a line is indented more than this, get_indent() just returns this > value. > + * This avoids having to do absurd amounts of work for data that are not > + * human-readable text, and also ensures that the output of get_indent fits > within > + * an int. > + */ > +#define MAX_INDENT 200 Speaking of absurd amounts of work, I was curious if there was a noticeable performance penalty for using this heuristic (just because it's a lot more complicated than the others). I couldn't detect any differences running "git log -p --no-merges -3000" on git.git with no heuristic, compaction, and indent. There may be other repositories that behave more pathologically (it looks like having 20 blank lines at the end of each hunk?), but I'd guess in most cases this will always be drowned out in the noise of doing the actual diff. > +#define START_OF_FILE_BONUS 9 > +#define END_OF_FILE_BONUS 46 > +#define TOTAL_BLANK_WEIGHT 4 > +#define PRE_BLANK_WEIGHT 16 > +#define RELATIVE_INDENT_BONUS -1 > +#define RELATIVE_INDENT_HAS_BLANK_BONUS 15 > +#define RELATIVE_OUTDENT_BONUS -19 > +#define RELATIVE_OUTDENT_HAS_BLANK_BONUS 2 > +#define RELATIVE_DEDENT_BONUS -63 > +#define RELATIVE_DEDENT_HAS_BLANK_BONUS 50 I see there is a comment below here mentioning that these are empirical voodoo, but it might be worth one at the top (or just moving these below the comment) because the comment looks like it's just associated with the function (and these are sufficiently bizarre that anybody reading is going to double-take on them). > +return 10 * score - bonus; I don't mind this not "10" not being a #define constant, but after reading the exchange between you and Stefan, I think it would be nice to describe what it is in a comment. The rest of the function is commented so nicely that this one left me thinking "huh?" upon seeing the "10". The rest looks sane to me, though I am not sure I have absorbed all the implications. IMHO the most interesting thing is the actual results, though. -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
On Wed, Aug 3, 2016 at 3:36 PM, Michael Haggertywrote: > On 08/04/2016 12:29 AM, Jacob Keller wrote: >> On Wed, Aug 3, 2016 at 3:00 PM, Michael Haggerty >> wrote: >> It seems odd to be that a line with "199" spaces and nothing else will >> return "-1" but a line with 200 spaces and nothing else will return >> 200..? Would it be safe to just return -1 in both cases (if a line is >> all spaces or starts with more than 200 spaces just return -1)? >> >>> +} >>> + > > Thanks for your feedback. > > I was implicitly assuming that such lines would have text somewhere > after those 200 spaces (or 25 TABs or whatever). But you're right, the > line could consist only of whitespace. Unfortunately, the only way to > distinguish these two cases is to read the rest of the line, which is > exactly what we *don't* want to do. > > But I think it doesn't matter anyway. Such "text" will likely never be > read by a human, so it's not a big deal if the slider position is not > picked perfectly. And remember, this whole saga is just to improve the > aesthetics of the diff. The diff is *correct* (e.g., in the sense of > applicable) regardless of where we position the sliders. > > Michael > I think in this case treating it as "all whitespace" is more natural than treating it as "200 characters with something following it" because the only thing we've found so far is all white space. Either way it's not really a big deal here. Thanks, Jake -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
On Wed, Aug 3, 2016 at 4:30 PM, Michael Haggertywrote: > This is an important point for the optimization, but less so for the > implementation of the heuristic here. I was dynamically optimizing the > ten other variables, and everything that goes into the bonus includes > one of those factors. If I had also let this factor of 10 vary, then the > net behavior of the algorithm would be completely unchanged if I would, > say, double all eleven parameters. This is bad for optimization, because > (1) it increases the search space unnecessarily, and (2) it means that > whole lines in the parameter space give identical behavior, making the > algorithm waste time searching along those lines for a minimum. > >> So another way to write it >> would be >> >> score - bonus/10; >> >> assuming the values of score and bonus are large enough. > > Score is the number of columns that some line is indented, so it can be > 0 or 1 or any other positive integer. It is not "large enough", which is > why the "10" can't be "1". Right, I should have made it more clear that it was a hypothetical rewrite, just to point out we are looking at only one of score or bonus. >> but if I look at the boni definitions ranging from -63..50 they are scaled up >> so much that the bonus may become larger than '1' unit of 'score', i.e. >> it is not an epsilon any more. Or to put it another way: >> If we were to s/10/100/ the results would be worse. > > If you would change s/10/100/ and simultaneously multiply the other > constants by 10, the end results would be unchanged. Right, so maybe a good name would be CONSTANT_SCALE_OF_ONE_INDENT as it has the meaning that a bonus of 10 is equivalent of "one indent" in the weighting. Speaking of which, do we want to "over-optimize" to make that constant a power of 2 as that is a supposedly faster multiplication? (Just asking, feel free to reject; as I can imagine the numbers itself are already magic, so why scale them with 42?^H^H^H 16?) > >> Rather the 10 describes the ratio of "advanced magic" to pure indentation >> based scoring in my understanding. > > No, it's basically just a number against which the other constants are > compared. E.g., if another bonus wants to balance out against exactly > one space of indentation, its constant needs to be 10. If it wants to > balance out against exactly 5 spaces, its constant needs to be 50. Etc. So another interpretation is that the 10 gives the resolution for all other constants, i.e. if we keep 10, then we can only give weights in 1/10 of "one indent". But the "ideal" weight may not be a multiple of 1/10, so we approximate them to the nearest multiple of 1/10. If we were to use 1000 here, we could have a higher accuracy of the other constants, but probably we do not care about the 3rd decimal place for these because they are created heuristically from a corpus that may not warrant a precision of constants with a 3rd decimal place. Stefan -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
On 08/04/2016 12:51 AM, Stefan Beller wrote: > On Wed, Aug 3, 2016 at 3:41 PM, Michael Haggertywrote: >> On 08/04/2016 12:30 AM, Stefan Beller wrote: >>> On Wed, Aug 3, 2016 at 3:00 PM, Michael Haggerty >>> wrote: >>> +return 10 * score - bonus; >>> >>> Would it make sense to define-ify the 10 as well >>> as this is the only hardcoded number in the >>> scoring function? >> >> I started answering this question by explaining why it is not important >> to *optimize* the number 10 (namely because scores are only ever >> compared against other scores, so an overall scaling factor makes no >> difference). The factor 10 only has to be large enough to provide enough >> dynamic range for the other (adjustable) parameters. > > But it only scales the score, not the bonus. Yes, that's how it provides the overall scale of the score. If it multiplied both values, then it would be completely pointless. This is an important point for the optimization, but less so for the implementation of the heuristic here. I was dynamically optimizing the ten other variables, and everything that goes into the bonus includes one of those factors. If I had also let this factor of 10 vary, then the net behavior of the algorithm would be completely unchanged if I would, say, double all eleven parameters. This is bad for optimization, because (1) it increases the search space unnecessarily, and (2) it means that whole lines in the parameter space give identical behavior, making the algorithm waste time searching along those lines for a minimum. > So another way to write it > would be > > score - bonus/10; > > assuming the values of score and bonus are large enough. Score is the number of columns that some line is indented, so it can be 0 or 1 or any other positive integer. It is not "large enough", which is why the "10" can't be "1". If the calculations were done in floating point, then the factor could be "1", because then the other factors could be made less than one. > In some prior conversation you said you take the indent and add an epsilon > for some special conditions to make one indent better than the other. > > Epsilons are usually very small compared to the rest of the equation, I should have mentioned that this heuristic is quite a bit more advanced than my original proposal to use "indent" plus some "epsilon" factors. The old discussion about epsilons is not relevant here except maybe as an inspiration and starting point for this version. > but if I look at the boni definitions ranging from -63..50 they are scaled up > so much that the bonus may become larger than '1' unit of 'score', i.e. > it is not an epsilon any more. Or to put it another way: > If we were to s/10/100/ the results would be worse. If you would change s/10/100/ and simultaneously multiply the other constants by 10, the end results would be unchanged. > Rather the 10 describes the ratio of "advanced magic" to pure indentation > based scoring in my understanding. No, it's basically just a number against which the other constants are compared. E.g., if another bonus wants to balance out against exactly one space of indentation, its constant needs to be 10. If it wants to balance out against exactly 5 spaces, its constant needs to be 50. Etc. Michael -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
On Wed, Aug 3, 2016 at 3:41 PM, Michael Haggertywrote: > On 08/04/2016 12:30 AM, Stefan Beller wrote: >> On Wed, Aug 3, 2016 at 3:00 PM, Michael Haggerty >> wrote: >> >>> +return 10 * score - bonus; >> >> Would it make sense to define-ify the 10 as well >> as this is the only hardcoded number in the >> scoring function? > > I started answering this question by explaining why it is not important > to *optimize* the number 10 (namely because scores are only ever > compared against other scores, so an overall scaling factor makes no > difference). The factor 10 only has to be large enough to provide enough > dynamic range for the other (adjustable) parameters. But it only scales the score, not the bonus. So another way to write it would be score - bonus/10; assuming the values of score and bonus are large enough. In some prior conversation you said you take the indent and add an epsilon for some special conditions to make one indent better than the other. Epsilons are usually very small compared to the rest of the equation, but if I look at the boni definitions ranging from -63..50 they are scaled up so much that the bonus may become larger than '1' unit of 'score', i.e. it is not an epsilon any more. Or to put it another way: If we were to s/10/100/ the results would be worse. Rather the 10 describes the ratio of "advanced magic" to pure indentation based scoring in my understanding. > > But I think you are asking a simpler question: should we give this > constant a name rather than hardcoding it? I don't see a strong reason > for or against, so I'll give it a name in the next version, as you suggest. Thanks, Stefan -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
On 08/04/2016 12:30 AM, Stefan Beller wrote: > On Wed, Aug 3, 2016 at 3:00 PM, Michael Haggertywrote: > >> +return 10 * score - bonus; > > Would it make sense to define-ify the 10 as well > as this is the only hardcoded number in the > scoring function? I started answering this question by explaining why it is not important to *optimize* the number 10 (namely because scores are only ever compared against other scores, so an overall scaling factor makes no difference). The factor 10 only has to be large enough to provide enough dynamic range for the other (adjustable) parameters. But I think you are asking a simpler question: should we give this constant a name rather than hardcoding it? I don't see a strong reason for or against, so I'll give it a name in the next version, as you suggest. Michael -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
On 08/04/2016 12:29 AM, Jacob Keller wrote: > On Wed, Aug 3, 2016 at 3:00 PM, Michael Haggertywrote: >> +/* >> + * If a line is indented more than this, get_indent() just returns this >> value. >> + * This avoids having to do absurd amounts of work for data that are not >> + * human-readable text, and also ensures that the output of get_indent fits >> within >> + * an int. >> + */ >> +#define MAX_INDENT 200 >> + >> +/* >> + * Return the amount of indentation of the specified line, treating TAB as 8 >> + * columns. Return -1 if line is empty or contains only whitespace. Clamp >> the >> + * output value at MAX_INDENT. >> + */ >> +static int get_indent(xrecord_t *rec) >> +{ >> + long i; >> + int ret = 0; >> + >> + for (i = 0; i < rec->size; i++) { >> + char c = rec->ptr[i]; >> + >> + if (!XDL_ISSPACE(c)) >> + return ret; >> + else if (c == ' ') >> + ret += 1; >> + else if (c == '\t') >> + ret += 8 - ret % 8; >> + /* ignore other whitespace characters */ >> + >> + if (ret >= MAX_INDENT) >> + return MAX_INDENT; > > Should we return -1 here? > >> + } >> + /* >> +* We have reached the end of the line without finding any non-space >> +* characters; i.e., the whole line consists of trailing spaces, >> which we >> +* are not interested in. >> +*/ >> + return -1; > > It seems odd to be that a line with "199" spaces and nothing else will > return "-1" but a line with 200 spaces and nothing else will return > 200..? Would it be safe to just return -1 in both cases (if a line is > all spaces or starts with more than 200 spaces just return -1)? > >> +} >> + Thanks for your feedback. I was implicitly assuming that such lines would have text somewhere after those 200 spaces (or 25 TABs or whatever). But you're right, the line could consist only of whitespace. Unfortunately, the only way to distinguish these two cases is to read the rest of the line, which is exactly what we *don't* want to do. But I think it doesn't matter anyway. Such "text" will likely never be read by a human, so it's not a big deal if the slider position is not picked perfectly. And remember, this whole saga is just to improve the aesthetics of the diff. The diff is *correct* (e.g., in the sense of applicable) regardless of where we position the sliders. Michael -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
On Wed, Aug 3, 2016 at 3:00 PM, Michael Haggertywrote: > +return 10 * score - bonus; Would it make sense to define-ify the 10 as well as this is the only hardcoded number in the scoring function? -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
On Wed, Aug 3, 2016 at 3:00 PM, Michael Haggertywrote: > +/* > + * If a line is indented more than this, get_indent() just returns this > value. > + * This avoids having to do absurd amounts of work for data that are not > + * human-readable text, and also ensures that the output of get_indent fits > within > + * an int. > + */ > +#define MAX_INDENT 200 > + > +/* > + * Return the amount of indentation of the specified line, treating TAB as 8 > + * columns. Return -1 if line is empty or contains only whitespace. Clamp the > + * output value at MAX_INDENT. > + */ > +static int get_indent(xrecord_t *rec) > +{ > + long i; > + int ret = 0; > + > + for (i = 0; i < rec->size; i++) { > + char c = rec->ptr[i]; > + > + if (!XDL_ISSPACE(c)) > + return ret; > + else if (c == ' ') > + ret += 1; > + else if (c == '\t') > + ret += 8 - ret % 8; > + /* ignore other whitespace characters */ > + > + if (ret >= MAX_INDENT) > + return MAX_INDENT; Should we return -1 here? > + } > + /* > +* We have reached the end of the line without finding any non-space > +* characters; i.e., the whole line consists of trailing spaces, > which we > +* are not interested in. > +*/ > + return -1; It seems odd to be that a line with "199" spaces and nothing else will return "-1" but a line with 200 spaces and nothing else will return 200..? Would it be safe to just return -1 in both cases (if a line is all spaces or starts with more than 200 spaces just return -1)? > +} > + -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
Some groups of added/deleted lines in diffs can be slid up or down, because lines at the edges of the group are not unique. Picking good shifts for such groups is not a matter of correctness but definitely has a big effect on aesthetics. For example, consider the following two diffs. The first is what standard Git emits: --- a/9c572b21dd090a1e5c5bb397053bf8043ffe7fb4:git-send-email.perl +++ b/6dcfa306f2b67b733a7eb2d7ded1bc9987809edb:git-send-email.perl @@ -231,6 +231,9 @@ if (!defined $initial_reply_to && $prompting) { } if (!$smtp_server) { + $smtp_server = $repo->config('sendemail.smtpserver'); +} +if (!$smtp_server) { foreach (qw( /usr/sbin/sendmail /usr/lib/sendmail )) { if (-x $_) { $smtp_server = $_; The following diff is equivalent, but is obviously preferable from an aesthetic point of view: --- a/9c572b21dd090a1e5c5bb397053bf8043ffe7fb4:git-send-email.perl +++ b/6dcfa306f2b67b733a7eb2d7ded1bc9987809edb:git-send-email.perl @@ -230,6 +230,9 @@ if (!defined $initial_reply_to && $prompting) { $initial_reply_to =~ s/(^\s+|\s+$)//g; } +if (!$smtp_server) { + $smtp_server = $repo->config('sendemail.smtpserver'); +} if (!$smtp_server) { foreach (qw( /usr/sbin/sendmail /usr/lib/sendmail )) { if (-x $_) { This patch teaches Git to pick better positions for such "diff sliders". The original Git code basically always shifted such "sliders" as far down in the file as possible. The only exception is when the slider can be aligned with a group of the lines in the other file, in which case Git favors one add+delete block over one add and a slightly offset delete block. This naive algorithm often yields ugly diffs. Commit d634d61ed6 improved the situation somewhat by preferring to position add/delete groups to make their last line a blank line, when that is possible. This heuristic does more good than harm, but can only help if there are blank lines in the right places. It still leaves a lot of ugly diffs. This commit implements a new and much better heuristic for picking optimal "slider" positions using the following approach: First observe that each hypothetical positioning of a diff slider introduces two splits: one between the context lines preceding the group and the first added/deleted line, and the other between the last added/deleted line and the first line of context following it. It tries to find the positioning that creates the least bad splits. Splits are evaluated based only on the presence and locations of nearby blank lines, and the indentation of lines near the split. Basically, it prefers to introduce splits between lines that are indented less and adjacent to blank lines. In more detail: 1. It measures the following characteristics of a proposed splitting position: * the number of blank lines above the proposed split * whether the line directly after the split is blank * the number of blank lines following that line * the indentation of the nearest non-blank line above the split * the indentation of the line directly below the split * the indentation of the nearest non-blank line after that line 2. It combines these attributes using a bunch of empirically-optimized weighting factors to estimate a score of the "badness" of splitting the text at that position. 3. It defines the score for a positioning of a diff slider to be the sum of the scores for the splits at the top and bottom of the slider. 4. It computes scores like this for all possible positions of the diff slider, and selects the position with the smallest "badness" score. The weighting factors were found by collecting a corpus of code samples in various programming languages, deciding "by eye" the best positioning for about 2700 diff sliders, then optimizing the weights against this corpus to get the best agreement with the manually-determined values. (One caveat is that the same corpus was used for both optimization and testing.) The resulting numbers of non-optimal diff groups were as follows: | repository | default | compaction | indent | optimized | | | --- | -- | -- | - | | ant | 225 |102 | 7 | 5 | | bugzilla | 208 | 81 | 14 | 8 | | couchdb | 44 | 24 | 13 | 4 | | docker | 180 |160 | 29 | 7 | | git | 446 | 59 | 27 | 4 | | ipython | 358 |163 | 61 |11 | | junit| 146 | 67 | 5 | 1 | | nodejs | 489 | 78 | 12 | 2 | | phpmyadmin | 330 | 49 | 1 | 0 | | test-more| 15 | 2 | 2 | 0