On 09-Jan-16, Justin M. Keyes wrote:
> On Sat, Jan 9, 2016 at 7:29 PM, Olaf Dabrunz <[email protected]> wrote:
> > On 09-Jan-16, Olaf Dabrunz wrote:
> >> Note that the existing diff tools that we all use every day bring in
> >> heuristical speed improvements, and do pre- and postprocessing to
> >> improve the results.
> >
> > To clarify: the pre- and post-processing improve the result of the diff,
> > regardless of whether any speedup heuristics are used in the comparison
> > itself.
> >
> > Basically, the pre-processing makes arrangements to skip uninteresting
> > lines when comparing the texts, so that the relative weight of matches
> > between interesting lines increases, making it more likely to match-up
> > interesting lines.
> >
> > The post-processing chooses between several same-weight solutions, and
> > tries to select a solution that is typically more likely to represent
> > the actual change.
>
> Olaf, what you describe is very exciting. Why not post a pull request?
Thank you! I am excited as well. :)
A pull request would mean that licensing issues need to be resolved, so
parts of the code need to be re-written from scratch. Please see below.
> What exactly do you mean by "license issues": is some part of the code
> proprietary, or GPL?
I cannot write short answers. Please bear with me.
-- LCS Implementation (The Core Algorithm)
The LCS implementation that I use at the moment, that is the core
algorithm of the diff, is a verbatim copy of a file from a GPL project.
This needs to be replaced with a different file under a different
license. I am not 100% sure (IANAL), but I believe the interface could
be re-used. It consists of a few #defines and a single function call.
I want to re-implement this from Myers' 1985 paper, and I have
implemented and analyzed LCS algorithms before, and adapted them for
research purposes (including Myers' famous algorithm).
The boundary conditions and corner cases need to be done right and
tested well. They are real brain teasers.
But
- It is difficult to ignore the additional ideas that went into the
GPL project's implementation of Myers' LCS algorithm, and which
contribute to the quality of its results. I have not seen any
papers published about these, so a re-implementation must probably
be based on a write-down of these ideas. They include:
- two optional speed heuristics, one is used by default: much
faster on larger files
- a simple way to create several instances of the algorithm for
the comparison of different kinds of sequences (not required
for vim though)
- pre- and post-processing stages that improve the diff results
-- these are not part of the core algorithm, but the
post-processing is included in my integration code, see below
- an elegant, readable and versatile implementation
I have already written descriptions of how these ideas work,
except for the pre-processing part.
But I fear that a re-implementation would be similar to the GPLed
implementation.
But I may be wrong. :)
- It takes time to implement this and to get it right, testing,
debugging ...
- Most of the other implementations I saw add ideas that are not
really helpful, at least not for our purposes. Some even add
unnecessary overhead. They are also more difficult to adapt or
re-use, some rely on features in some other implementation
language, and they are not as tried and tested.
That being said, the implementation Bram mentioned in his mail seems to
be worth a closer look. It may be the best re-implementation so far,
and may be a good basis for a C version.
But it should also include the speed heuristics and the pre- and post-
processing done in the GPLed project.
-- Vim Integration Code
The integration code is mostly written by me, aimed at inclusion in vim,
when usable for everyone. It also contains 100 - 150 lines of code
adapted from the GPL project (and about 2400 new lines of code, not
counting large comments). The GPLed parts need to be rewritten to make
all of it available under the vim license.
Also, the pre-processing needs to be implemented (and adapted to use my
"string sequences".)
I wrote the patch within one month last year, mainly because it was
there in my head and it eventually needed to get written. I also needed
a fast lcs() function for several reasons, I was using a Perl module
before that.
I do not think I should publish any of this unless I can publish all of
it under a palatable license, to avoid license confusion.
But since this keeps coming up, the purpose of my mail was to show that
there is work in progress here.
Also, it would be interesting to know what people here think about it,
and how the license issues should be sorted out.
By the way, I can also use the code as a back end for vim diff: the
patch implements the vim function lcs() that I can call from a small
vim script function that can be invoked by 'diffexpr'.
It works and is relatively fast, but since the pre-processing is not
implemented yet, calling aforementioned well-known diff program
still gives better results, at least in some cases.
Other than that, the lcs() interface exposes all configuration
options of the diff algorithm and the post-processing. And it adds
the optional tokenizer layer.
When the implementation is more complete (pre-processing included,
adapted code rewritten), a direct integration into vim's diff code
becomes interesting. Bram already noted that the LCS results can
directly be written into diff_T blocks. I have worked with diff_T
blocks for another patch (for my research project, see below), so I
do not expect any major problems with this part of the integration.
And my code already supports the comparison of buffer text and the
comparison of characters in two strings (read: comparison of two
lines).
The update strategy code needs to be added though, as noted by Bram:
keep calling the diff for every update, unless that takes too long,
then fall back on the faster workarounds.
I also have one or two ideas in that area.
-- Diff Research
The last part I was writing about is about proprietary research.
This is also not ready for publishing yet.
I mentioned this mainly to give Bram a reason to keep the code that
interfaces with an external diff program, because he said it could
probably be dropped.
If my research is any good, and the results look promising so far (and I
am already using the prototype for my daily work), then there will be
an external diff at some point and people may want to call it from vim.
--
Olaf Dabrunz (oda <at> fctrace.org)
--
--
You received this message from the "vim_dev" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php
---
You received this message because you are subscribed to the Google Groups
"vim_dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
For more options, visit https://groups.google.com/d/optout.