When a new paragraph is inserted, diff doesn't detect that the 
previous first paragraph is now the second, so it reports much 
larger changes than actually happened. Why is that? How can it be 
fixed?

I'm talking about Wikipedia here. Are there different 
implementations of diff in various instances of MediaWiki?
How is it implemented? Does it use UNIX/Linux diff, wdiff, or some 
other algorithm?

Here is an example, where a bullet list of works (discography) was 
extended:
http://sv.wikipedia.org/w/index.php?title=Staffan_M%C3%A5rtensson&diff=10522416&oldid=10304813

As you can see, the Brahms Clarinet Sonatas were pushed from 1st to 
2nd position, but diff reports this as a total change. Instead, 
the record label (Channel Sound) is reported as unchanged text.
Yes, the phrase "med Erik Lanninger" was also changed to "Med E 
Lanninger", but that is a much smaller change than the one 
reported.
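
To make the question concrete, here is a minimal sketch in Python 
using the standard difflib module. It is not MediaWiki's code, and 
the list items are invented placeholders rather than the actual 
article text. It shows that a line-level matcher reports a pure 
insertion as an insertion, but once the pushed-down line is also 
edited slightly, exact line matching no longer pairs it with its 
old version and the whole region is reported as replaced.

```python
import difflib

def line_opcodes(old_lines, new_lines):
    """Return the non-equal opcodes of a line-level comparison."""
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]

# Hypothetical list items, not the real discography.
old = ["* Brahms sonatas med Erik Lanninger", "* Mozart concerto"]

# Case 1: a new first item is added, nothing else changes.
inserted = ["* Nielsen concerto"] + old

# Case 2: the same insertion, plus a small edit to the pushed-down line.
inserted_and_edited = [
    "* Nielsen concerto",
    "* Brahms sonatas Med E Lanninger",
    "* Mozart concerto",
]

print(line_opcodes(old, inserted))             # one "insert" opcode
print(line_opcodes(old, inserted_and_edited))  # one "replace" opcode
```

In the second case a word-level comparison inside the replaced 
region would be needed to show that only two words changed.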

At my website runeberg.org, where scanned books are proofread,
I have implemented the diff function using wdiff with some
extra features. An example is shown here,
http://runeberg.org/rc.pl?action=diff&src=nfbf/0734

Since a common edit is to change "word" to "<b>word</b>", I want 
changes in XML-like markup to be reported separately, which you 
can see is the case at the bottom of that diff. But wdiff splits 
words strictly on whitespace, so I had to work around this. The 
quite naive and non-optimized (but working) Perl code looks like 
this (yes, versions are maintained by plain old RCS):

    # A change from "foo bar" to "<b>foo bar" is seen by wdiff as a
    # change of the word "foo" into "<b>foo".  But we want to see this
    # as the addition of the HTML/XML tag "<b>".  To this effect, we
    # pad spaces around all "<" and ">" in the original text versions,
    # i.e. " <b> foo bar" before calling wdiff.  The output from wdiff
    # will be " <span><b></span> foo bar", where the padding spaces
    # are outside of the <span> tags.  This has to be taken into
    # consideration when removing the space padding, below.

    my $cmd = "umask 2"
        . " && co -p1.$rev1 $filename 2>/dev/null | sed 's/</ </g;s/>/> /g' >$tmp1"
        . " && co -p1.$rev2 $filename 2>/dev/null | sed 's/</ </g;s/>/> /g' >$tmp2"
        . " && wdiff -n -s -w '<span class=\"del\">' -x '</span>' "
        . " -y '<span class=\"ins\">' -z '</span>' $tmp1 $tmp2 |";
    # The trailing "|" makes open() run the pipeline and read its output.
    if (open(my $fh, $cmd)) {
        local $/ = undef;    # slurp the whole output at once
        $diff = <$fh>;
        close($fh);
    } else {
        debug_log("rc.pl: Failed with $cmd");
    }
    $diff = html_encode($diff);
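
The padding trick described in the comment above can be tried in 
isolation. The following is only a sketch in Python, with difflib 
standing in for wdiff; the function names pad_tags and word_diff 
are made up for this illustration and are not part of rc.pl.

```python
import difflib
import re

def pad_tags(text):
    """Put a space before "<" and after ">", like the sed step above."""
    return re.sub(r">", "> ", re.sub(r"<", " <", text))

def word_diff(old, new, pad=True):
    """Word-level diff; returns (kind, words) pairs, kind in equal/del/ins."""
    if pad:
        old, new = pad_tags(old), pad_tags(new)
    a, b = old.split(), new.split()
    result = []
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            result.append(("equal", a[i1:i2]))
            continue
        if i1 < i2:                      # words only in the old version
            result.append(("del", a[i1:i2]))
        if j1 < j2:                      # words only in the new version
            result.append(("ins", b[j1:j2]))
    return result

# Without padding, "foo" appears to change into "<b>foo".
print(word_diff("foo bar", "<b>foo bar", pad=False))
# With padding, only the tag "<b>" is reported as added.
print(word_diff("foo bar", "<b>foo bar"))
```

Splitting on whitespace discards the padding, so this sketch does 
not need the step that removes the padding spaces again; the wdiff 
pipeline keeps the original whitespace and therefore does.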


Hope this was helpful.


-- 
  Lars Aronsson
  Aronsson Datateknik - http://aronsson.se
