On 8/13/22 17:36, Ray Gardner wrote: > Rob, I don't know if you've done much work on diff.c since your Aug. 8 > post that had attachment.c, but I tried that program and it apparently > gets into an infinite loop with a certain pair of files. I can send > them to you or attach them to a post here if you like.
Sorry, I missed this getting stuck in the moderation queue. (Just went through to whitelist Weizhao and pass through his two emails, and saw I'd missed two other old ones in the ocean of spam. I think I already dealt with the BSD issue, and while I think I already replied to this one too, it's a good excuse for an update.) I pushed the diff work to a temporary branch at https://github.com/landley/toybox/commits/branch ala https://github.com/landley/toybox/blob/branch/toys/pending/diff.c which is not load bearing yet but getting there, so people can at least see it. It passes your test case now (diffs the two files without segfaulting, producing reasonable hunks), although the output still isn't quite right in a couple places: ---- dif1.c 2022-08-26 00:54:19.827964685 -0500 -+++ dif2.c 2022-08-26 00:54:41.231964278 -0500 -@@ -8,30 +8,15 @@ +diff dif1.c dif2.c +--- dif1.c ++++ dif2.c +@@ -8,29 +8,14 @@ I need to add the timestamp info to the header output, and only output the "diff" line when comparing directory contents (and possibly --from-file and --to-file). Plus THAT ONE HUNK (and none of the others!) is off by one in the lengths and I should figure out why. (The emitted lines match but the @@ header line doesn't, which can't be right...) @@ -17,8 +18,8 @@ } -#if 00 --void putqln(LINE *pln, DFILE *file) --{ + void putqln(LINE *pln, DFILE *file) + { ... - void putqln(LINE *pln, DFILE *file) - { - if ( ! pln->lneof ) { +-void putqln(LINE *pln, DFILE *file) +-{ That's an interesting corner case: the three lines that end the hunk also occur verbatim in order earlier in the removed section, so matching them earlier and emitting them as context lines at the start, then ending the hunk with three subtraction lines is technically the same transformation (even the same number of matched lines)... but it violates the expectation that a unified diff hunk starts and ends with context lines when not at the start or end of the file. (The find_hunks() did delineate the hunk with appropriate context lines, but dump_hunk() is printing the hunk via a simple greedy algorithm that takes earliest matches, and this is an example of the two disagreeing.) That's where I left off last night. I need to decide if I should just do a heuristic to peel off up to -U many context lines from the end so the "greedy" algorithm doesn't get the chance to eat them earlier than it should, or if I should teach dump_hunk() to symmetrically parse from start and end and stop in the middle. (It's getting the STARTING context lines right because greedy match grabbing does the right thing there. If it started at the end it would get the end right. Maybe it should alternate?) I already have to split "identify context lines to keep" and the "output result" into two passes to properly implement -B, because -B (suppress hunks that only change blank lines) is kind of annoying: $ diff -Bu <(echo $'one\n\ntwo\n\nthree') <(echo $'one\n\ntwo\nthree') $ diff -Bu <(echo $'one\ntwo\n\nthree') <(echo $'one\n\ntwo\nthree') --- /dev/fd/63 +++ /dev/fd/62 @@ -1,4 +1,4 @@ one -two +two three That's ALMOST the same diff, and in both cases the change is addition/removal of blank lines. In the first one all the changes are identified as being in blank lines so it suppresses outputting the hunk. But in the second one debian's diff (not mine, the host system!) decided that "two" is what moved instead of the blank line before/after it, and thus the hunk gets output even though identifying the blank line as what moved would produce a smaller patch even without -B, and would also suppress it with -B. I haven't seen _any_ of the various diff algorithms pay attention to THAT yet. :) Anyway, breaking it into two passes means I'm no longer printing the lines as I go so identifying them out of order isn't a big deal either. But the heuristic might still be smaller...? *shrug* It's corner cases all the way down, this stuff. Last year when Elliott gave me his todo list I went: >> diff (--line-format, -I) > > If I keep the current algorithm instead of switching to Bram Cohen's patience > diff, it's only a few days of cleanup. (Implementing patience diff shouldn't > be that hard either, but I haven't sat down and scoped it because my todo list > runneth over.) And I turned out to be profoundly wrong in that time estimate. Way more of a learning curve than I expected. Sorry it's taken so long... Rob _______________________________________________ Toybox mailing list [email protected] http://lists.landley.net/listinfo.cgi/toybox-landley.net
