On 1/31/25 11:18, Ray Gardner wrote:
On Fri, Jan 31, 2025 at 7:19 AM Rob Landley <r...@landley.net> wrote:

On 1/30/25 17:59, Ray Gardner wrote:
[patience diff] may have predated the "histogram" algorithm which is
surprisingly hard to google for...
...
I was really hoping I could implement just ONE algorithm and call it good.
Right now it looks like "histogram" would be that one (it's an improvement
I just started work on my diff branch again after
https://github.com/landley/toybox/issues/489#issuecomment-2588079658
with the goal of replacing code I don't understand with code I do
understand.
...
If I throw away what I've done and replace the code I understand with code I
don't understand, it will go behind "man.c" on the todo list.

As I quoted, you indicated you wanted a histogram diff.

I did. For many years. Heck, view source on https://landley.net/notes-2008.html and the commented out todo list at the top has an entry about it. (I was following his livejournal when he posted it, that's how I found out about it.)

But I also wanted a simple streaming diff that worked like the opposite of "patch", doing one pass scanning along two files and zipping them together without having to load both simultaneously into memory or requiring lseek().

When I sat down to try to implement patience from the old livejournal entry... it was very much not that. And what it did do was computationally expensive in multiple ways, and needed a fallback algorithm anyway.

All these diff algorithms have pathological edge cases, and when fiddling with them I get distracted by what the failure modes would be and what inputs would trigger them. Max memory use, max cpu time, hunks being way bigger than they need to be, hunks being unreadable salad when the actual change isn't...

Back when I was shoveling through this the first time I was wondering if there's some sort of diff test corpus out there to compare algorithms against, but "implementing diff in multiple ways" doesn't seem to be a large enough community for such a thing. (I'd really hoped Bram Cohen had test inputs showing conventional diff vs his new algorithm, but when I went to look I couldn't find them. Maybe there were some in 2007 and they fell off the net?)

I was never worried about producing something that works on a simple test input, I was worried about figuring out what inputs would break what I'd written, in what ways. Where "break" isn't always "wrong answer", but "expensive to compute" or "results are unnecessarily hard for a human to read".

I put a pretty
detailed explanation at the end of my submission (can see it in the
patch) and more, including pseudocode, in my blog post
https://raygard.net/2025/01/28/how-histogram-diff-works/ .

Ooh, very nice. I have the tab open and have read through the first quarter of it, and need more caffeine and a nap before tackling this.

It's (I hope)
easier to understand than Hunt-McIlroy or Myers, as it's not based on
any CS theory but on an empirical method that just works pretty well.
Well enough that Torvalds prefers it for patch submissions
(https://lkml.org/lkml/2023/5/7/206).
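
Sidebar for anyone skimming the thread: the core trick, as I understand it from descriptions of the algorithm (read Ray's blog post for the real thing, I'm paraphrasing and the names below are made up), is to count how often each line occurs, pick a line both sides share with the lowest count (ideally a unique one) as an anchor, split both files there, and recurse into the pieces on either side. Brute force sketch; real implementations hash the lines and cap the occurrence count they'll accept:

#include <string.h>

// Return the index in a[] of the best split anchor, or -1 if the two
// sides share no lines at all.
int pick_anchor(char **a, int alen, char **b, int blen)
{
  int best = -1, best_count = 0, i, j;

  for (i = 0; i < alen; i++) {
    int count = 0, shared = 0;

    for (j = 0; j < alen; j++) count += !strcmp(a[i], a[j]);
    for (j = 0; j < blen; j++) shared |= !strcmp(a[i], b[j]);
    if (shared && (best < 0 || count < best_count)) {
      best = i;
      best_count = count;
    }
  }

  return best;
}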

What is "hunky-dory code" in this context?

Just a reference to your function names hunky() and dory() and their
friends in your hunk detection (etc.) code.

Ah, ok. Not a bad name for it, I just hadn't connected the dots. (I have terrible naming hygiene during development, especially while sleep deprived. I went through later to remove everything named after Roger Zelazny's "nine princes in amber" series from patch.c and all the beatles references in ps.c, but that's a cleanup pass at the end.)

The core of that algorithm is finding the end of the next hunk, either the next X matching lines or EOF on either input. The rest of it boils down to counting consecutive pairs of matching lines (cheap) or figuring out what to emit once you've delineated a hunk (tricksy but not expensive).

The brute force way of finding the next matching line pair is N^2 in the size of the current hunk: just read a line from one of the inputs and scan back over the previously seen lines of the other to see if it matches. If so, can I read MORE lines from that input source to get X consecutive matches at that point in the other input source? (At which point I've ended the hunk, and generally one of the inputs has read ahead and has some buffered lines.) Alternate which input you're reading a new line from, because we dunno which one did an insert and which one did a deletion.
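
Strawman of that search (made-up names, naive buffer growth, none of the actual diff.c plumbing), mostly to show where the N^2 comes from; the "X consecutive matches" confirmation would start from the pair this returns:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct half {
  FILE *fp;    // one input file (caller zero-initializes the rest)
  char **line; // lines buffered since the current hunk started
  int len;     // how many lines are buffered
  int eof;
};

// Read one more line into h, returning its index, or -1 at EOF.
static int read_one(struct half *h)
{
  char *s = 0;
  size_t n = 0;

  if (h->eof || getline(&s, &n, h->fp) < 0) {
    free(s);
    h->eof = 1;

    return -1;
  }
  h->line = realloc(h->line, (h->len+1)*sizeof(*h->line));
  h->line[h->len++] = s;

  return h->len-1;
}

// Alternate sides until a newly read line matches something the other side
// already buffered (or both sides hit EOF). On a match, *ai and *bi get the
// matching indexes into a->line[] and b->line[].
static int next_match(struct half *a, struct half *b, int *ai, int *bi)
{
  struct half *from = a, *other = b;

  while (!a->eof || !b->eof) {
    int i = read_one(from), j;

    if (i >= 0) for (j = other->len-1; j >= 0; j--) {
      if (!strcmp(from->line[i], other->line[j])) {
        *ai = (from == a) ? i : j;
        *bi = (from == a) ? j : i;

        return 1;
      }
    }
    from = (from == a) ? b : a;
    other = (other == a) ? b : a;
  }

  return 0;  // EOF on both sides: the hunk runs to end of file.
}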

That said, individual hunks larger than a couple hundred lines are uncommon. There IS pathological input that would make this slow and use a lot of memory (basically diff -u <(seq 1 1000000) <(seq 1000000 -1 1) ) but it's still only N^2 in the number of input lines (not any sort of X^N shenanigans, it should finish even given a gigabyte of differing input on each end, just not conveniently). And its pathological memory use case is the base memory use of the other algorithms. And I COULD apply hashing to speed the search up (modulo -i and -b and -I would have to apply to the hash too in order to be of any benefit, and don't ask me how -I regex is supposed to hash, "cut out the match and replace it with an empty string" isn't actually what the man page describes -I as doing...). But I have yet to find a real world input that would actually benefit from it rather than net lose from the overhead, and NOT doing it is simpler.

The positive tradeoff is it doesn't care about the size of the _file_ because the matching parts are almost free. Comparing two matching lines and discarding them when not currently in a differing hunk is O(1) ("yup, they still match up, keep going"), and everything before the current hunk has already been output/discarded so makes no difference to processing the current hunk. And that SHOULD be optimizing for the common case, I think?
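
Same caveats as above (made-up name, not the actual diff.c code, and ignoring the rolling buffer of the last few matching lines you'd keep around to print as leading context), but the in-sync stretch of the two files boils down to roughly this:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

// Discard matching line pairs until the inputs diverge (or one ends),
// returning how many pairs matched. The caller would then collect a hunk
// (via something like next_match() above) and come back here.
long skip_common(FILE *a, FILE *b)
{
  char *la = 0, *lb = 0;
  size_t na = 0, nb = 0;
  long count = 0;

  for (;;) {
    ssize_t ra = getline(&la, &na, a), rb = getline(&lb, &nb, b);

    // Divergence or EOF: the lines just read belong to the hunk collector.
    if (ra < 0 || rb < 0 || strcmp(la, lb)) break;
    count++;  // still in sync: O(1) work, nothing kept around
  }
  free(la);
  free(lb);

  return count;
}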

Note that the X matching lines needed to end the current hunk is actually 2x+1, where x is the usual start/end context, because if your hunks have 3 context lines but your files have "change, 5 same lines, another change" then that's one hunk with 5 consecutive unchanged lines in the middle. No really:

$ diff -u <(printf '%s\n' a b c d e f g h i j) <(printf '%s\n' a j c d e f g h k j)
--- /dev/fd/63  2025-01-31 13:13:37.357744367 -0600
+++ /dev/fd/62  2025-01-31 13:13:37.357744367 -0600
@@ -1,10 +1,10 @@
 a
-b
+j
 c
 d
 e
 f
 g
 h
-i
+k
 j
landley@driftwood:~/toybox/clean2$ diff -u <(printf '%s\n' a b c d e f g h i i j) <(printf '%s\n' a j c d e f g h i k j)
--- /dev/fd/63  2025-01-31 13:13:54.946009279 -0600
+++ /dev/fd/62  2025-01-31 13:13:54.950009339 -0600
@@ -1,5 +1,5 @@
 a
-b
+j
 c
 d
 e
@@ -7,5 +7,5 @@
 g
 h
 i
-i
+k
 j

This is for two reasons: 1) ending a hunk and starting a new one produces more output than just letting it run, and 2) patch doesn't reprocess its output and thus can't apply overlapping hunks. (It also needs hunks to occur in order. Strangely that's not just a limitation of MY patch, other patch programs consider hunks out of order within the same file illegal too. Concatenating patches works because they close and re-open the file, which starts over at the beginning. That's part of the reason I did mine my way, because as long as that limitation's there anyway...)

Yes, the 2x case being treated as contiguous is a weird edge case. I would have thought 2x was where it would break (the point where patch doesn't have to reprocess output of a previous hunk to apply the next one), but it's 2x+1 (because that's where the output size is equivalent to adding an @@ resync line), which means a hunk can't be considered "done" until we've seen 2x+1 matching trailing lines (or EOF on at least one input).

Once you've loaded and terminated a hunk, you know how many leading and trailing common lines the resulting hunk should output (leading common lines naturally match up, but I mentioned needing to treat TRAILING common lines specially because they don't automatically zip back together all the time, although I have yet to find my test case where they didn't), and if they don't it sends the wrong signal to patch (which COUNTS leading/trailing match lines to detect "must match SOF/EOF").
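
The bookkeeping boils down to roughly this (names made up, not the actual code): a hunk is "done" after 2x+1 consecutive matching trailing pairs or EOF, only up to x of the buffered matches at each end actually get emitted, and a shorter-than-x lead or trail then only ever happens at SOF/EOF, which is exactly what patch counts to detect that:

static int min(int a, int b) { return a < b ? a : b; }

// A hunk can end once we've seen 2*context+1 consecutive matching trailing
// line pairs, or hit EOF on an input.
int hunk_done(int trailing_matches, int context, int hit_eof)
{
  return hit_eof || trailing_matches >= 2*context+1;
}

// Of the matching lines buffered at either end of the hunk, emit at most
// "context" of them. Having fewer than "context" available means the hunk
// touches SOF (leading) or EOF (trailing), which is what patch infers.
int emit_context(int buffered_matches, int context)
{
  return min(buffered_matches, context);
}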

That was the pending bug I needed to fix back when I mothballed this, because I got too many "why are you ignoring code that works to go off and putter with something irrelevant, shame on you wasting our time" pokes and walked away for a while. The challenging part wasn't fixing the bug, the problem was delineating the edge cases and coming up with all the tests for them. Because you can mix SOF and EOF into there too, and it should handle them all right, and the hard part isn't handling them right, it's making sure you've tested each one so you can PROVE it's handling them right. (I suspect I blogged about this at the time...)

And then you have a pile of lines in between that need to be output with +, space, and -. That's the part where I thought patience/histogram might come in (and potentially the constraint of operating WITHIN a known hunk might cancel out the computational expense), but doing patience _within_ already delineated hunks turned out not to be what it was designed for at all.
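
One conventional way to make those +/- decisions within an already delineated hunk (which may or may not be where the code ends up, this is an illustration with made-up names, not the diff.c emitter) is a plain longest common subsequence over just the hunk's lines, which stays cheap because hunks are usually small. Sketch, assuming the buffered lines kept their trailing newlines:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// a[0..alen) and b[0..blen) are the differing middle lines of one hunk.
void emit_hunk_body(char **a, int alen, char **b, int blen)
{
  int *len = calloc((alen+1)*(blen+1), sizeof(int)), i, j;
#define LCS(i, j) len[(i)*(blen+1)+(j)]

  // Classic LCS table, filled backward so walking forward recovers it.
  for (i = alen-1; i >= 0; i--) for (j = blen-1; j >= 0; j--)
    LCS(i, j) = !strcmp(a[i], b[j]) ? LCS(i+1, j+1)+1
      : LCS(i+1, j) > LCS(i, j+1) ? LCS(i+1, j) : LCS(i, j+1);

  // Walk forward: matches are context, otherwise prefer emitting the
  // deletion first when it's a tie, which matches conventional output.
  for (i = j = 0; i < alen || j < blen;) {
    if (i < alen && j < blen && !strcmp(a[i], b[j])) {
      printf(" %s", a[i++]);
      j++;
    } else if (i < alen && (j == blen || LCS(i+1, j) >= LCS(i, j+1)))
      printf("-%s", a[i++]);
    else printf("+%s", b[j++]);
  }
#undef LCS
  free(len);
}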

Note that one thing this algorithm (fairly fundamentally) cannot do is detect REORDERED hunks. (I miss the OS/2 tool that did that in 1996. It was really nice to use. Color coded output, drew little diagonal lines between the left side and right side versions when it detected reorders... Alas, IBM proprietary. Part of my formative "don't get addicted to proprietary crap, this too shall pass and then you'll have withdrawal symptoms when it's gone extinct", LONG before its logical continuation "cloud is bad, do not cloud".)

But diff doesn't have a syntax for specially marking reordered hunks (not unified diff, anyway); it just treats them as a deletion plus an insertion. And patch couldn't handle it either because of the not-reprocessing-output limitation. (Git added some file-level move indications, but even they don't have a syntax to move text within a file.)

So if what you're looking for is "here's the set of transforms to turn A into B", this should reliably give you a workable answer. (Modulo fixable bugs.) Will it produce the most human-readable possible answer? Not sure. When the next hunk starts and ends seems to have fairly deterministic right answers, it's the + and - decisions within a hunk that vary. (I think.) And "what were you thinking" should be easy for a human to work through with this algorithm. If somebody comes to me with a good "This was just X, you got confused" test input maybe I can tweak the +/- output emitter to understand what the human saw that the algorithm didn't, but I need the tests first.

(Don't ask me how diff3 works into any of this, I've never used it but I know git does under the covers so will have to care at some point. And I have yet to open the can of worms of marking changes WITHIN lines either.)

I might not have bothered if I'd known you were into diff again, after
your posts about how you were sick of it.

Sorry. I _was_ sick of it (not the code, the "stop what you're doing and listen to me tell what you should be doing instead"), but somebody needs a new feature because kernel, and shortest path to that was beating the behavior out of the codebase I theoretically understand.

I thought you were focused on sh.c.

I am (well, ok, the top of stack is currently kexec for jcore and some boot rom rewrites and SOC documentation, but top of stack for TOYBOX is sh.c), but real world users coming to me with use cases is always a priority boost.

"I need X and can't figure out how to make your tool do it" is the highest feedback there is. That's the part I _can't_ know already. Even when the answer's just documentation, that's still something that was missing.

Anyway, my code and blog is there if you decide you want to try
it, see how it works and find out what histogram diff really does.
I don't expect you will.

I'm definitely reading the blog entry. Thanks for writing it.

BTW you're concerned that the current pending diff.c may have
plagiarized code from toybox.

busybox.

(I should have called the new one dorodango. I should have called mkroot dorodango. If I actually get an automated LFS bootstrapping project working on the level of what aboriginal linux used to do I probably WILL call it dorodango. Keep polishing the ball of mud until it shines...)

I looked at them a bit, and I didn't see a
lot in common, but the read_tok() function you called out in your issue
comment, and a related nameless enum of flags or states, bear a strong
resemblance to read_token() and an enum in busybox diff.c.

Eh, it's not a strong objection, I just got a bit gun-shy after catching a source very obviously (obvious to me, the ex-busybox maintainer) doing that a few years back.

This was not a straight copy, my cleanups tend to be significant rewrites anyway (I'm not entirely sure of the provenance ala ifconfig in cleanup.html), and I would LOVE to give Bradley a PR black eye if he tried to use busybox to sue toybox. (The new SCO.) But I'm not gonna invite it either. Basic hygiene.

Mostly, the problem was that cleaning up the old code was several times more work (for me) than just writing a new one, and when I broke the old code I didn't understand WHY it was broken, so everything turned into endless debugging of code I was trying to reverse engineer just so I could replace it.

Ray

Rob

P.S. Sorry if I'm reading "stop what you're doing, do this instead, I command it" into things that DO NOT MEAN THAT. My motivation has recovered a LOT recently but is still sadly fragile, and that's a me problem. I spent far too much of last year curled up in a ball waiting for the meteor to hit, but now that the worst has basically happened and we're back in The Time Of The Circular Firing Squad, I've gotten on with it again. (The "loss aversion" part is over: we're fscked. Moving on...)

(Selling a house, moving cross country, my wife graduating and getting a full time job so I'm househusbanding now, the industry I'm usually employed in having the same kind of layoffs the dot-com crash and mortgage crisis caused while smiling and claiming everything's fine (I'm ok for now but not all my friends are), getting covid AGAIN, and seeing america's version of brexit coming 6 months ahead of time and REALLY HOPING I WAS WRONG... all that stacked poorly. Not your fault, not Oliver's fault, I feel terrible about being unable to properly harness your enthusiasm.)
_______________________________________________
Toybox mailing list
Toybox@lists.landley.net
http://lists.landley.net/listinfo.cgi/toybox-landley.net
