On 1/31/25 11:18, Ray Gardner wrote:
On Fri, Jan 31, 2025 at 7:19 AM Rob Landley <r...@landley.net> wrote:

On 1/30/25 17:59, Ray Gardner wrote:
[patience diff] may have predated the "histogram" algorithm which is
surprisingly hard to google for...
...
I was really hoping I could implement just ONE algorithm and call it good.
Right now it looks like "histogram" would be that one (it's an improvement
I just started work on my diff branch again after
https://github.com/landley/toybox/issues/489#issuecomment-2588079658
with the goal of replacing code I don't understand with code I do
understand.
...
If I throw away what I've done and replace the code I understand with code I
don't understand, it will go behind "man.c" on the todo list.

As I quoted, you indicated you wanted a histogram diff.

I did. For many years. Heck, view source on https://landley.net/notes-2008.html and the commented out todo list at the top has an entry about it. (I was following his livejournal when he posted it, that's how I found out about it.)

But I also wanted a simple streaming diff that worked like the opposite of "patch", doing one pass scanning along two files and zipping them together without having to load both simultaneously into memory or requiring lseek().

When I sat down to try to implement patience from the old livejournal entry... it was very much not that. And what it did do was computationally expensive in multiple ways, and needed a fallback algorithm anyway.

All these diff algorithms have pathological edge cases, and when fiddling with them I get distracted by what the failure modes would be and what inputs would trigger them. Max memory use, max cpu time, hunks being way bigger than they need to be, hunks being unreadable salad when the actual change isn't...

Back when I was shoveling through this the first time I was wondering if there's some sort of diff test corpus out there to compare algorithms against, but "implementing diff in multiple ways" doesn't seem to be a large enough community for such a thing. (I'd really hoped Bram Cohen had test inputs showing conventional diff vs his new algorithm, but when I went to look I couldn't find them. Maybe there were some in 2007 and they fell off the net?)

I was never worried about producing something that works on a simple test input, I was worried about figuring out what inputs would break what I'd written, in what ways. Where "break" isn't always "wrong answer", but "expensive to compute" or "results are unnecessarily hard for a human to read".

I put a pretty
detailed explanation at the end of my submission (can see it in the
patch) and more, including pseudocode, in my blog post
https://raygard.net/2025/01/28/how-histogram-diff-works/ .

Ooh, very nice. I have the tab open and have read through the first quarter of it, and need more caffeine and a nap before tackling this.

It's (I hope)
easier to understand than Hunt-McIlroy or Myers, as it's not based on
any CS theory but on an empirical method that just works pretty well.
Well enough that Torvalds prefers it for patch submissions
(https://lkml.org/lkml/2023/5/7/206).
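
Sidebar for anyone skimming the thread: the core trick, as I understand it from descriptions of the algorithm (read Ray's blog post for the real thing, I'm paraphrasing and the names below are made up), is to count how often each line occurs, pick a line both sides share with the lowest count (ideally a unique one) as an anchor, split both files there, and recurse into the pieces on either side. Brute force sketch; real implementations hash the lines and cap the occurrence count they'll accept:

#include <string.h>

// Return the index in a[] of the best split anchor, or -1 if the two
// sides share no lines at all.
int pick_anchor(char **a, int alen, char **b, int blen)
{
  int best = -1, best_count = 0, i, j;

  for (i = 0; i < alen; i++) {
    int count = 0, shared = 0;

    for (j = 0; j < alen; j++) count += !strcmp(a[i], a[j]);
    for (j = 0; j < blen; j++) shared |= !strcmp(a[i], b[j]);
    if (shared && (best < 0 || count < best_count)) {
      best = i;
      best_count = count;
    }
  }

  return best;
}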

What is "hunky-dory code" in this context?

Just a reference to your function names hunky() and dory() and their
friends in your hunk detection (etc.) code.

Ah, ok. Not a bad name for it, I just hadn't connected the dots. (I have terrible naming hygiene during development, especially while sleep deprived. I went through later to remove everything named after Roger Zelazny's "nine princes in amber" series from patch.c and all the beatles references in ps.c, but that's a cleanup pass at the end.)

The core of that algorithm is finding the end of the next hunk, either the next X matching lines or EOF on either input. The rest of it boils down to counting consecutive pairs of matching lines (cheap) or figuring out what to emit once you've delineated a hunk (tricksy but not expensive).

The brute force way of finding the next matching line pair is N^2 in the size of the current hunk: just read a line from one of the inputs and scan back over the previously seen lines of the other to see if it matches. If so, can I read MORE lines from that input source to get X consecutive matches at that point in the other input source? (At which point I've ended the hunk, and generally one of the inputs has read ahead and has some buffered lines.) Alternate which input you're reading a new line from, because we dunno which one did an insert and which one did a deletion.
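
Strawman of that search (made-up names, naive buffer growth, none of the actual diff.c plumbing), mostly to show where the N^2 comes from; the "X consecutive matches" confirmation would start from the pair this returns:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct half {
  FILE *fp;    // one input file (caller zero-initializes the rest)
  char **line; // lines buffered since the current hunk started
  int len;     // how many lines are buffered
  int eof;
};

// Read one more line into h, returning its index, or -1 at EOF.
static int read_one(struct half *h)
{
  char *s = 0;
  size_t n = 0;

  if (h->eof || getline(&s, &n, h->fp) < 0) {
    free(s);
    h->eof = 1;

    return -1;
  }
  h->line = realloc(h->line, (h->len+1)*sizeof(*h->line));
  h->line[h->len++] = s;

  return h->len-1;
}

// Alternate sides until a newly read line matches something the other side
// already buffered (or both sides hit EOF). On a match, *ai and *bi get the
// matching indexes into a->line[] and b->line[].
static int next_match(struct half *a, struct half *b, int *ai, int *bi)
{
  struct half *from = a, *other = b;

  while (!a->eof || !b->eof) {
    int i = read_one(from), j;

    if (i >= 0) for (j = other->len-1; j >= 0; j--) {
      if (!strcmp(from->line[i], other->line[j])) {
        *ai = (from == a) ? i : j;
        *bi = (from == a) ? j : i;

        return 1;
      }
    }
    from = (from == a) ? b : a;
    other = (other == a) ? b : a;
  }

  return 0;  // EOF on both sides: the hunk runs to end of file.
}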

That said, individual hunks larger than a couple hundred lines are uncommon. There IS pathological input that would make this slow and use a lot of memory (basically diff -u <(seq 1 1000000) <(seq 1000000 -1 1) ) but it's still only N^2 in the number of input lines (not any sort of X^N shenanigans, it should finish even given a gigabyte of differing input on each end, just not conveniently). And its pathological memory use case is the base memory use of the other algorithms. And I COULD apply hashing to speed the search up (modulo -i and -b and -I would have to apply to the hash too in order to be of any benefit, and don't ask me how -I regex is supposed to hash, "cut out the match and replace it with an empty string" isn't actually what the man page describes -I as doing...). But I have yet to find a real world input that would actually benefit from it rather than net lose from the overhead, and NOT doing it is simpler.

The positive tradeoff is it doesn't care about the size of the _file_ because the matching parts are almost free. Comparing two matching lines and discarding them when not currently in a differing hunk is O(1) ("yup, they still match up, keep going"), and everything before the current hunk has already been output/discarded so makes no difference to processing the current hunk. And that SHOULD be optimizing for the common case, I think?
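
Same caveats as above (made-up name, not the actual diff.c code, and ignoring the rolling buffer of the last few matching lines you'd keep around to print as leading context), but the in-sync stretch of the two files boils down to roughly this:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

// Discard matching line pairs until the inputs diverge (or one ends),
// returning how many pairs matched. The caller would then collect a hunk
// (via something like next_match() above) and come back here.
long skip_common(FILE *a, FILE *b)
{
  char *la = 0, *lb = 0;
  size_t na = 0, nb = 0;
  long count = 0;

  for (;;) {
    ssize_t ra = getline(&la, &na, a), rb = getline(&lb, &nb, b);

    // Divergence or EOF: the lines just read belong to the hunk collector.
    if (ra < 0 || rb < 0 || strcmp(la, lb)) break;
    count++;  // still in sync: O(1) work, nothing kept around
  }
  free(la);
  free(lb);

  return count;
}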

Note that the X matching lines needed to end the current hunk is actually 2x+1, where x is the usual start/end context, because if your hunks have 3 context lines but your files have "change, 5 same lines, another change" then that's one hunk with 5 consecutive unchanged lines in the middle. No really:

$ diff -u <(printf '%s\n' a b c d e f g h i j) <(printf '%s\n' a j c d e f g h k j)
--- /dev/fd/63  2025-01-31 13:13:37.357744367 -0600
+++ /dev/fd/62  2025-01-31 13:13:37.357744367 -0600
@@ -1,10 +1,10 @@
 a
-b
+j
 c
 d
 e
 f
 g
 h
-i
+k
 j
landley@driftwood:~/toybox/clean2$ diff -u <(printf '%s\n' a b c d e f g h i i j) <(printf '%s\n' a j c d e f g h i k j)
--- /dev/fd/63  2025-01-31 13:13:54.946009279 -0600
+++ /dev/fd/62  2025-01-31 13:13:54.950009339 -0600
@@ -1,5 +1,5 @@
 a
-b
+j
 c
 d
 e
@@ -7,5 +7,5 @@
 g
 h
 i
-i
+k
 j

This is for two reasons: 1) ending a hunk and starting a new one produces more output than just letting it run, and 2) patch doesn't reprocess its output and thus can't apply overlapping hunks. (It also needs hunks to occur in order. Strangely that's not just a limitation of MY patch, other patch programs consider hunks out of order within the same file illegal too. Concatenating patches works because they close and re-open the file, which starts over at the beginning. That's part of the reason I did mine my way, because as long as that limitation's there anyway...)

Yes, the 2x case being treated as contiguous is a weird edge case. I would have thought 2x was where it would break (the point where patch doesn't have to reprocess output of a previous hunk to apply the next one), but it's 2x+1 (because that's where the output size is equivalent to adding an @@ resync line), which means a hunk can't be considered "done" until we've seen 2x+1 matching trailing lines (or EOF on at least one input).

Once you've loaded and terminated a hunk, you know how many leading and trailing common lines the resulting hunk should output (leading common lines naturally match up, but I mentioned needing to treat TRAILING common lines specially because they don't automatically zip back together all the time, although I have yet to find my test case where they didn't), and if they don't it sends the wrong signal to patch (which COUNTS leading/trailing match lines to detect "must match SOF/EOF").
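
The bookkeeping boils down to roughly this (names made up, not the actual code): a hunk is "done" after 2x+1 consecutive matching trailing pairs or EOF, only up to x of the buffered matches at each end actually get emitted, and a shorter-than-x lead or trail then only ever happens at SOF/EOF, which is exactly what patch counts to detect that:

static int min(int a, int b) { return a < b ? a : b; }

// A hunk can end once we've seen 2*context+1 consecutive matching trailing
// line pairs, or hit EOF on an input.
int hunk_done(int trailing_matches, int context, int hit_eof)
{
  return hit_eof || trailing_matches >= 2*context+1;
}

// Of the matching lines buffered at either end of the hunk, emit at most
// "context" of them. Having fewer than "context" available means the hunk
// touches SOF (leading) or EOF (trailing), which is what patch infers.
int emit_context(int buffered_matches, int context)
{
  return min(buffered_matches, context);
}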

That was the pending bug I needed to fix back when I mothballed this, because I got too many "why are you ignoring code that works to go off and putter with something irrelevant, shame on you wasting our time" pokes and walked away for a while. The challenging part wasn't fixing the bug, the problem was delineating the edge cases and coming up with all the tests for them. Because you can mix SOF and EOF into there too, and it should handle them all right, and the hard part isn't handling them right, it's making sure you've tested each one so you can PROVE it's handling them right. (I suspect I blogged about this at the time...)

And then you have a pile of lines in between that need to be output with +, space, and -. That's the part where I thought patience/histogram might come in (and potentially the constraint of operating WITHIN a known hunk might cancel out the computational expense), but doing patience _within_ already delineated hunks turned out not to be what it was designed for at all.
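
One conventional way to make those +/- decisions within an already delineated hunk (which may or may not be where the code ends up, this is an illustration with made-up names, not the diff.c emitter) is a plain longest common subsequence over just the hunk's lines, which stays cheap because hunks are usually small. Sketch, assuming the buffered lines kept their trailing newlines:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// a[0..alen) and b[0..blen) are the differing middle lines of one hunk.
void emit_hunk_body(char **a, int alen, char **b, int blen)
{
  int *len = calloc((alen+1)*(blen+1), sizeof(int)), i, j;
#define LCS(i, j) len[(i)*(blen+1)+(j)]

  // Classic LCS table, filled backward so walking forward recovers it.
  for (i = alen-1; i >= 0; i--) for (j = blen-1; j >= 0; j--)
    LCS(i, j) = !strcmp(a[i], b[j]) ? LCS(i+1, j+1)+1
      : LCS(i+1, j) > LCS(i, j+1) ? LCS(i+1, j) : LCS(i, j+1);

  // Walk forward: matches are context, otherwise prefer emitting the
  // deletion first when it's a tie, which matches conventional output.
  for (i = j = 0; i < alen || j < blen;) {
    if (i < alen && j < blen && !strcmp(a[i], b[j])) {
      printf(" %s", a[i++]);
      j++;
    } else if (i < alen && (j == blen || LCS(i+1, j) >= LCS(i, j+1)))
      printf("-%s", a[i++]);
    else printf("+%s", b[j++]);
  }
#undef LCS
  free(len);
}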

Note that one thing this algorithm (fairly fundamentally) cannot do is detect REORDERED hunks. (I miss the OS/2 tool that did that in 1996. It was really nice to use. Color coded output, drew little diagonal lines between the left side and right side versions when it detected reorders... Alas, IBM proprietary. Part of my formative "don't get addicted to proprietary crap, this too shall pass and then you'll have withdrawal symptoms when it's gone extinct", LONG before its logical continuation "cloud is bad, do not cloud".)

But diff doesn't have a syntax for specially marking reordered hunks (not unified diff, anyway); it just treats them as a deletion plus an insertion. And patch couldn't handle it either because of the not-reprocessing-output limitation. (Git added some file-level move indications, but even they don't have a syntax to move text within a file.)

So if what you're looking for is "here's the set of transforms to turn A into B", this should reliably give you a workable answer. (Modulo fixable bugs.) Will it produce the most human-readable possible answer? Not sure. When the next hunk starts and ends seems to have fairly deterministic right answers, it's the + and - decisions within a hunk that vary. (I think.) And "what were you thinking" should be easy for a human to work through with this algorithm. If somebody comes to me with a good "This was just X, you got confused" test input maybe I can tweak the +/- output emitter to understand what the human saw that the algorithm didn't, but I need the tests first.

(Don't ask me how diff3 works into any of this, I've never used it but I know git does under the covers so will have to care at some point. And I have yet to open the can of worms of marking changes WITHIN lines either.)

I might not have bothered if I'd known you were into diff again, after
your posts about how you were sick of it.

Sorry. I _was_ sick of it (not the code, the "stop what you're doing and listen to me tell what you should be doing instead"), but somebody needs a new feature because kernel, and shortest path to that was beating the behavior out of the codebase I theoretically understand.

I thought you were focused on sh.c.

I am (well, ok, the top of stack is currently kexec for jcore and some boot rom rewrites and SOC documentation, but top of stack for TOYBOX is sh.c), but real world users coming to me with use cases is always a priority boost.

"I need X and can't figure out how to make your tool do it" is the highest feedback there is. That's the part I _can't_ know already. Even when the answer's just documentation, that's still something that was missing.

Anyway, my code and blog is there if you decide you want to try
it, see how it works and find out what histogram diff really does.
I don't expect you will.

I'm definitely reading the blog entry. Thanks for writing it.

BTW you're concerned that the current pending diff.c may have
plagiarized code from toybox.

busybox.

(I should have called the new one dorodango. I should have called mkroot dorodango. If I actually get an automated LFS bootstrapping project working on the level of what aboriginal linux used to do I probably WILL call it dorodango. Keep polishing the ball of mud until it shines...)

I looked at them a bit, and I didn't see a
lot in common, but the read_tok() function you called out in your issue
comment, and a related nameless enum of flags or states, bear a strong
resemblance to read_token() and an enum in busybox diff.c.

Eh, it's not a strong objection, I just got a bit gun-shy after catching a source very obviously (obvious to me, the ex-busybox maintainer) doing that a few years back.

This was not a straight copy, my cleanups tend to be significant rewrites anyway (I'm not entirely sure of the provenance ala ifconfig in cleanup.html), and I would LOVE to give Bradley a PR black eye if he tried to use busybox to sue toybox. (The new SCO.) But I'm not gonna invite it either. Basic hygiene.

Mostly, the problem was that cleaning up the old code was several times more work (for me) than just writing a new one, and when I broke the old code I didn't understand WHY it was broken, so everything turned into endless debugging of code I was trying to reverse engineer just so I could replace it.

Ray

Rob

P.S. Sorry if I'm reading "stop what you're doing, do this instead, I command it" into things that DO NOT MEAN THAT. My motivation has recovered a LOT recently but is still sadly fragile, and that's a me problem. I spent far too much of last year curled up in a ball waiting for the meteor to hit, but now that the worst has basically happened and we're back in The Time Of The Circular Firing Squad, I've gotten on with it again. (The "loss aversion" part is over: we're fscked. Moving on...)

(Selling a house, moving cross country, my wife graduating and getting a full time job so I'm househusbanding now, the industry I'm usually employed in having the same kind of layoffs the dot-com crash and mortgage crisis caused while smiling and claiming everything's fine (I'm ok for now but not all my friends are), getting covid AGAIN, and seeing america's version of brexit coming 6 months ahead of time and REALLY HOPING I WAS WRONG... all that stacked poorly. Not your fault, not Oliver's fault, I feel terrible about being unable to properly harness your enthusiasm.)
_______________________________________________
Toybox mailing list
Toybox@lists.landley.net
http://lists.landley.net/listinfo.cgi/toybox-landley.net
