On 9/19/22 18:45, Ray Gardner wrote:
> Rob, first, let me say thank you for at least looking at my post. Sorry I
> did not see your August 27 blog post sooner. I fell behind, I thought
> any response to my writeup would be in the list.
I try to write blog entries regularly, but they're quick, dashed-off things without all the right HTML tags, with lots of [LINK find url to thingy] placeholders and sentences trailing off unfinished as I get distracted. I go through and edit and upload them in batches later. Making them seem coherent takes far longer than initially writing them did. :)

> I am wondering if you read my post all the way through, and did it give
> you an understanding of the actual LCS algo. Recall that I wrote it for you,
> specifically though not only, because you (and Elliott) encouraged me to.

I still have the tab open, but I haven't had the stomach to circle back around to diff.c recently. (Sigh, I should get it over with.)

>> Ray Gardner wrote up his own explanation of the classic diff algorithm,
>> and I'm trying to read it to see if it makes more sense to me than the
>> other writeups did.
>
> What other writeups did you look at, that try to explain any LCS
> algorithms?

Other than the original toys/pending/diff.c implementation, and the paper I emailed Doug McIlroy to get a URL to because the Bell Labs website everybody was linking to had gone down and wasn't in archive.org? (Doug beat Elliott with the updated URL by about 20 minutes.) And the Wikipedia explanation of the algorithm (https://en.wikipedia.org/wiki/Diff#Algorithm)?

I've closed a lot of the tabs, but here's a few that haven't cycled out of about:history yet:

https://blog.jcoglan.com/2017/02/12/the-myers-diff-algorithm-part-1/
https://blog.jcoglan.com/2017/09/19/the-patience-diff-algorithm/
https://www.quora.com/How-does-Bram-Cohens-patience-algorithm-compare-relative-to-myers-minimal-and-histogram-options-for-gits-diff-function
https://bramcohen.livejournal.com/73318.html
https://tiarkrompf.github.io/notes/?/diff-algorithm/aside2
https://tiarkrompf.github.io/notes/?/diff-algorithm/

There were more...
> The only one I have found is Thomas Guest's
> https://wordaligned.org/articles/longest-common-subsequence
> which (after much prelim) explains Hirschberg's linear space LCS, which
> is good on space but always N^2 time so not practical.

Nope, hadn't seen that one. I eventually broke down and looked at the 15 year old busybox implementation I used to maintain, and the explanation at https://git.busybox.net/busybox/tree/coreutils/diff.c?h=f86a5ba510ef#n839 made a reasonable amount of sense. (Knew I'd seen it explained SOMEWHERE.) But that still reads both entire input files in one go and hashes everything, so I'd have to care about the choice of hashing algorithm and potential collisions.

I didn't check if the ancient busybox implementation seeks around and re-reads lines multiple times, but back when I was wrestling with the submitted toys/pending/diff.c I was trying to figure out if I needed to use long for both offset and length, or could get away with int for length (so those two were just 12 bytes/line on 64 bits), because a quick check of the likely average length of lines in the linux kernel source puts it at around 25 bytes:

$ X=0; for i in */*.c; do while read j; do ((X+=${#j})); done < $i; done; echo $X
13993089
$ wc */*.c | tail -n 1
 571208 1877113 15297205 total
$ python -c 'print 13993089.0/571208'
24.4973617316

Meaning it was actually pretty easy for the metadata we were tracking _about_ the lines to get bigger than just keeping the lines themselves. And THAT still didn't answer "why do you need the whole file" when diff hunks can't occur out of order: once you output a hunk you can't go back, and you can tell when a hunk ends by an appropriate number of matching lines.
The implementation I did at https://github.com/landley/toybox/blob/branch/toys/pending/diff.c was tailored specifically to unified diffs, and worked by reading a line at a time from each input and discarding what it could as soon as it could (so immediately matching lines weren't retained, except enough history to start the current hunk). It identified and retained one diff hunk at a time, so it scaled with the size of the hunk, not the size of the file. (Sure, that's not efficient if the entire file is one big hunk, but at that point what is?) It didn't have to make special provision for nonseekable input ala <(command) because it read a line at a time and retained in memory only what it hadn't yet matched off and disposed of. There was no hash algorithm, just a comparison function which handled all the case insensitivity and whitespace skipping consistently in one place.

That approach made sense to me, but I eventually admitted to myself that people will never stop objecting to me using anything other than the one true diff algorithm, so I stopped working on it.

Rob
_______________________________________________
Toybox mailing list
[email protected]
http://lists.landley.net/listinfo.cgi/toybox-landley.net
