On Wed, Jul 13, 2022 at 11:58 PM Rob Landley <[email protected]> wrote:
> On 7/12/22 19:13, enh via Toybox wrote: > > so.. --transform works (though it confused people that it's not in the > --help > > output; they didn't know that there's a --xform synonym), > > Well, it doesn't FULLY work yet: those trailing type indicators aren't in > yet. I > have a todo thingy for it and think I know how I want to handle it now (sed > --tartransform enabling extra s///X trailing selectors, making the result > NULL > terminated, disabling NULL output repacement so it doesn't get out of > sync, and > stripping a file type indicator character from the start of each input > line; yes > this is the correct use for a --longopt, internal nonsense one program > passes to > another which no human should ever type...) > yeah, no worries --- sufficient unto the day are the bugs that are currently getting in people's way :-) > But I've been distracted trying to wrap my head around diff, and am > currently > writing a toy diff that works like the reverse of my "patch" where it > DOESN'T > load the whole file in but just finds hunks in a streaming manner. Probably > terrible results but I'm curious. (Where are good diff TEST inputs, that's > what > I wanna know. Currently I'm diffing linux-2.6.12 and linux-2.6.13 to see > what > fun corner cases come out of that. > > I actually strongly suspect a streaming approach would work reasonably > well for > unified diffs because A) patch hunks can't occur out of order (a move is > always > an insert and a delete), B) we only need to remember 3 trailing "match" > lines to > start a hunk and can END a hunk whenever we've seen 6 trailing "match" > lines. > > Yes that's 6 not 3: patch can't handle overlapping hunks, so if there are > fewer > than 6 match lines we can't break the hunk and guarantee non-overlapping > end/start context. This is why a hunk can have up to head+tail-1 many > consecutive match lines in the middle of it, because: > > @@ count @@ > start one > start two > start three > +add one > -zap one > unchanged one > unchanged two > unchanged three > unchanged four > unchanged five > +add two > -zap two > end one > end two > end three > > Would otherwise be: > > @@ count @@ > start one > start two > start three > +add one > -zap one > unchanged one > unchanged two > unchanged three > @@ count @@ > unchanged three > unchanged four > unchanged five > +add two > +zap two > end one > end two > end three > > And that writes three to the output and then backs UP to read it in again, > which > patch can't do. (I tried it to see if I needed to implement it in mine.) > > The trick with a "streaming" diff is of course determining what's a real > match > and what's a spurious match, which is what all this "find longest run of > match > lines" algorithmic stuff is for. It's not to make the diff CORRECT, it's > to make > it small/clean in nasty corner cases, which C is full of largely due to > blank > lines and runs of: > > } > } > } > } > } > > But UNIFIED diff needing six consecutive lines of match to end a hunk > might make > that less horrible? The naive approach is N^2 on the size of the hunk, but > individual diff hunks are seldom more than a hundred lines. > > And the streaming approach does not care at ALL about the size of the > _file_. > You can feed it gigabytes of input and for the matching sections it's going > read/compare/discard cycling through three trailing lines of context so it > can > start a new hunk when it sees a difference... > > I strongly suspect there are pathological inputs that break this, or else > why > would all these guys have done the complicated "read the whole file in at > once" > approach in the first place? But I can't find what those inputs ARE. My > real > problem here is a shortage of good TEST CASES... > (i know nothing useful about diff, other than where to find the pdf of the original paper :-) ) > > but in the meantime > > the kernel build script now uses --null with > > -T: > https://cs.android.com/android/kernel/superproject/+/common-android-mainline:build/kernel/build.sh;l=964 > > < > https://cs.android.com/android/kernel/superproject/+/common-android-mainline:build/kernel/build.sh;l=964 > > > > > > i think implementing that is a sub-line change here... > > > > for (;TT.T; TT.T = TT.T->next) do_lines(xopenro(TT.T->arg), '\n', > do_XT); > > What does this... ah, another null termination variant. > > The man page says "subsequent". Affecting _subsequent_ -T. So technically, > you > could have: > > tar -T file-with-newlines.txt --null file-with-nulls.txt > > Which lib/args.c fundamentally can't cope with because all command line > arguments get processed before the command's main gets called. There's a > similar > issue in sed.c: > > https://github.com/landley/toybox/blob/0.8.7/toys/posix/sed.c#L1009 > > > ...but i'm assuming you'd also want to add --no-null (last one wins), > > Yeah, modulo it still won't help the sequencing issue above. in a sense, it makes it worse --- they explicitly document that you can have stuff like `--null -T nulls --no-null -T newlines`. and yet it's only the -T file. not the -X file. (i mean, i get YAGNI, but there's an argument for orthogonality, and personally i'd put "--null should work with all file lists" over "i should be able to alternate between nulls and newlines if i want".) > But as with > "truncate -s 8G" not working on 32 bits (because # parses into a long), > there > are some corner cases in the plumbing that I have TODO items about but the > fix > is uglier than the problem until somebody comes to me with a real world > bug it > causes... > i'm interpreting this as a "will accept patch for just --null [that affects all -Ts, not just following ones], since that's all anyone wants right now"... sent! > > and i'm not sure how to write that in the USE_ line? > > I've been thinking about --no-longopt as general plumbing for a while. > It's on > the todo list, because it comes up a lot: > > $ zgrep '[-]no-' /usr/share/man/man?/*.gz | wc -l > 787 > > Although it hasn't been worth it yet because so far most of the --no-blah > longopts in toybox (grep '(no-' toys/*/*.c ) haven't had a corresponding > non-no > version they toggle. They're mostly "--no-fork", "--no-dereference", > "--no-symlinks" and so on. Even tar has --no-recursion. The exceptions are > both > in pwgen.c (--capitalize and --numerals have --no- variants). > > But in the meantime, when there's a short alias then the [-abc] logic will > switch off the previous one. (Which is what pwgen is doing with > [-cA][-n0].) > > When there aren't short aliases, here's a TRULY HORRIBLE trick to add them: > > NEWTOY(blah, "\376(null)\375(no-null)[-\376\375]", BLAH); > > Octal escapes for high ascii. (Yes I added support in mkflags.c a while > back. :) > Rob >
_______________________________________________ Toybox mailing list [email protected] http://lists.landley.net/listinfo.cgi/toybox-landley.net
