On 03/23/2016 07:40 PM, Andy Chu wrote:
>>> 2) It does indeed seem undocumented -- I can't find any mention of
>>> this (unintuitive) behavior in the GNU sed manual or FAQ:
>>
>> The _fun_ part is that this is doubly nonstandard, because the logic
>> I've used so far is if the line of input didn't have a newline, we
>> don't add one _until_ we output another line after it, in which case
>> we backfill the missing newline.
>
> I don't know enough about sed to say anything useful about this part
> ... I pretty much just stick to s// .

I literally learned sed by writing a better busybox sed implementation
back in 2003, because when I tried to build Linux From Scratch with
busybox commands in the $PATH, the first package's autoconf broke badly:

  http://lists.busybox.net/pipermail/busybox/2003-September/043403.html

I had no _idea_ what I was doing, just a lot of failing test cases.

I suspect that was the last time I read a GNU "info" page for anything.
(That or writing "sort" a year or two later. These days I don't bother
with that, if it's not in the man page it doesn't matter. If some script
out there uses it, we'll notice when we get the bug report and decide
then.)

That said, have you read the toybox sed --help? I tried REALLY HARD to
cover everything thoroughly and concisely.

> Though I am wondering about the general toybox philosophy of
> compatibility. I think some compatibility is essential, but extreme
> compatibility is often at odds with the goal of simple code. I would
> define extreme compatibility as "bug for bug" compatibility.

Agreed. The reason I was going through all the weird tests last message
was to figure out what the correct behavior _is_. This is usually much
harder than implementing it. :)

I have some rules of thumb like expecting every --long-opt to have a
corresponding single character short option, so whenever there's a
--long-opt without a short opt in a man page that means it's probably
useless crap and has a much higher bar to clear to convince me to
implement it.

But mostly, I implement what looks good on the first pass and then wait
for people to complain, then figure out what to do then.

The problem to solve here is that Debian is currently broken when using
toybox. That's a real problem that needs to be fixed, but there's more
than one way to fix it. Debian introduced this bug within the past few
years and maybe we can convince them to fix it on their end, that's one
possible fix. But for the moment they need a different behavior than we
provide.
If I can figure out a way to alter the toybox logic in a couple lines to
do what they need, great. If it turns out to be intrusive and horrible
and fiddly and not make any sense and be a SPECIAL CASE (this code is
shared by c, i, and a, but what does c\ with no follow-up mean)... if I
try to implement it and don't like it then maybe I revisit making puppy
eyes at Debian to fix their stuff instead of changing toybox.

So my first question is, is this an _extension_, or is it a _bug_?
(Hence probing the behavior, which looks really bug-like.)

> I've looked through most of the docs, which generally are very lucid
> about the project goals, but didn't see anything explicitly about
> compatibility. (e.g. http://landley.net/toybox/design.html)

It's case by case. It's really hard to make blanket calls in isolation.

There's a little in the about page, maybe a little in the roadmap... I
see the design page has "features" and misses this entirely. :)

> I know it's basically a mixture of POSIX, other standards, common
> practice like GNU coreutils, and Aboriginal Linux use cases. I see
> that Aboriginal has a sources/patches directory, so I wonder where you
> draw the line at where to patch vs. where to re-implement bugs.

Those patches generally aren't to build with toybox, they're toolchain
patches because I'm building 2016 packages with a 2007 toolchain (I
don't ever ship GPLv3 code unless paid to do so by an employer, I
consider it one of the nastier and more litigation-prone proprietary
licenses), kernel patches because they broke mips and arm again, uClibc
patches because that thing's last release was 2012 (weaning the project
off it, switching to musl, but not all architectures are supported by
musl yet)...

When I hit a toybox issue, I mostly fix toybox.

Also, even if I patch stuff when _I_ build it, I want _other_ people to
be able to build the vanilla versions, without needing my patches.
If buildroot or gentoo or something has toybox commands in the $PATH
during its build, I want them to _work_. If I can get an upstream
package to accept a patch, that's a good fix. Assuming that's the _only_
package that has that issue, which I'm never sure of.

Generally I implement the set of features I'm familiar with (and/or the
standard or man page prominently requires), and then wait for somebody
to complain. When I find a user of other corner cases in the wild,
somebody complained (the author of that package).

Sometimes I respond with "ew, no" even when evaluating the standard.
(I'm not implementing EBCDIC in dd. It's just not happening.) But again,
case by case basis, judgement calls, and I can be argued out of my
position. (Elliott's done that several times. He represents the final
word for Android, and has a billion seats behind him. I did NOT want to
open the selinux can of worms, but... it's in there now.)

> I noticed the other recent example of making a change for the Perl build.

Comparing Debian to the sed fix for Perl: I don't expect perl to ever
get fixed, and even if it did people continue to build old versions of
perl because new ones are bigger and more complicated, and are they the
_only_ people who use that behavior? Adding the filter wasn't that big
and seemed reasonable to do. (Also, perl is used by 90% of the build
environments out there, although it should build natively on target and
NOT have to cross compile. If Debian doesn't work for a while there's
another dozen major distros.)

When I add this sort of corner case, I want to document it, preferably
in the help text. I want it to be part of the spec for the toybox
version of this command. If it's not something I want to document, it's
probably not something I should implement.

(That said, I haven't always _done_ so, because writing good
documentation is an enormous timesink and I get distracted by todo item
du jour.
And documentation has the problem that "concise is better" because
people have to READ it and it bloats the binary. In this case sed didn't
already explain how to regex (there's a section on _using_ regexes but
it assumes you know regex syntax from elsewhere, and referring to "man 7
regex" is awkward when we don't have a man command and a URL to the
posix regex page is long and hard to wordwrap...). The toybox help is
_mostly_ otherwise self contained. I dunno, maybe grep --help should
explain regex syntax?)

But anyway, that's part of my judgement on whether or not to implement
extensions and corner cases. Is this documentable cleanly?

I note that Rich Felker takes 100% the opposite approach with musl,
pushing patches into other packages, even for little stuff, to the point
where he doesn't provide a _MUSL_ #ifdef people can use to identify his
libc. (Of course I add one in aboriginal linux when building musl. :)

> Given what I know, I would classify implementing 'c\' and 'a\' as
> bug-for-bug compatibility -- that seems like truly an accidental
> implementation quirk rather than a useful extension or generalization
> of existing behavior.

Yes. But somebody is using it to do something specific. It would be nice
if they stopped, and if people want to send patches to debian to that
effect I'm all for it. But are they the ONLY user of this "trick"?

At the moment, what I can fix right now is toybox. If I can find
something that qualifies as a "fix".

The EASY thing I can do is remove the line checking for and reporting
the error, in which case the code falls through and treats it as having
a blank line after it, and thus the "a\" adds a blank line after every
match. Which is why I was doing all those tests to see what the other
one did and going "that's disgusting".

By the way, guess what the version I wrote for busybox many moons ago is
doing?

  $ echo -en "one\ntwo" | busybox sed 'a\'
  one
  two

(With a single newline after the two.)
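
To make the "easy" fix concrete, here is a rough Python model of what it would do when combined with the newline-backfill rule quoted at the top of this message. This is only a sketch, not toybox's actual C code: the function name, the `last_only` flag (standing in for the `$` address), and the internal `emit` helper are all made up for illustration, and it assumes the empty `a\` is treated as appending one empty line.

```python
import re


def sed_append_empty(data: str, last_only: bool = False) -> str:
    """Toy model of sed 'a\\' (or '$a\\' when last_only is True),
    assuming the empty append text is treated as one empty line."""
    out = []
    missing_newline = False  # True if the last emitted line had no newline

    def emit(text: str, has_newline: bool) -> None:
        nonlocal missing_newline
        if missing_newline:
            out.append("\n")  # backfill the newline the previous line lacked
        out.append(text)
        if has_newline:
            out.append("\n")
        missing_newline = not has_newline

    # Split into lines, keeping track of whether each one ended in "\n".
    lines = re.findall(r"[^\n]*\n|[^\n]+", data)
    for i, raw in enumerate(lines):
        has_newline = raw.endswith("\n")
        emit(raw[:-1] if has_newline else raw, has_newline)
        if not last_only or i == len(lines) - 1:
            emit("", True)  # the appended (empty) line
    return "".join(out)
```

Under those assumptions, feeding it "one\ntwo" with `last_only=True` gives the unterminated last line its newline plus one extra empty line, because emitting the appended line is what triggers the backfill.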
That's another point of reference: busybox has 15 years of testing in
the "sheep across the minefield" sense (a largeish chunk of it done by
me way back when). What can _they_ get away with not doing?

In this case the specific test case that was reported was "$a\" which
adds a newline to the LAST line. So I think that fix is good enough for
debian: we might add an extra newline to the last line (because we
_add_a_line_ and that triggers the "there wasn't a newline on the
previous line so put one before the next line" logic in emit()) but that
doesn't seem like it would hurt anything? Then again, busybox _isn't_
adding an extra line and I'm curious why?

I also needed to look at the code to make sure that sed "a" still gives
an error (which ubuntu does and busybox _doesn't_, it treats a and a\
the same), and decide how I wanted that to behave.

Another fun thing is that sed 'a' is _already_ extended, saying "sed 'a
stuff'" on the same line isn't in posix. Posix REQUIRES the \ and info
on the next line. And you can tell this was a gnu extension because
arbitrary whitespace between "a" and "stuff" gets discarded, so how do
you add a line with whitespace at the beginning? (Answer in my case: it
strips _one_ kind of leading space, so give it a tab to eat if you want
leading spaces and give it a space to eat if you want leading tabs. That
way you at least _can_ get leading spaces in your output, although I see
I didn't document it and probably don't have a test for it. I wonder if
something broke when I tried it and I reverted it? Hmmm...)

Of course it would be nice if posix mentioned _any_ of this stuff, but
they're totally Jorged so who cares what they think.

  http://landley.net/notes-2016.html#11-03-2016

> There are some things you can guess the
> semantics of, even if they're not documented of, but this doesn't seem
> like one of those things.

I can test the semantics by running probes and seeing what output (or
error messages) I get. I've done that a LOT...
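
As an aside on the one-line "a stuff" form mentioned above, the "strips one kind of leading space" rule can be sketched as follows. This is one possible reading of that sentence, not confirmed behavior: it assumes the whole leading run of whichever whitespace character comes first (space or tab) gets eaten, and the function name is hypothetical.

```python
def strip_one_kind(text: str) -> str:
    """Strip the leading run of whichever of space/tab comes first.

    Assumed reading of the rule: a leading tab eats tabs only, so any
    spaces after it survive, and vice versa.
    """
    if text and text[0] in " \t":
        return text.lstrip(text[0])  # strip only that one character kind
    return text
```

So under this reading, "a" followed by a tab and then two spaces keeps the two spaces, and "a" followed by a space and then a tab keeps the tab.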
But in this case, making "sed -e 'a\'" be a synonym for "sed -e 'a\' -e
''" is probably the correct answer. It's the simplest thing to code
(actually removes one line), fixes the problem at hand, and it looks
like busybox got away with doing that.

Then I wait for somebody to complain. :)

It's "close enough".

> FWIW there are some other sed implementations listed here, mainly on
> commercial Unixes:
>
> http://sed.sourceforge.net/sedfaq7.html

Which I do not have access to and doubt anybody's regression tested
anything against this decade. Those don't matter.

BSD might if it wanted to run under qemu and build _any_ package I'd
tried. Unfortunately BSD has this inward-looking portage thing where the
whole of userspace is merged into a single source control repository and
the entire OS builds from that and you need the developers' permission
to use a different implementation of ssh and trying to build external
packages you downloaded from elsewhere is a strange foreign concept
that's just not how they do it...

If you want to know why Linux beat BSD, it's because Linux is modular
commodity interchangeable parts available from multiple sources with
multiple compatible implementations of stuff as fundamental as the C
library (like whitebox PCs), and BSD is a giant monolithic entity with
gatekeepers saying what is and isn't ok (like Macintosh hardware) and
you have to fork the entire OS (kernel, libc and all) if you want to
swap dropbear for openssh.

I can make my own linux distribution the same way I could assemble my
own PC from parts. It's not a big deal, I can do one just for me without
starting a large organization to do it, nobody else needs to care and I
don't need permission. Over in BSD, if you want to customize stuff you
join a big organization and convince them to add an extra configuration
option for your use case. That's why Net/Free/Open each maintain their
own kernels: if you fork you have to fork EVERYTHING, that's how their
build is designed.

(P.S.
If debian didn't build with bsd's sed, would either debian or bsd care?
If not, bsd sed's behavior does not give a good indication of what a
Linux sed's behavior should be.)

> Andy

Rob
