Re: [Toybox] sed -e '$a\'

Rob Landley Thu, 24 Mar 2016 10:17:24 -0700

On 03/23/2016 07:40 PM, Andy Chu wrote:
>>> 2) It does indeed seem undocumented -- I can't find any mention of
>>> this (unintuitive) behavior in the GNU sed manual or FAQ:
>>
>> The _fun_ part is that this is doubly nonstandard, because the logic
>> I've used so far is if the line of input didn't have a newline, we
>> don't add one _until_ we output another line after it, in which case
>> we backfill the missing newline.
> 
> I don't know enough about sed to say anything useful about this part
> ... I pretty much just stick to s// .


I literally learned sed by writing a better busybox sed implementation
back in 2003, because when I tried to build Linux From Scratch with
busybox commands in the $PATH, the first package's autoconf broke badly:

http://lists.busybox.net/pipermail/busybox/2003-September/043403.html

I had no _idea_ what I was doing, just a lot of failing test cases. I
suspect that was the last time I read a GNU "info" page for anything.
(That or writing "sort" a year or two later. These days I don't bother
with that, if it's not in the man page it doesn't matter. If some script
out there uses it, we'll notice when we get the bug report and decide then.)

That said, have you read the toybox sed --help? I tried REALLY HARD to
cover everything thoroughly and concisely.

> Though I am wondering about the general toybox philosophy of
> compatibility.  I think some compatibility is essential, but extreme
> compatibility is often at odds with the goal of simple code.  I would
> define extreme compatibility as "bug for bug" compatibility.

Agreed.

The reason I was going through all the weird tests last message was to
figure out what the correct behavior _is_. This is usually much harder
than implementing it. :)

I have some rules of thumb like expecting every --long-opt to have a
corresponding single character short option, so whenever there's a
--long-opt without a short opt in a man page that means it's probably
useless crap and has a much higher bar to clear to convince me to
implement it. But mostly, I implement what looks good on the first pass
and then wait for people to complain, then figure out what to do then.

The problem to solve here is that Debian is currently broken when using
toybox. That's a real problem that needs to be fixed, but there's more
than one way to fix it. Debian introduced this bug within the past few
years and maybe we can convince them to fix it on their end, that's one
possible fix.

But for the moment they need a different behavior than we provide. If I
can figure out a way to alter the toybox logic in a couple lines to do
what they need, great. If it turns out to be intrusive and horrible and
fiddly and not make any sense and be a SPECIAL CASE (this code is shared
by c, i, and a, but what does c\ with no follow-up mean)... if I try to
implement it and don't like it then maybe I revisit making puppy eyes at
Debian to fix their stuff instead of changing toybox.

So my first question is, is this an _extension_, or is it a _bug_?
(Hence probing the behavior, which looks really bug-like.)

> I've looked through most of the docs, which generally are very lucid
> about the project goals, but didn't see anything explicitly about
> compatibility.  (e.g. http://landley.net/toybox/design.html)

It's case by case. It's really hard to make blanket calls in isolation.

There's a little in the about page, maybe a little in the roadmap... I
see the design page has "features" and misses this entirely. :)

> I know it's basically a mixture of POSIX, other standards, common
> practice like GNU coreutils, and Aboriginal Linux use cases.  I see
> that Aboriginal has a sources/patches directory, so I wonder where you
> draw the line at where to patch vs. where to re-implement bugs.

Those patches generally aren't to build with toybox, they're toolchain
patches because I'm building 2016 packages with a 2007 toolchain (I
don't ever ship GPLv3 code unless paid to do so by an employer, I
consider it one of the nastier and more litigation-prone proprietary
licenses), kernel patches because they broke mips and arm again, uClibc
patches because that thing's last release was 2012 (weaning the project
off it, switching to musl, but not all architectures are supported by
musl yet)...

When I hit a toybox issue, I mostly fix toybox.

Also, even if I patch stuff when _I_ build it, I want _other_ people
to be able to build the vanilla versions, without needing my patches. If
buildroot or gentoo or something has toybox commands in the $PATH during
its build, I want them to _work_.

If I can get an upstream package to accept a patch, that's a good fix.
Assuming that's the _only_ package that has that issue, which I'm never
sure of.

Generally I implement the set of features I'm familiar with (and/or the
standard or man page prominently requires), and then wait for somebody
to complain. When I find a user of other corner cases in the wild, that
counts as somebody complaining (the author of that package).

Sometimes I respond with "ew, no" even when evaluating the standard.
(I'm not implementing EBCDIC in dd. It's just not happening.) But again,
case by case basis, judgement calls, and I can be argued out of my
position. (Elliott's done that several times. He represents the final
word for Android, and has a billion seats behind him. I did NOT want to
open the selinux can of worms, but... it's in there now.)

> I noticed the other recent example of making a change for the Perl build.

Comparing Debian to the sed fix for Perl: I don't expect perl to ever
get fixed, and even if it did people continue to build old versions of
perl because new ones are bigger and more complicated, and are they the
_only_ people who use that behavior? Adding the filter wasn't that big
and seemed reasonable to do.

(Also, perl is used by 90% of the build environments out there, although
it should build natively on target and NOT have to cross compile. If
Debian doesn't work for a while there's another dozen major distros.)

When I add this sort of corner case, I want to document it, preferably
in the help text. I want it to be part of the spec for the toybox
version of this command. If it's not something I want to document, it's
probably not something I should implement. (That said, I haven't always
_done_ so, because writing good documentation is an enormous timesink
and I get distracted by todo item du jour. And documentation has the
problem that "concise is better" because people have to READ it and it
bloats the binary. In this case sed didn't already explain how to regex
(there's a section on _using_ regexes but it assumes you know regex
syntax from elsewhere, and referring to "man 7 regex" is awkward when we
don't have a man command and a URL to the posix regex page is long and
hard to wordwrap). The toybox help is _mostly_ otherwise self
contained. I dunno, maybe grep --help should explain regex syntax?)

But anyway, that's part of my judgement on whether or not to implement
extensions and corner cases. Is this documentable cleanly?

I note that Rich Felker takes 100% the opposite approach with musl,
pushing patches into other packages, even for little stuff, to the point
where he doesn't provide a _MUSL_ #ifdef people can use to identify his
libc. (Of course I add one in aboriginal linux when building musl. :)

> Given what I know, I would classify implementing 'c\' and 'a\' as
> bug-for-bug compatibility -- that seems like truly an accidental
> implementation quirk rather than a useful extension or generalization
> of existing behavior.

Yes. But somebody is using it to do something specific. It would be nice
if they stopped, and if people want to send patches to debian to that
effect I'm all for it. But are they the ONLY user of this "trick"?

At the moment, what I can fix right now is toybox. If I can find
something that qualifies as a "fix".

The EASY thing I can do is remove the line checking for and reporting
the error, in which case the code falls through and treats it as having
a blank line after it, and thus the "a\" adds a blank line after every
match. Which is why I was doing all those tests to see what the other
one did and going "that's disgusting".

By the way, guess what the version I wrote for busybox many moons ago is
doing?

  $ echo -en "one\ntwo" | busybox sed 'a\'
  one

  two

(With a single newline after the two.)

That's another point of reference: busybox has 15 years of testing in
the "sheep across the minefield" sense (a largeish chunk of it done by
me way back when). What can _they_ get away with not doing?

In this case the specific test case that was reported was "$a\" which
adds a newline to the LAST line. So I think that fix is good enough for
debian: we might add an extra newline to the last line (because we
_add_a_line_ and that triggers the "there wasn't a newline on the
previous line so put one before the next line" logic in emit()) but that
doesn't seem like it would hurt anything? Then again, busybox _isn't_
adding an extra line and I'm curious why?
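
The backfill rule described above can be sketched roughly as follows.
This is a Python model for illustration, not toybox's C: the Emitter
class, its pending_newline flag, and run_append_empty_on_last() are
hypothetical names. Whether the appended line is emitted with its own
newline is an assumption, so the sketch leaves it as a parameter.

```python
import io


class Emitter:
    """Toy model of a sed-style output stage with newline backfill.

    Hypothetical sketch: names and structure are assumptions, not
    toybox's actual emit().
    """

    def __init__(self):
        self.out = io.StringIO()
        self.pending_newline = False  # last emitted line had no newline

    def emit(self, text, has_newline):
        # Backfill the newline the previous line was missing, but only
        # now that another line is actually being output after it.
        if self.pending_newline:
            self.out.write("\n")
        self.out.write(text)
        if has_newline:
            self.out.write("\n")
        self.pending_newline = not has_newline


def run_append_empty_on_last(data, appended_has_newline):
    """Model a $a command whose appended text is an empty line."""
    em = Emitter()
    lines = data.splitlines(keepends=True)
    for i, line in enumerate(lines):
        em.emit(line.rstrip("\n"), line.endswith("\n"))
        if i == len(lines) - 1:
            # The append fires after the last input line only.
            em.emit("", appended_has_newline)
    return em.out.getvalue()


if __name__ == "__main__":
    print(repr(run_append_empty_on_last("one\ntwo", True)))
    print(repr(run_append_empty_on_last("one\ntwo", False)))
```

In this model, "one\ntwo" comes out as "one\ntwo\n\n" when the appended
line carries its own newline, and as "one\ntwo\n" when it does not. The
second variant is consistent with the single trailing newline in the
busybox output above, and it leaves input that already ends in a
newline unchanged.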

I also needed to look at the code to make sure that sed "a" still gives
an error (which ubuntu does and busybox _doesn't_, it treats a and a\
the same), and decide how I wanted that to behave. Another fun thing is
that sed 'a' is _already_ extended, saying "sed 'a stuff'" on the same
line isn't in posix. Posix REQUIRES the \ and info on the next line. And
you can tell this was a gnu extension because arbitrary whitespace
between "a" and "stuff" gets discarded, so how do you add a line with
whitespace at the beginning? (Answer in my case: it strips _one_ kind of
leading space, so give it a tab to eat if you want leading spaces and
give it a space to eat if you want leading tabs. That way you at least
_can_ get leading spaces in your output, although I see I didn't
document it and probably don't have a test for it. I wonder if something
broke when I tried it and I reverted it? Hmmm...)
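
One reading of "strips _one_ kind of leading space" is sketched below
in Python. It is an interpretation of the sentence above, not toybox
code: strip_one_kind() is a hypothetical helper, and the assumption is
that whichever whitespace character comes first (space or tab) is the
only kind that gets eaten.

```python
def strip_one_kind(arg):
    """Drop the leading run of a single whitespace kind (space OR tab).

    Hypothetical helper: the first character decides which kind is
    stripped, so the other kind survives as leading whitespace.
    """
    if arg[:1] in (" ", "\t"):
        return arg.lstrip(arg[0])
    return arg


if __name__ == "__main__":
    # A tab is eaten, so the two spaces after it are kept.
    print(repr(strip_one_kind("\t  stuff")))
    # A space is eaten, so the tab after it is kept.
    print(repr(strip_one_kind(" \tstuff")))
```

Under that assumption, the text after "a" would be written as a tab
followed by spaces to get leading spaces in the output, or as a space
followed by tabs to get leading tabs.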

Of course it would be nice if posix mentioned _any_ of this stuff, but
they're totally Jorged so who cares what they think.

  http://landley.net/notes-2016.html#11-03-2016

> There are some things you can guess the
> semantics of, even if they're not documented, but this doesn't seem
> like one of those things.

I can test the semantics by running probes and seeing what output (or
error messages) I get. I've done that a LOT...

But in this case, making "sed -e 'a\'" be a synonym for "sed -e 'a\' -e
''" is probably the correct answer. It's the simplest thing to code
(actually removes one line), fixes the problem at hand, and it looks
like busybox got away with doing that. Then I wait for somebody to
complain. :)

It's "close enough".
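
The proposed rule can be modelled in a few lines of Python. This is a
sketch under stated assumptions, not the toybox parser: append_text()
and its allow_bare_backslash flag are hypothetical, the address (such
as "$") is assumed to be parsed off already, and each list element
stands for one -e argument.

```python
def append_text(fragments, allow_bare_backslash=True):
    """Return the text for an 'a' command given as -e fragments.

    Hypothetical model: fragments[0] is the command with any address
    already removed, later elements are the following -e arguments.
    """
    cmd, rest = fragments[0], fragments[1:]
    if not cmd.startswith("a"):
        raise ValueError("not an append command")
    body = cmd[1:]
    if body == "\\":
        # a\ : the text is supposed to come from the next fragment.
        if rest:
            return rest[0]
        if allow_bare_backslash:
            return ""  # behave as if -e '' had followed
        raise ValueError("a\\ with no text")
    if body.strip(" \t") == "":
        # Bare 'a' with nothing after it stays an error.
        raise ValueError("a with no text")
    # One-line form ('a stuff'): leading whitespace is discarded here.
    return body.lstrip(" \t")


if __name__ == "__main__":
    print(repr(append_text(["a\\"])))
    print(repr(append_text(["a\\", ""])))
    print(repr(append_text(["a stuff"])))
```

With the flag on, a lone a\ and a\ followed by an empty fragment give
the same empty text, while a bare "a" still raises. Turning the flag
off models the current behavior of reporting an error.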

> FWIW there are some other sed implementations listed here, mainly on
> commercial Unixes:
> 
> http://sed.sourceforge.net/sedfaq7.html

Which I do not have access to and doubt anybody's regression tested
anything against this decade.

Those don't matter. BSD might if it wanted to run under qemu and build
_any_ package I'd tried. Unfortunately BSD has this inward-looking
portage thing where the whole of userspace is merged into a single
source control repository and the entire OS builds from that and you
need the developers' permission to use a different implementation of ssh
and trying to build external packages you downloaded from elsewhere is a
strange foreign concept that's just not how they do it...

If you want to know why Linux beat BSD, it's because Linux is modular
commodity interchangeable parts available from multiple sources with
multiple compatible implementations of stuff as fundamental as the C
library (like whitebox PCs), and BSD is a giant monolithic entity with
gatekeepers saying what is and isn't ok (like Macintosh hardware) and
you have to fork the entire OS (kernel, libc and all) if you want to swap
dropbear for openssh.

I can make my own linux distribution the same way I could assemble my
own PC from parts. It's not a big deal, I can do one just for me without
starting a large organization to do it, nobody else needs to care and I
don't need permission. Over in BSD, if you want to customize stuff you
join a big organization and convince them to add an extra configuration
option for your use case. That's why Net/Free/Open each maintain their
own kernels: if you fork you have to fork EVERYTHING, that's how their
build is designed.

(P.S. If debian didn't build with bsd's sed, would either debian or bsd
care? If not, bsd sed's behavior does not give a good indication of what
a Linux sed's behavior should be.)

> Andy

Rob
_______________________________________________
Toybox mailing list
[email protected]
http://lists.landley.net/listinfo.cgi/toybox-landley.net
