Re: [Toybox] pathological case in sed s///g

2019-05-06 Thread enh via Toybox
From: Rich Felker 
Date: Mon, May 6, 2019 at 7:40 PM
To: Rob Landley
Cc: toybox

> On Mon, May 06, 2019 at 07:05:46PM -0500, Rob Landley wrote:
> > On 5/6/19 12:48 PM, Rich Felker wrote:
> > > On Mon, May 06, 2019 at 12:42:44PM -0500, Rob Landley wrote:
> > >> Huh... I'll assume REG_STARTEND works in bionic since you're pointing me 
> > >> at it.
> > >> It's not in the regex man page but it _is_ in the glibc headers...
> > >>
> > >>   https://github.com/bminor/glibc/commit/6fefb4e0b16
> > >>
> > >> Looks like it went into glibc in 2004, which is way past 7 years. I 
> > >> should poke
> > >> Michael Kerrisk to update the man page.
> > >>
> > >> And musl explicitly refused to do it, but Rich makes bad calls all the 
> > >> time:
> > >>
> > >>   https://www.openwall.com/lists/musl/2013/01/15/26
> > >>
> > >> I still haven't got an #ifdef __MUSL__ in portability because Rich 
> > >> insists his
> > >> libc is the only perfect piece of software ever written, but I can probe 
> > >> for it
> > >
> > > You can #ifdef REG_STARTEND and use it conditionally depending on
> > > whether the functionality is offered. There is no reason to hard-code
> > > an assumption that musl doesn't or does have this functionality;
> >
> > I did (https://github.com/landley/toybox/commit/48162c4ee3fb) but it's 
> > still a
> > musl-specific workaround. Glibc, uClibc, and bionic all adopted this about 
> > 15
> > years ago, and it came from freebsd in the first place which presumably 
> > covers
> > macosx (I can't currently check, but somebody would poke me before too 
> > long).
>
> Regardless of your opinion on whether or why musl does or doesn't have
> this extension, testing for it with #ifdef REG_STARTEND rather than
> "testing for musl" is the right approach, because if/when it's added
> to musl, you automatically get support for it. If you hard-coded an
> assumption that musl doesn't have it, then users are stuck with that
> even if it changes.
>
> > Musl is the only thing left that _doesn't_ have this functonality, and 
> > that's
> > because
>   ~~~
>
> I don't think it's particularly "because" of this. Rather, the pros
> and cons of adding this extension have been under discussion for a
> while, with nobody pushing for further action on the topic.
>
> > you dismissed NUL matching as not a "meaningful use of sed", and called
> > it "hideous hacks" and "nobody will notice the difference with it missing". 
> > (I
> > implemented a regexec0() wrapper to provide NUL matching 3 years before this
> > performance issue came up.)
>
> Use on non-text data is outside the scope of the standard sed utility,
> and I'm a bit hostile to insistence that the standard utilities be
> useful with non-text because of what a disaster Austin Group issue 663
> turned into. However this is really not a consideration in whether
> REG_STARTEND is included going forward.
>
> > > it's
> > > been proposed and there's even a patch somewhere. (It's not costly to
> > > support relative to the current bad regex implementation, but the
> > > concern is that it might impose a nontrivial cost on a future good
> > > implementation once we're locked into having it.)
> >
> > We have different project philosophies. I wait for actual users to show up
> > before implementing a lot of things, but when they do I don't tend to think 
> > I
> > know better than they do about what their use cases should be.
>
> If I accepted arbitrary requests for stuff in musl, it would just be a
> clone of glibc. As a community we have developed pretty clear and
> consistent criteria for what does or doesn't get included. I'm not
> aware of any indication that REG_STARTEND shouldn't be included except
> possible future runtime cost; it doesn't have any major complexity
> cost or incompatibility between different historical implementations
> AFAIK.
>
> > $ echo -e 'hello\0blah' | ./sed 's/blah/boing/' | hd
> >   68 65 6c 6c 6f 00 62 6f  69 6e 67 0a  |hello.boing.|
> > 000c
> >
> > The freebsd man page from last time
> > (https://www.freebsd.org/cgi/man.cgi?query=regex=3=freebsd-release-ports)
> > says that REG_PEND is explicitly for specifying the length of a string
> > containing NULs (without moving the starting point like REG_STARTEND can), 
> > so
> > the plumbing in between already has to support embedded NULs. The man page 
> > could
> > be clearer, hopefully Michael Kerrisk's will be.
> >
> > > whether it treats the start/end as *additional* constraints,
> >
> > Not in the systems I've been able to test. (I need to get my freebsd test 
> > image
> > set back up on the new machine...)
> >
> > (And hey, congrats on musl not doing the strlen() before matching without 
> > this.
> > You don't have the N^2 performance issue glibc and bionic did, and thus 
> > don't
> > need that reason this patch went in. But I didn't keep the previous 
> > "implement
> > match after NULL" code that worked without it just for musl because I don't 
> > care
> > enough about 

Re: [Toybox] pathological case in sed s///g

2019-05-06 Thread Rich Felker
On Mon, May 06, 2019 at 07:05:46PM -0500, Rob Landley wrote:
> On 5/6/19 12:48 PM, Rich Felker wrote:
> > On Mon, May 06, 2019 at 12:42:44PM -0500, Rob Landley wrote:
> >> Huh... I'll assume REG_STARTEND works in bionic since you're pointing me 
> >> at it.
> >> It's not in the regex man page but it _is_ in the glibc headers...
> >>
> >>   https://github.com/bminor/glibc/commit/6fefb4e0b16
> >>
> >> Looks like it went into glibc in 2004, which is way past 7 years. I should 
> >> poke
> >> Michael Kerrisk to update the man page.
> >>
> >> And musl explicitly refused to do it, but Rich makes bad calls all the 
> >> time:
> >>
> >>   https://www.openwall.com/lists/musl/2013/01/15/26
> >>
> >> I still haven't got an #ifdef __MUSL__ in portability because Rich insists 
> >> his
> >> libc is the only perfect piece of software ever written, but I can probe 
> >> for it
> > 
> > You can #ifdef REG_STARTEND and use it conditionally depending on
> > whether the functionality is offered. There is no reason to hard-code
> > an assumption that musl doesn't or does have this functionality;
> 
> I did (https://github.com/landley/toybox/commit/48162c4ee3fb) but it's still a
> musl-specific workaround. Glibc, uClibc, and bionic all adopted this about 15
> years ago, and it came from freebsd in the first place which presumably covers
> macosx (I can't currently check, but somebody would poke me before too long).

Regardless of your opinion on whether or why musl does or doesn't have
this extension, testing for it with #ifdef REG_STARTEND rather than
"testing for musl" is the right approach, because if/when it's added
to musl, you automatically get support for it. If you hard-coded an
assumption that musl doesn't have it, then users are stuck with that
even if it changes.

> Musl is the only thing left that _doesn't_ have this functonality, and that's
> because
  ~~~

I don't think it's particularly "because" of this. Rather, the pros
and cons of adding this extension have been under discussion for a
while, with nobody pushing for further action on the topic.

> you dismissed NUL matching as not a "meaningful use of sed", and called
> it "hideous hacks" and "nobody will notice the difference with it missing". (I
> implemented a regexec0() wrapper to provide NUL matching 3 years before this
> performance issue came up.)

Use on non-text data is outside the scope of the standard sed utility,
and I'm a bit hostile to insistence that the standard utilities be
useful with non-text because of what a disaster Austin Group issue 663
turned into. However this is really not a consideration in whether
REG_STARTEND is included going forward.

> > it's
> > been proposed and there's even a patch somewhere. (It's not costly to
> > support relative to the current bad regex implementation, but the
> > concern is that it might impose a nontrivial cost on a future good
> > implementation once we're locked into having it.)
> 
> We have different project philosophies. I wait for actual users to show up
> before implementing a lot of things, but when they do I don't tend to think I
> know better than they do about what their use cases should be.

If I accepted arbitrary requests for stuff in musl, it would just be a
clone of glibc. As a community we have developed pretty clear and
consistent criteria for what does or doesn't get included. I'm not
aware of any indication that REG_STARTEND shouldn't be included except
possible future runtime cost; it doesn't have any major complexity
cost or incompatibility between different historical implementations
AFAIK.

> $ echo -e 'hello\0blah' | ./sed 's/blah/boing/' | hd
>   68 65 6c 6c 6f 00 62 6f  69 6e 67 0a  |hello.boing.|
> 000c
> 
> The freebsd man page from last time
> (https://www.freebsd.org/cgi/man.cgi?query=regex=3=freebsd-release-ports)
> says that REG_PEND is explicitly for specifying the length of a string
> containing NULs (without moving the starting point like REG_STARTEND can), so
> the plumbing in between already has to support embedded NULs. The man page 
> could
> be clearer, hopefully Michael Kerrisk's will be.
> 
> > whether it treats the start/end as *additional* constraints,
> 
> Not in the systems I've been able to test. (I need to get my freebsd test 
> image
> set back up on the new machine...)
> 
> (And hey, congrats on musl not doing the strlen() before matching without 
> this.
> You don't have the N^2 performance issue glibc and bionic did, and thus don't
> need that reason this patch went in. But I didn't keep the previous "implement
> match after NULL" code that worked without it just for musl because I don't 
> care
> enough about working around musl's unique and intentional limitations anymore.
> It can stay broken for all I care. I'm not quite to the point of removing
> existing fixes like the sched_get_priority() syscall wrapping and the nommu
> config symbol, but only because bionic doesn't support nommu yet, and I'm not
> _quite_ 

Re: [Toybox] pathological case in sed s///g

2019-05-06 Thread Rob Landley
On 5/6/19 12:48 PM, Rich Felker wrote:
> On Mon, May 06, 2019 at 12:42:44PM -0500, Rob Landley wrote:
>> Huh... I'll assume REG_STARTEND works in bionic since you're pointing me at 
>> it.
>> It's not in the regex man page but it _is_ in the glibc headers...
>>
>>   https://github.com/bminor/glibc/commit/6fefb4e0b16
>>
>> Looks like it went into glibc in 2004, which is way past 7 years. I should 
>> poke
>> Michael Kerrisk to update the man page.
>>
>> And musl explicitly refused to do it, but Rich makes bad calls all the time:
>>
>>   https://www.openwall.com/lists/musl/2013/01/15/26
>>
>> I still haven't got an #ifdef __MUSL__ in portability because Rich insists 
>> his
>> libc is the only perfect piece of software ever written, but I can probe for 
>> it
> 
> You can #ifdef REG_STARTEND and use it conditionally depending on
> whether the functionality is offered. There is no reason to hard-code
> an assumption that musl doesn't or does have this functionality;

I did (https://github.com/landley/toybox/commit/48162c4ee3fb) but it's still a
musl-specific workaround. Glibc, uClibc, and bionic all adopted this about 15
years ago, and it came from freebsd in the first place which presumably covers
macosx (I can't currently check, but somebody would poke me before too long).

Musl is the only thing left that _doesn't_ have this functonality, and that's
because you dismissed NUL matching as not a "meaningful use of sed", and called
it "hideous hacks" and "nobody will notice the difference with it missing". (I
implemented a regexec0() wrapper to provide NUL matching 3 years before this
performance issue came up.)

> it's
> been proposed and there's even a patch somewhere. (It's not costly to
> support relative to the current bad regex implementation, but the
> concern is that it might impose a nontrivial cost on a future good
> implementation once we're locked into having it.)

We have different project philosophies. I wait for actual users to show up
before implementing a lot of things, but when they do I don't tend to think I
know better than they do about what their use cases should be.

I'll push back a bit or negotiate to make sure they're serious and this is
important to them and that I understand what they actually need. Sometimes I ask
a second user to speak up to make sure the first isn't an outlier. and I'll
often offer alternatives before doing what they asked to see if an easier thing
for me to do will meet their needs (and then wait for somebody else to show up
needing the full thing)...

But I implement what the users need my software to _do_ because that's what this
software is _for_. I do NOT tell my users that they're wrong because I disagree
with what they're trying to run. I grumble about it a lot while doing it, but I
don't pretend I know what my users should be doing better than they do. My
bright line boundary is "this command is out of scope, it should not be in
toybox at all, use a different implementation". Because if they say "this
command you have needs to do this" and mine can't do it, using a different
implementation is what they'll do anyway. When it comes to the privilege of
being the first command in the $PATH with that name, there can be only one.

(There is no _way_ I wouldn't have ripped out tar --to-command during the
cleanup otherwise. But it's part of somebody's use case, existing scripts, oh
well. Lots of extra work to get it to work with nommu, but I got there
eventually...)

I've told you a bunch of times that you're wrong about nommu support. The
founder of uclinux has told you a bunch of times you're wrong about nommu
support. But you insist that you know better than your users, don't even offer a
--nommu-probeable build time option, and that everyone else must rewrite their
software to be worthy to use your library.

I cc'd you on this issue because it seemed impolite to talk about it behind your
back, but I stopped trying to argue with you about this sort of thing after
https://github.com/landley/toybox/commit/833fb23fe8b4 because there's no point.

>> If this _does_ match NUL bytes it lets me remove regexec0() entirely from 
>> libc,
>> which would be very nice. And since it started life as a BSD extension 
>> (they're
>> the only man page _documenting_ it) I get support on FreeBSD and probably 
>> MacOS too.
> 
> I'm not sure if the proposed patch supports matching NUL or not (i.e.

It's not proposed, it's committed, and it worked when I tested it with glibc in
devuan's toolchain:

$ echo -e "hello\0there" | toybox sed 's/there/blah/'
helloblah

And with bionic in the Android NDK:

$ echo -e 'hello\0blah' | ./sed 's/blah/boing/' | hd
  68 65 6c 6c 6f 00 62 6f  69 6e 67 0a  |hello.boing.|
000c

The freebsd man page from last time
(https://www.freebsd.org/cgi/man.cgi?query=regex=3=freebsd-release-ports)
says that REG_PEND is explicitly for specifying the length of a string
containing NULs (without moving the starting point like REG_STARTEND 

Re: [Toybox] [PATCH] sed.test: remove accidental newlines.

2019-05-06 Thread Rob Landley
On 5/6/19 10:40 AM, enh via Toybox wrote:
> ---
>  tests/sed.test | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/tests/sed.test b/tests/sed.test
> index 6b27fff8..04603a21 100755
> --- a/tests/sed.test
> +++ b/tests/sed.test
> @@ -169,11 +169,11 @@ testing '-z' 'sed -z "s/\n/-/g"' "a-b-c" "" "a\nb\nc"
> 
>  # toybox handling of empty capturing groups broke minjail. Check that we
>  # correctly replace an empty capturing group with the empty string:
> -testing '\n with empty capture' \
> +testing '\N with empty capture' \
>  'sed -E "s/(ARM_)?(NR_)([a-z]*) (.*)/\1\2\3/"' "NR_read" "" "NR_read foo"
>  # ...but also that we report an error for a backreference to a group that
>  # isn't in the pattern:
> -testing '\n too high' \
> +testing '\N too high' \
>  'sed -E "s/(.*)/\2/p" 2>/dev/null || echo OK' "OK\n" "" "foo"
> 
>  # -i with $ last line test

Um, I'm not seeing these as newlines? Hmmm... scripts/runtest.sh function
testing()...

  NAME="$CMDNAME $1"
  echo "$SHOWPASS: $NAME"

So your echo is defaulting to -e (and debian's isn't), which is why you needed
-E. Except we _just_ added -E to toybox and I'm not quite comfortable trusting
it to be there on the host yet...

Ah, printf is in aosp/prebuilts/build-tools/linux-x86/bin/toybox (which took 14
hours to download through my cable modem, but is now on my laptop). Does
switching to that fix it?

Rob
___
Toybox mailing list
Toybox@lists.landley.net
http://lists.landley.net/listinfo.cgi/toybox-landley.net


Re: [Toybox] pathological case in sed s///g

2019-05-06 Thread Rich Felker
On Mon, May 06, 2019 at 12:42:44PM -0500, Rob Landley wrote:
> Huh... I'll assume REG_STARTEND works in bionic since you're pointing me at 
> it.
> It's not in the regex man page but it _is_ in the glibc headers...
> 
>   https://github.com/bminor/glibc/commit/6fefb4e0b16
> 
> Looks like it went into glibc in 2004, which is way past 7 years. I should 
> poke
> Michael Kerrisk to update the man page.
> 
> And musl explicitly refused to do it, but Rich makes bad calls all the time:
> 
>   https://www.openwall.com/lists/musl/2013/01/15/26
> 
> I still haven't got an #ifdef __MUSL__ in portability because Rich insists his
> libc is the only perfect piece of software ever written, but I can probe for 
> it

You can #ifdef REG_STARTEND and use it conditionally depending on
whether the functionality is offered. There is no reason to hard-code
an assumption that musl doesn't or does have this functionality; it's
been proposed and there's even a patch somewhere. (It's not costly to
support relative to the current bad regex implementation, but the
concern is that it might impose a nontrivial cost on a future good
implementation once we're locked into having it.)

> If this _does_ match NUL bytes it lets me remove regexec0() entirely from 
> libc,
> which would be very nice. And since it started life as a BSD extension 
> (they're
> the only man page _documenting_ it) I get support on FreeBSD and probably 
> MacOS too.

I'm not sure if the proposed patch supports matching NUL or not (i.e.
whether it treats the start/end as *additional* constraints, to work
with a substring of a necessarily null-terminated string, or whether
they replace the null-termination criterion. If we do adopt it we
should ensure we do whatever other implementations do here, especially
if that's important to use cases you or others want.

Rich
___
Toybox mailing list
Toybox@lists.landley.net
http://lists.landley.net/listinfo.cgi/toybox-landley.net


Re: [Toybox] pathological case in sed s///g

2019-05-06 Thread Rob Landley
Huh... I'll assume REG_STARTEND works in bionic since you're pointing me at it.
It's not in the regex man page but it _is_ in the glibc headers...

  https://github.com/bminor/glibc/commit/6fefb4e0b16

Looks like it went into glibc in 2004, which is way past 7 years. I should poke
Michael Kerrisk to update the man page.

And musl explicitly refused to do it, but Rich makes bad calls all the time:

  https://www.openwall.com/lists/musl/2013/01/15/26

I still haven't got an #ifdef __MUSL__ in portability because Rich insists his
libc is the only perfect piece of software ever written, but I can probe for it
at compile time and #define 0 to stub out. (Unlike needing a manual config
symbol to work around his nommu insanity where he went out of his WAY to break
compile time probes, or needing to wrap syscalls myself in
https://github.com/landley/toybox/commit/833fb23fe8b4  because he decided he
didn't like a set of syscalls and _removed_ support musl had working fine at one
point...)

If this _does_ match NUL bytes it lets me remove regexec0() entirely from libc,
which would be very nice. And since it started life as a BSD extension (they're
the only man page _documenting_ it) I get support on FreeBSD and probably MacOS 
too.

Cool, thanks,

Rob

On 5/6/19 10:56 AM, enh wrote:
> i think you might be able to use REG_STARTEND to avoid the implicit
> strlen: 
> https://www.freebsd.org/cgi/man.cgi?query=regex=3=freebsd-release-ports
> 
> REG_STARTEND
>The string is considered to start at string +
>pmatch[0].rm_so and to end before the byte located at
>string + pmatch[0].rm_eo, regardless of the value of
>nmatch.  See below for the definition of pmatch and nmatch.
>This is an extension, compatible with but not specified by
>IEEE Std 1003.2 (``POSIX.2''), and should be used with cau-
>tion in software intended to be portable to other systems.
>
>Without REG_NOTBOL, the position rm_so is considered the
>beginning of a line, such that `^' matches before it, and
>the beginning of a word if there is a word character at
>this position, such that `[[:<:]]' and `\<' match before
>it.
> 
>With REG_NOTBOL, the character at position rm_so is treated
>as the continuation of a line, and if rm_so is greater than
>0, the preceding character is taken into consideration.  If
>the preceding character is a newline and the regular
>expression was compiled with REG_NEWLINE, `^' matches
>before the string; if the preceding character is not a word
>character but the string starts with a word character,
>`[[:<:]]' and `\<' match before the string.
> 
> From: Rob Landley 
> Date: Sun, May 5, 2019 at 9:41 AM
> To: enh
> Cc: toybox
> 
>> On 5/3/19 2:42 PM, enh wrote:
>>> On Fri, May 3, 2019 at 11:59 AM Rob Landley  wrote:

 On 5/3/19 1:56 PM, Rob Landley wrote:
> On 5/3/19 1:05 PM, enh wrote:
> But yeah, the new pessimal case after the change I'm making now would be a
> megabyte of xyxyxyxy with 's/xy/x/g' _THAT_ would need the one output 
> buffer
> thing...

 And it can be avoided by an in-place copy that remembers "last gap" and 
 doesn't
 copy trailing data until the _next_ hit (or we confirm there's no hit) so 
 we
 know how much of the skipped data is "confirmed" and only copy it once we 
 know
 we're keeping it.
>>>
>>> yeah, that's what i meant by only ensuring that the first n characters
>>> are right.
>>>
 So every kept byte is only copied once, and it can still be done in place 
 for
 the shrinking case. (You'll have to realloc when you expand, but that's 
 probably
 unavoidable...)
>>
>> Which I just did (not even trying in-place, everything gets copied once to a 
>> new
>> string), and it's still way too slow.
>>
>> Time to dig out https://jvns.ca/perf-zine.pdf and see where it's spending its
>> time... Sigh. I forgot quite how much I hate perf, but:
>>
>>   68.60%  sed  sed[.] regexec0
>>   29.93%  sed  [kernel.kallsyms]  [k] nmi
>>1.47%  sed  libc-2.24.so   [.] strlen
>>0.00%  sed  [kernel.kallsyms]  [k] __acct_update_integrals
>>
>> It's regexec0(). H... ah, my NUL wrapper is essentially calling strlen() 
>> on
>> the string before doing the regex each time (to figure out where the null to
>> skip to next is), that should go _after_ it tries to match this segment...
>>
>> Ok, and now that I've fixed that it's _still_ taking forever and strlen() is
>> dominating, but regexec0() is basically calling regexec() as the first thing 
>> and
>> returning immediately when it hits? The only thing that could be calling
>> strlen() is libc's regexec()? Hmmm...
>>
>> @@ -473,11 +473,15 @@ static void sed_line(char **pline, long plen)
>>  mlen, off, newlen;
>>
>>// Loop finding match in remaining line (up to remaining len)
>> -  while (!regexec0(reg, rline, len-(rline-line), 10, match, mflags)) {
>> +//  while 

Re: [Toybox] pathological case in sed s///g

2019-05-06 Thread enh via Toybox
i think you might be able to use REG_STARTEND to avoid the implicit
strlen: 
https://www.freebsd.org/cgi/man.cgi?query=regex=3=freebsd-release-ports

REG_STARTEND
   The string is considered to start at string +
   pmatch[0].rm_so and to end before the byte located at
   string + pmatch[0].rm_eo, regardless of the value of
   nmatch.  See below for the definition of pmatch and nmatch.
   This is an extension, compatible with but not specified by
   IEEE Std 1003.2 (``POSIX.2''), and should be used with cau-
   tion in software intended to be portable to other systems.

   Without REG_NOTBOL, the position rm_so is considered the
   beginning of a line, such that `^' matches before it, and
   the beginning of a word if there is a word character at
   this position, such that `[[:<:]]' and `\<' match before
   it.

   With REG_NOTBOL, the character at position rm_so is treated
   as the continuation of a line, and if rm_so is greater than
   0, the preceding character is taken into consideration.  If
   the preceding character is a newline and the regular
   expression was compiled with REG_NEWLINE, `^' matches
   before the string; if the preceding character is not a word
   character but the string starts with a word character,
   `[[:<:]]' and `\<' match before the string.

From: Rob Landley 
Date: Sun, May 5, 2019 at 9:41 AM
To: enh
Cc: toybox

> On 5/3/19 2:42 PM, enh wrote:
> > On Fri, May 3, 2019 at 11:59 AM Rob Landley  wrote:
> >>
> >> On 5/3/19 1:56 PM, Rob Landley wrote:
> >>> On 5/3/19 1:05 PM, enh wrote:
> >>> But yeah, the new pessimal case after the change I'm making now would be a
> >>> megabyte of xyxyxyxy with 's/xy/x/g' _THAT_ would need the one output 
> >>> buffer
> >>> thing...
> >>
> >> And it can be avoided by an in-place copy that remembers "last gap" and 
> >> doesn't
> >> copy trailing data until the _next_ hit (or we confirm there's no hit) so 
> >> we
> >> know how much of the skipped data is "confirmed" and only copy it once we 
> >> know
> >> we're keeping it.
> >
> > yeah, that's what i meant by only ensuring that the first n characters
> > are right.
> >
> >> So every kept byte is only copied once, and it can still be done in place 
> >> for
> >> the shrinking case. (You'll have to realloc when you expand, but that's 
> >> probably
> >> unavoidable...)
>
> Which I just did (not even trying in-place, everything gets copied once to a 
> new
> string), and it's still way too slow.
>
> Time to dig out https://jvns.ca/perf-zine.pdf and see where it's spending its
> time... Sigh. I forgot quite how much I hate perf, but:
>
>   68.60%  sed  sed[.] regexec0
>   29.93%  sed  [kernel.kallsyms]  [k] nmi
>1.47%  sed  libc-2.24.so   [.] strlen
>0.00%  sed  [kernel.kallsyms]  [k] __acct_update_integrals
>
> It's regexec0(). H... ah, my NUL wrapper is essentially calling strlen() 
> on
> the string before doing the regex each time (to figure out where the null to
> skip to next is), that should go _after_ it tries to match this segment...
>
> Ok, and now that I've fixed that it's _still_ taking forever and strlen() is
> dominating, but regexec0() is basically calling regexec() as the first thing 
> and
> returning immediately when it hits? The only thing that could be calling
> strlen() is libc's regexec()? Hmmm...
>
> @@ -473,11 +473,15 @@ static void sed_line(char **pline, long plen)
>  mlen, off, newlen;
>
>// Loop finding match in remaining line (up to remaining len)
> -  while (!regexec0(reg, rline, len-(rline-line), 10, match, mflags)) {
> +//  while (!regexec0(reg, rline, len-(rline-line), 10, match, mflags)) {
> +while (!strncmp(rline, "x", 1)) {
> +  match[0].rm_eo = 1;
> +  match[1].rm_so = 0;
> +
>
> And _that_ completes a megabyte input line basically instantly. It's regexec()
> calling strlen() each time pulling (n^2)/2 bytes (half a terabyte?) through 
> the
> l1 cache that's taking up all the time now.
>
> Grrr. So I either have to write my own regex plumbing (which I can do, I've 
> done
> it before years ago, but I really didn't _want_ to here because LIBC HAS ONE),
> or I need to make my wrapper special case literal matches, which means they
> don't have... let's see... (For the honor of greyskull "man 7
> regex"...)
>
>   ^|.[*+?(\{$
>
> Sigh. If I _was_ going to write my own regex plumbing I could make it take a
> length and properly match NUL bytes. On the one hand that's really a post-1.0
> todo item. On the other hand we're talking maybe 200 lines of code here, _and_
> it could probably speed up grep (I could go back to searching for multiple
> regexes in parallel... What _is_ involved in implementing lex, anyway...)
>
> Rob
___
Toybox mailing list
Toybox@lists.landley.net
http://lists.landley.net/listinfo.cgi/toybox-landley.net


[Toybox] [PATCH] sed.test: remove accidental newlines.

2019-05-06 Thread enh via Toybox
---
 tests/sed.test | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tests/sed.test b/tests/sed.test
index 6b27fff8..04603a21 100755
--- a/tests/sed.test
+++ b/tests/sed.test
@@ -169,11 +169,11 @@ testing '-z' 'sed -z "s/\n/-/g"' "a-b-c" "" "a\nb\nc"

 # toybox handling of empty capturing groups broke minjail. Check that we
 # correctly replace an empty capturing group with the empty string:
-testing '\n with empty capture' \
+testing '\N with empty capture' \
 'sed -E "s/(ARM_)?(NR_)([a-z]*) (.*)/\1\2\3/"' "NR_read" "" "NR_read foo"
 # ...but also that we report an error for a backreference to a group that
 # isn't in the pattern:
-testing '\n too high' \
+testing '\N too high' \
 'sed -E "s/(.*)/\2/p" 2>/dev/null || echo OK' "OK\n" "" "foo"

 # -i with $ last line test
-- 
2.21.0.1020.gf2820cf01a-goog
From d9cd2d40136c4e082e534d9e92771d574ec1652f Mon Sep 17 00:00:00 2001
From: Elliott Hughes 
Date: Mon, 6 May 2019 08:39:20 -0700
Subject: [PATCH] sed.test: remove accidental newlines.

---
 tests/sed.test | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tests/sed.test b/tests/sed.test
index 6b27fff8..04603a21 100755
--- a/tests/sed.test
+++ b/tests/sed.test
@@ -169,11 +169,11 @@ testing '-z' 'sed -z "s/\n/-/g"' "a-b-c" "" "a\nb\nc"
 
 # toybox handling of empty capturing groups broke minjail. Check that we
 # correctly replace an empty capturing group with the empty string:
-testing '\n with empty capture' \
+testing '\N with empty capture' \
 'sed -E "s/(ARM_)?(NR_)([a-z]*) (.*)/\1\2\3/"' "NR_read" "" "NR_read foo"
 # ...but also that we report an error for a backreference to a group that
 # isn't in the pattern:
-testing '\n too high' \
+testing '\N too high' \
 'sed -E "s/(.*)/\2/p" 2>/dev/null || echo OK' "OK\n" "" "foo"
 
 # -i with $ last line test
-- 
2.21.0.1020.gf2820cf01a-goog

___
Toybox mailing list
Toybox@lists.landley.net
http://lists.landley.net/listinfo.cgi/toybox-landley.net