Re: [Toybox] pathological case in sed s///g
I emailed Michael Kerrisk to update the man page to mention it... Rob On 5/8/19 7:52 PM, enh wrote: > (i checked and macOS does have both REG_STARTEND and REG_PEND.) > > From: Rob Landley > Date: Mon, May 6, 2019 at 10:42 AM > To: enh > Cc: toybox, Rich Felker > >> Huh... I'll assume REG_STARTEND works in bionic since you're pointing me at >> it. >> It's not in the regex man page but it _is_ in the glibc headers... >> >> https://github.com/bminor/glibc/commit/6fefb4e0b16 >> >> Looks like it went into glibc in 2004, which is way past 7 years. I should >> poke >> Michael Kerrisk to update the man page. >> >> And musl explicitly refused to do it, but Rich makes bad calls all the time: >> >> https://www.openwall.com/lists/musl/2013/01/15/26 >> >> I still haven't got an #ifdef __MUSL__ in portability because Rich insists >> his >> libc is the only perfect piece of software ever written, but I can probe for >> it >> at compile time and #define 0 to stub out. (Unlike needing a manual config >> symbol to work around his nommu insanity where he went out of his WAY to >> break >> compile time probes, or needing to wrap syscalls myself in >> https://github.com/landley/toybox/commit/833fb23fe8b4 because he decided he >> didn't like a set of syscalls and _removed_ support musl had working fine at >> one >> point...) >> >> If this _does_ match NUL bytes it lets me remove regexec0() entirely from >> libc, >> which would be very nice. And since it started life as a BSD extension >> (they're >> the only man page _documenting_ it) I get support on FreeBSD and probably >> MacOS too. >> >> Cool, thanks, >> >> Rob >> >> On 5/6/19 10:56 AM, enh wrote: >>> i think you might be able to use REG_STARTEND to avoid the implicit >>> strlen: >>> https://www.freebsd.org/cgi/man.cgi?query=regex=3=freebsd-release-ports >>> >>> REG_STARTEND >>>The string is considered to start at string + >>>pmatch[0].rm_so and to end before the byte located at >>>string + pmatch[0].rm_eo, regardless of the value of >>>nmatch. See below for the definition of pmatch and nmatch. >>>This is an extension, compatible with but not specified by >>>IEEE Std 1003.2 (``POSIX.2''), and should be used with cau- >>>tion in software intended to be portable to other systems. >>> >>>Without REG_NOTBOL, the position rm_so is considered the >>>beginning of a line, such that `^' matches before it, and >>>the beginning of a word if there is a word character at >>>this position, such that `[[:<:]]' and `\<' match before >>>it. >>> >>>With REG_NOTBOL, the character at position rm_so is treated >>>as the continuation of a line, and if rm_so is greater than >>>0, the preceding character is taken into consideration. If >>>the preceding character is a newline and the regular >>>expression was compiled with REG_NEWLINE, `^' matches >>>before the string; if the preceding character is not a word >>>character but the string starts with a word character, >>>`[[:<:]]' and `\<' match before the string. >>> >>> From: Rob Landley >>> Date: Sun, May 5, 2019 at 9:41 AM >>> To: enh >>> Cc: toybox >>> On 5/3/19 2:42 PM, enh wrote: > On Fri, May 3, 2019 at 11:59 AM Rob Landley wrote: >> >> On 5/3/19 1:56 PM, Rob Landley wrote: >>> On 5/3/19 1:05 PM, enh wrote: >>> But yeah, the new pessimal case after the change I'm making now would >>> be a >>> megabyte of xyxyxyxy with 's/xy/x/g' _THAT_ would need the one output >>> buffer >>> thing... >> >> And it can be avoided by an in-place copy that remembers "last gap" and >> doesn't >> copy trailing data until the _next_ hit (or we confirm there's no hit) >> so we >> know how much of the skipped data is "confirmed" and only copy it once >> we know >> we're keeping it. > > yeah, that's what i meant by only ensuring that the first n characters > are right. > >> So every kept byte is only copied once, and it can still be done in >> place for >> the shrinking case. (You'll have to realloc when you expand, but that's >> probably >> unavoidable...) Which I just did (not even trying in-place, everything gets copied once to a new string), and it's still way too slow. Time to dig out https://jvns.ca/perf-zine.pdf and see where it's spending its time... Sigh. I forgot quite how much I hate perf, but: 68.60% sed sed[.] regexec0 29.93% sed [kernel.kallsyms] [k] nmi 1.47% sed libc-2.24.so [.] strlen 0.00% sed [kernel.kallsyms] [k] __acct_update_integrals It's regexec0(). H... ah, my NUL wrapper is essentially calling strlen() on the string before doing the regex each time (to figure out where the null to skip to next is), that should go _after_ it tries to match this segment... Ok,
Re: [Toybox] pathological case in sed s///g
(i checked and macOS does have both REG_STARTEND and REG_PEND.) From: Rob Landley Date: Mon, May 6, 2019 at 10:42 AM To: enh Cc: toybox, Rich Felker > Huh... I'll assume REG_STARTEND works in bionic since you're pointing me at > it. > It's not in the regex man page but it _is_ in the glibc headers... > > https://github.com/bminor/glibc/commit/6fefb4e0b16 > > Looks like it went into glibc in 2004, which is way past 7 years. I should > poke > Michael Kerrisk to update the man page. > > And musl explicitly refused to do it, but Rich makes bad calls all the time: > > https://www.openwall.com/lists/musl/2013/01/15/26 > > I still haven't got an #ifdef __MUSL__ in portability because Rich insists his > libc is the only perfect piece of software ever written, but I can probe for > it > at compile time and #define 0 to stub out. (Unlike needing a manual config > symbol to work around his nommu insanity where he went out of his WAY to break > compile time probes, or needing to wrap syscalls myself in > https://github.com/landley/toybox/commit/833fb23fe8b4 because he decided he > didn't like a set of syscalls and _removed_ support musl had working fine at > one > point...) > > If this _does_ match NUL bytes it lets me remove regexec0() entirely from > libc, > which would be very nice. And since it started life as a BSD extension > (they're > the only man page _documenting_ it) I get support on FreeBSD and probably > MacOS too. > > Cool, thanks, > > Rob > > On 5/6/19 10:56 AM, enh wrote: > > i think you might be able to use REG_STARTEND to avoid the implicit > > strlen: > > https://www.freebsd.org/cgi/man.cgi?query=regex=3=freebsd-release-ports > > > > REG_STARTEND > >The string is considered to start at string + > >pmatch[0].rm_so and to end before the byte located at > >string + pmatch[0].rm_eo, regardless of the value of > >nmatch. See below for the definition of pmatch and nmatch. > >This is an extension, compatible with but not specified by > >IEEE Std 1003.2 (``POSIX.2''), and should be used with cau- > >tion in software intended to be portable to other systems. > > > >Without REG_NOTBOL, the position rm_so is considered the > >beginning of a line, such that `^' matches before it, and > >the beginning of a word if there is a word character at > >this position, such that `[[:<:]]' and `\<' match before > >it. > > > >With REG_NOTBOL, the character at position rm_so is treated > >as the continuation of a line, and if rm_so is greater than > >0, the preceding character is taken into consideration. If > >the preceding character is a newline and the regular > >expression was compiled with REG_NEWLINE, `^' matches > >before the string; if the preceding character is not a word > >character but the string starts with a word character, > >`[[:<:]]' and `\<' match before the string. > > > > From: Rob Landley > > Date: Sun, May 5, 2019 at 9:41 AM > > To: enh > > Cc: toybox > > > >> On 5/3/19 2:42 PM, enh wrote: > >>> On Fri, May 3, 2019 at 11:59 AM Rob Landley wrote: > > On 5/3/19 1:56 PM, Rob Landley wrote: > > On 5/3/19 1:05 PM, enh wrote: > > But yeah, the new pessimal case after the change I'm making now would > > be a > > megabyte of xyxyxyxy with 's/xy/x/g' _THAT_ would need the one output > > buffer > > thing... > > And it can be avoided by an in-place copy that remembers "last gap" and > doesn't > copy trailing data until the _next_ hit (or we confirm there's no hit) > so we > know how much of the skipped data is "confirmed" and only copy it once > we know > we're keeping it. > >>> > >>> yeah, that's what i meant by only ensuring that the first n characters > >>> are right. > >>> > So every kept byte is only copied once, and it can still be done in > place for > the shrinking case. (You'll have to realloc when you expand, but that's > probably > unavoidable...) > >> > >> Which I just did (not even trying in-place, everything gets copied once to > >> a new > >> string), and it's still way too slow. > >> > >> Time to dig out https://jvns.ca/perf-zine.pdf and see where it's spending > >> its > >> time... Sigh. I forgot quite how much I hate perf, but: > >> > >> 68.60% sed sed[.] regexec0 > >> 29.93% sed [kernel.kallsyms] [k] nmi > >>1.47% sed libc-2.24.so [.] strlen > >>0.00% sed [kernel.kallsyms] [k] __acct_update_integrals > >> > >> It's regexec0(). H... ah, my NUL wrapper is essentially calling > >> strlen() on > >> the string before doing the regex each time (to figure out where the null > >> to > >> skip to next is), that should go _after_ it tries to match this segment... > >> > >> Ok, and now that I've fixed that it's _still_ taking forever and strlen() > >> is > >> dominating, but regexec0() is basically calling regexec() as the first > >>
Re: [Toybox] pathological case in sed s///g
On Mon, May 06, 2019 at 10:57:23PM -0700, enh wrote: > > I know not everything is a macro that can be tested; both myself and > > folks from other implementations are interested in developing an > > agreed-upon way to report availability of other extensions via macros, > > so that configure-style compile/link tests are not needed. This would > > also facilitate reporting of some things that can't even properly be > > tested because they would need runtime tests. Unfortunately efforts in > > this area are stalled from lack of interest. If you or others would > > like to raise a fuss about making it happen, I'd welcome it. > > wouldn't it be better to have more stuff along the lines of clang's > __has_builtin/__has_feature/__has_include? they're already very useful > for getting rid of this kind of stuff, and a __has_function would > cover most of what's missing. It probably doesn't, and it's hard to figure out who would be responsible for providing them. Are you suggesting the compiler would do it automatically based on presence of a declaration? There's already an evil hack to check for presence of a declaration and branch on the result using _Generic, but of course that's after the preprocessor level. OTOH the preprocessor doesn't and can't have access to the presence or absence of declarations since everything is just tokens at that stage. I think it would be a lot simpler, more flexible, and compatible with existing compilers (probably a requirement on our side) to simply have the library define macros for the individual extensions or extension groups it provides -- basically the same thing as the existing POSIX macros in unistd.h, but for extensions outside of POSIX. One major advantage here is that we can also report extensions that are not in the form of symbols or flags, but some other type of interface -- for example, support for %m in the printf family. As a side benefit, this would probably require us to look at, enumerate, and better document the extensions we support. :-) Rich ___ Toybox mailing list Toybox@lists.landley.net http://lists.landley.net/listinfo.cgi/toybox-landley.net
Re: [Toybox] pathological case in sed s///g
On Tue, May 07, 2019 at 05:27:24PM +0300, makep...@firemail.cc wrote: > > > wouldn't it be better to have more stuff along the lines of clang's > > __has_builtin/__has_feature/__has_include? they're already very useful > > for getting rid of this kind of stuff, and a __has_function would > > cover most of what's missing. > > Feature claims of clang and musl are giving us a conflict when > building cmake btw, could anyone look at > https://github.com/gentoo/musl/issues/141 please? I don't know > whether to forward this to clang or musl or cmake or some other > list, thus glad you brought up the topic. Just replied on the tracker: "musl uses a BSD style header layout, where the system (libc) header location must come before the compiler's header location in the search path; it does not use and is not compatible with the compiler-provided versions of the standard headers, which should never get included." Thanks for the heads-up. Rich ___ Toybox mailing list Toybox@lists.landley.net http://lists.landley.net/listinfo.cgi/toybox-landley.net
Re: [Toybox] pathological case in sed s///g
> wouldn't it be better to have more stuff along the lines of clang's > __has_builtin/__has_feature/__has_include? they're already very useful > for getting rid of this kind of stuff, and a __has_function would > cover most of what's missing. Feature claims of clang and musl are giving us a conflict when building cmake btw, could anyone look at https://github.com/gentoo/musl/issues/141 please? I don't know whether to forward this to clang or musl or cmake or some other list, thus glad you brought up the topic. ___ Toybox mailing list Toybox@lists.landley.net http://lists.landley.net/listinfo.cgi/toybox-landley.net
Re: [Toybox] pathological case in sed s///g
From: Rich Felker Date: Mon, May 6, 2019 at 7:40 PM To: Rob Landley Cc: toybox > On Mon, May 06, 2019 at 07:05:46PM -0500, Rob Landley wrote: > > On 5/6/19 12:48 PM, Rich Felker wrote: > > > On Mon, May 06, 2019 at 12:42:44PM -0500, Rob Landley wrote: > > >> Huh... I'll assume REG_STARTEND works in bionic since you're pointing me > > >> at it. > > >> It's not in the regex man page but it _is_ in the glibc headers... > > >> > > >> https://github.com/bminor/glibc/commit/6fefb4e0b16 > > >> > > >> Looks like it went into glibc in 2004, which is way past 7 years. I > > >> should poke > > >> Michael Kerrisk to update the man page. > > >> > > >> And musl explicitly refused to do it, but Rich makes bad calls all the > > >> time: > > >> > > >> https://www.openwall.com/lists/musl/2013/01/15/26 > > >> > > >> I still haven't got an #ifdef __MUSL__ in portability because Rich > > >> insists his > > >> libc is the only perfect piece of software ever written, but I can probe > > >> for it > > > > > > You can #ifdef REG_STARTEND and use it conditionally depending on > > > whether the functionality is offered. There is no reason to hard-code > > > an assumption that musl doesn't or does have this functionality; > > > > I did (https://github.com/landley/toybox/commit/48162c4ee3fb) but it's > > still a > > musl-specific workaround. Glibc, uClibc, and bionic all adopted this about > > 15 > > years ago, and it came from freebsd in the first place which presumably > > covers > > macosx (I can't currently check, but somebody would poke me before too > > long). > > Regardless of your opinion on whether or why musl does or doesn't have > this extension, testing for it with #ifdef REG_STARTEND rather than > "testing for musl" is the right approach, because if/when it's added > to musl, you automatically get support for it. If you hard-coded an > assumption that musl doesn't have it, then users are stuck with that > even if it changes. > > > Musl is the only thing left that _doesn't_ have this functonality, and > > that's > > because > ~~~ > > I don't think it's particularly "because" of this. Rather, the pros > and cons of adding this extension have been under discussion for a > while, with nobody pushing for further action on the topic. > > > you dismissed NUL matching as not a "meaningful use of sed", and called > > it "hideous hacks" and "nobody will notice the difference with it missing". > > (I > > implemented a regexec0() wrapper to provide NUL matching 3 years before this > > performance issue came up.) > > Use on non-text data is outside the scope of the standard sed utility, > and I'm a bit hostile to insistence that the standard utilities be > useful with non-text because of what a disaster Austin Group issue 663 > turned into. However this is really not a consideration in whether > REG_STARTEND is included going forward. > > > > it's > > > been proposed and there's even a patch somewhere. (It's not costly to > > > support relative to the current bad regex implementation, but the > > > concern is that it might impose a nontrivial cost on a future good > > > implementation once we're locked into having it.) > > > > We have different project philosophies. I wait for actual users to show up > > before implementing a lot of things, but when they do I don't tend to think > > I > > know better than they do about what their use cases should be. > > If I accepted arbitrary requests for stuff in musl, it would just be a > clone of glibc. As a community we have developed pretty clear and > consistent criteria for what does or doesn't get included. I'm not > aware of any indication that REG_STARTEND shouldn't be included except > possible future runtime cost; it doesn't have any major complexity > cost or incompatibility between different historical implementations > AFAIK. > > > $ echo -e 'hello\0blah' | ./sed 's/blah/boing/' | hd > > 68 65 6c 6c 6f 00 62 6f 69 6e 67 0a |hello.boing.| > > 000c > > > > The freebsd man page from last time > > (https://www.freebsd.org/cgi/man.cgi?query=regex=3=freebsd-release-ports) > > says that REG_PEND is explicitly for specifying the length of a string > > containing NULs (without moving the starting point like REG_STARTEND can), > > so > > the plumbing in between already has to support embedded NULs. The man page > > could > > be clearer, hopefully Michael Kerrisk's will be. > > > > > whether it treats the start/end as *additional* constraints, > > > > Not in the systems I've been able to test. (I need to get my freebsd test > > image > > set back up on the new machine...) > > > > (And hey, congrats on musl not doing the strlen() before matching without > > this. > > You don't have the N^2 performance issue glibc and bionic did, and thus > > don't > > need that reason this patch went in. But I didn't keep the previous > > "implement > > match after NULL" code that worked without it just for musl because I don't > > care > > enough about
Re: [Toybox] pathological case in sed s///g
On Mon, May 06, 2019 at 07:05:46PM -0500, Rob Landley wrote: > On 5/6/19 12:48 PM, Rich Felker wrote: > > On Mon, May 06, 2019 at 12:42:44PM -0500, Rob Landley wrote: > >> Huh... I'll assume REG_STARTEND works in bionic since you're pointing me > >> at it. > >> It's not in the regex man page but it _is_ in the glibc headers... > >> > >> https://github.com/bminor/glibc/commit/6fefb4e0b16 > >> > >> Looks like it went into glibc in 2004, which is way past 7 years. I should > >> poke > >> Michael Kerrisk to update the man page. > >> > >> And musl explicitly refused to do it, but Rich makes bad calls all the > >> time: > >> > >> https://www.openwall.com/lists/musl/2013/01/15/26 > >> > >> I still haven't got an #ifdef __MUSL__ in portability because Rich insists > >> his > >> libc is the only perfect piece of software ever written, but I can probe > >> for it > > > > You can #ifdef REG_STARTEND and use it conditionally depending on > > whether the functionality is offered. There is no reason to hard-code > > an assumption that musl doesn't or does have this functionality; > > I did (https://github.com/landley/toybox/commit/48162c4ee3fb) but it's still a > musl-specific workaround. Glibc, uClibc, and bionic all adopted this about 15 > years ago, and it came from freebsd in the first place which presumably covers > macosx (I can't currently check, but somebody would poke me before too long). Regardless of your opinion on whether or why musl does or doesn't have this extension, testing for it with #ifdef REG_STARTEND rather than "testing for musl" is the right approach, because if/when it's added to musl, you automatically get support for it. If you hard-coded an assumption that musl doesn't have it, then users are stuck with that even if it changes. > Musl is the only thing left that _doesn't_ have this functonality, and that's > because ~~~ I don't think it's particularly "because" of this. Rather, the pros and cons of adding this extension have been under discussion for a while, with nobody pushing for further action on the topic. > you dismissed NUL matching as not a "meaningful use of sed", and called > it "hideous hacks" and "nobody will notice the difference with it missing". (I > implemented a regexec0() wrapper to provide NUL matching 3 years before this > performance issue came up.) Use on non-text data is outside the scope of the standard sed utility, and I'm a bit hostile to insistence that the standard utilities be useful with non-text because of what a disaster Austin Group issue 663 turned into. However this is really not a consideration in whether REG_STARTEND is included going forward. > > it's > > been proposed and there's even a patch somewhere. (It's not costly to > > support relative to the current bad regex implementation, but the > > concern is that it might impose a nontrivial cost on a future good > > implementation once we're locked into having it.) > > We have different project philosophies. I wait for actual users to show up > before implementing a lot of things, but when they do I don't tend to think I > know better than they do about what their use cases should be. If I accepted arbitrary requests for stuff in musl, it would just be a clone of glibc. As a community we have developed pretty clear and consistent criteria for what does or doesn't get included. I'm not aware of any indication that REG_STARTEND shouldn't be included except possible future runtime cost; it doesn't have any major complexity cost or incompatibility between different historical implementations AFAIK. > $ echo -e 'hello\0blah' | ./sed 's/blah/boing/' | hd > 68 65 6c 6c 6f 00 62 6f 69 6e 67 0a |hello.boing.| > 000c > > The freebsd man page from last time > (https://www.freebsd.org/cgi/man.cgi?query=regex=3=freebsd-release-ports) > says that REG_PEND is explicitly for specifying the length of a string > containing NULs (without moving the starting point like REG_STARTEND can), so > the plumbing in between already has to support embedded NULs. The man page > could > be clearer, hopefully Michael Kerrisk's will be. > > > whether it treats the start/end as *additional* constraints, > > Not in the systems I've been able to test. (I need to get my freebsd test > image > set back up on the new machine...) > > (And hey, congrats on musl not doing the strlen() before matching without > this. > You don't have the N^2 performance issue glibc and bionic did, and thus don't > need that reason this patch went in. But I didn't keep the previous "implement > match after NULL" code that worked without it just for musl because I don't > care > enough about working around musl's unique and intentional limitations anymore. > It can stay broken for all I care. I'm not quite to the point of removing > existing fixes like the sched_get_priority() syscall wrapping and the nommu > config symbol, but only because bionic doesn't support nommu yet, and I'm not > _quite_
Re: [Toybox] pathological case in sed s///g
On 5/6/19 12:48 PM, Rich Felker wrote: > On Mon, May 06, 2019 at 12:42:44PM -0500, Rob Landley wrote: >> Huh... I'll assume REG_STARTEND works in bionic since you're pointing me at >> it. >> It's not in the regex man page but it _is_ in the glibc headers... >> >> https://github.com/bminor/glibc/commit/6fefb4e0b16 >> >> Looks like it went into glibc in 2004, which is way past 7 years. I should >> poke >> Michael Kerrisk to update the man page. >> >> And musl explicitly refused to do it, but Rich makes bad calls all the time: >> >> https://www.openwall.com/lists/musl/2013/01/15/26 >> >> I still haven't got an #ifdef __MUSL__ in portability because Rich insists >> his >> libc is the only perfect piece of software ever written, but I can probe for >> it > > You can #ifdef REG_STARTEND and use it conditionally depending on > whether the functionality is offered. There is no reason to hard-code > an assumption that musl doesn't or does have this functionality; I did (https://github.com/landley/toybox/commit/48162c4ee3fb) but it's still a musl-specific workaround. Glibc, uClibc, and bionic all adopted this about 15 years ago, and it came from freebsd in the first place which presumably covers macosx (I can't currently check, but somebody would poke me before too long). Musl is the only thing left that _doesn't_ have this functonality, and that's because you dismissed NUL matching as not a "meaningful use of sed", and called it "hideous hacks" and "nobody will notice the difference with it missing". (I implemented a regexec0() wrapper to provide NUL matching 3 years before this performance issue came up.) > it's > been proposed and there's even a patch somewhere. (It's not costly to > support relative to the current bad regex implementation, but the > concern is that it might impose a nontrivial cost on a future good > implementation once we're locked into having it.) We have different project philosophies. I wait for actual users to show up before implementing a lot of things, but when they do I don't tend to think I know better than they do about what their use cases should be. I'll push back a bit or negotiate to make sure they're serious and this is important to them and that I understand what they actually need. Sometimes I ask a second user to speak up to make sure the first isn't an outlier. and I'll often offer alternatives before doing what they asked to see if an easier thing for me to do will meet their needs (and then wait for somebody else to show up needing the full thing)... But I implement what the users need my software to _do_ because that's what this software is _for_. I do NOT tell my users that they're wrong because I disagree with what they're trying to run. I grumble about it a lot while doing it, but I don't pretend I know what my users should be doing better than they do. My bright line boundary is "this command is out of scope, it should not be in toybox at all, use a different implementation". Because if they say "this command you have needs to do this" and mine can't do it, using a different implementation is what they'll do anyway. When it comes to the privilege of being the first command in the $PATH with that name, there can be only one. (There is no _way_ I wouldn't have ripped out tar --to-command during the cleanup otherwise. But it's part of somebody's use case, existing scripts, oh well. Lots of extra work to get it to work with nommu, but I got there eventually...) I've told you a bunch of times that you're wrong about nommu support. The founder of uclinux has told you a bunch of times you're wrong about nommu support. But you insist that you know better than your users, don't even offer a --nommu-probeable build time option, and that everyone else must rewrite their software to be worthy to use your library. I cc'd you on this issue because it seemed impolite to talk about it behind your back, but I stopped trying to argue with you about this sort of thing after https://github.com/landley/toybox/commit/833fb23fe8b4 because there's no point. >> If this _does_ match NUL bytes it lets me remove regexec0() entirely from >> libc, >> which would be very nice. And since it started life as a BSD extension >> (they're >> the only man page _documenting_ it) I get support on FreeBSD and probably >> MacOS too. > > I'm not sure if the proposed patch supports matching NUL or not (i.e. It's not proposed, it's committed, and it worked when I tested it with glibc in devuan's toolchain: $ echo -e "hello\0there" | toybox sed 's/there/blah/' helloblah And with bionic in the Android NDK: $ echo -e 'hello\0blah' | ./sed 's/blah/boing/' | hd 68 65 6c 6c 6f 00 62 6f 69 6e 67 0a |hello.boing.| 000c The freebsd man page from last time (https://www.freebsd.org/cgi/man.cgi?query=regex=3=freebsd-release-ports) says that REG_PEND is explicitly for specifying the length of a string containing NULs (without moving the starting point like REG_STARTEND
Re: [Toybox] pathological case in sed s///g
On Mon, May 06, 2019 at 12:42:44PM -0500, Rob Landley wrote: > Huh... I'll assume REG_STARTEND works in bionic since you're pointing me at > it. > It's not in the regex man page but it _is_ in the glibc headers... > > https://github.com/bminor/glibc/commit/6fefb4e0b16 > > Looks like it went into glibc in 2004, which is way past 7 years. I should > poke > Michael Kerrisk to update the man page. > > And musl explicitly refused to do it, but Rich makes bad calls all the time: > > https://www.openwall.com/lists/musl/2013/01/15/26 > > I still haven't got an #ifdef __MUSL__ in portability because Rich insists his > libc is the only perfect piece of software ever written, but I can probe for > it You can #ifdef REG_STARTEND and use it conditionally depending on whether the functionality is offered. There is no reason to hard-code an assumption that musl doesn't or does have this functionality; it's been proposed and there's even a patch somewhere. (It's not costly to support relative to the current bad regex implementation, but the concern is that it might impose a nontrivial cost on a future good implementation once we're locked into having it.) > If this _does_ match NUL bytes it lets me remove regexec0() entirely from > libc, > which would be very nice. And since it started life as a BSD extension > (they're > the only man page _documenting_ it) I get support on FreeBSD and probably > MacOS too. I'm not sure if the proposed patch supports matching NUL or not (i.e. whether it treats the start/end as *additional* constraints, to work with a substring of a necessarily null-terminated string, or whether they replace the null-termination criterion. If we do adopt it we should ensure we do whatever other implementations do here, especially if that's important to use cases you or others want. Rich ___ Toybox mailing list Toybox@lists.landley.net http://lists.landley.net/listinfo.cgi/toybox-landley.net
Re: [Toybox] pathological case in sed s///g
Huh... I'll assume REG_STARTEND works in bionic since you're pointing me at it. It's not in the regex man page but it _is_ in the glibc headers... https://github.com/bminor/glibc/commit/6fefb4e0b16 Looks like it went into glibc in 2004, which is way past 7 years. I should poke Michael Kerrisk to update the man page. And musl explicitly refused to do it, but Rich makes bad calls all the time: https://www.openwall.com/lists/musl/2013/01/15/26 I still haven't got an #ifdef __MUSL__ in portability because Rich insists his libc is the only perfect piece of software ever written, but I can probe for it at compile time and #define 0 to stub out. (Unlike needing a manual config symbol to work around his nommu insanity where he went out of his WAY to break compile time probes, or needing to wrap syscalls myself in https://github.com/landley/toybox/commit/833fb23fe8b4 because he decided he didn't like a set of syscalls and _removed_ support musl had working fine at one point...) If this _does_ match NUL bytes it lets me remove regexec0() entirely from libc, which would be very nice. And since it started life as a BSD extension (they're the only man page _documenting_ it) I get support on FreeBSD and probably MacOS too. Cool, thanks, Rob On 5/6/19 10:56 AM, enh wrote: > i think you might be able to use REG_STARTEND to avoid the implicit > strlen: > https://www.freebsd.org/cgi/man.cgi?query=regex=3=freebsd-release-ports > > REG_STARTEND >The string is considered to start at string + >pmatch[0].rm_so and to end before the byte located at >string + pmatch[0].rm_eo, regardless of the value of >nmatch. See below for the definition of pmatch and nmatch. >This is an extension, compatible with but not specified by >IEEE Std 1003.2 (``POSIX.2''), and should be used with cau- >tion in software intended to be portable to other systems. > >Without REG_NOTBOL, the position rm_so is considered the >beginning of a line, such that `^' matches before it, and >the beginning of a word if there is a word character at >this position, such that `[[:<:]]' and `\<' match before >it. > >With REG_NOTBOL, the character at position rm_so is treated >as the continuation of a line, and if rm_so is greater than >0, the preceding character is taken into consideration. If >the preceding character is a newline and the regular >expression was compiled with REG_NEWLINE, `^' matches >before the string; if the preceding character is not a word >character but the string starts with a word character, >`[[:<:]]' and `\<' match before the string. > > From: Rob Landley > Date: Sun, May 5, 2019 at 9:41 AM > To: enh > Cc: toybox > >> On 5/3/19 2:42 PM, enh wrote: >>> On Fri, May 3, 2019 at 11:59 AM Rob Landley wrote: On 5/3/19 1:56 PM, Rob Landley wrote: > On 5/3/19 1:05 PM, enh wrote: > But yeah, the new pessimal case after the change I'm making now would be a > megabyte of xyxyxyxy with 's/xy/x/g' _THAT_ would need the one output > buffer > thing... And it can be avoided by an in-place copy that remembers "last gap" and doesn't copy trailing data until the _next_ hit (or we confirm there's no hit) so we know how much of the skipped data is "confirmed" and only copy it once we know we're keeping it. >>> >>> yeah, that's what i meant by only ensuring that the first n characters >>> are right. >>> So every kept byte is only copied once, and it can still be done in place for the shrinking case. (You'll have to realloc when you expand, but that's probably unavoidable...) >> >> Which I just did (not even trying in-place, everything gets copied once to a >> new >> string), and it's still way too slow. >> >> Time to dig out https://jvns.ca/perf-zine.pdf and see where it's spending its >> time... Sigh. I forgot quite how much I hate perf, but: >> >> 68.60% sed sed[.] regexec0 >> 29.93% sed [kernel.kallsyms] [k] nmi >>1.47% sed libc-2.24.so [.] strlen >>0.00% sed [kernel.kallsyms] [k] __acct_update_integrals >> >> It's regexec0(). H... ah, my NUL wrapper is essentially calling strlen() >> on >> the string before doing the regex each time (to figure out where the null to >> skip to next is), that should go _after_ it tries to match this segment... >> >> Ok, and now that I've fixed that it's _still_ taking forever and strlen() is >> dominating, but regexec0() is basically calling regexec() as the first thing >> and >> returning immediately when it hits? The only thing that could be calling >> strlen() is libc's regexec()? Hmmm... >> >> @@ -473,11 +473,15 @@ static void sed_line(char **pline, long plen) >> mlen, off, newlen; >> >>// Loop finding match in remaining line (up to remaining len) >> - while (!regexec0(reg, rline, len-(rline-line), 10, match, mflags)) { >> +// while
Re: [Toybox] pathological case in sed s///g
i think you might be able to use REG_STARTEND to avoid the implicit strlen: https://www.freebsd.org/cgi/man.cgi?query=regex=3=freebsd-release-ports REG_STARTEND The string is considered to start at string + pmatch[0].rm_so and to end before the byte located at string + pmatch[0].rm_eo, regardless of the value of nmatch. See below for the definition of pmatch and nmatch. This is an extension, compatible with but not specified by IEEE Std 1003.2 (``POSIX.2''), and should be used with cau- tion in software intended to be portable to other systems. Without REG_NOTBOL, the position rm_so is considered the beginning of a line, such that `^' matches before it, and the beginning of a word if there is a word character at this position, such that `[[:<:]]' and `\<' match before it. With REG_NOTBOL, the character at position rm_so is treated as the continuation of a line, and if rm_so is greater than 0, the preceding character is taken into consideration. If the preceding character is a newline and the regular expression was compiled with REG_NEWLINE, `^' matches before the string; if the preceding character is not a word character but the string starts with a word character, `[[:<:]]' and `\<' match before the string. From: Rob Landley Date: Sun, May 5, 2019 at 9:41 AM To: enh Cc: toybox > On 5/3/19 2:42 PM, enh wrote: > > On Fri, May 3, 2019 at 11:59 AM Rob Landley wrote: > >> > >> On 5/3/19 1:56 PM, Rob Landley wrote: > >>> On 5/3/19 1:05 PM, enh wrote: > >>> But yeah, the new pessimal case after the change I'm making now would be a > >>> megabyte of xyxyxyxy with 's/xy/x/g' _THAT_ would need the one output > >>> buffer > >>> thing... > >> > >> And it can be avoided by an in-place copy that remembers "last gap" and > >> doesn't > >> copy trailing data until the _next_ hit (or we confirm there's no hit) so > >> we > >> know how much of the skipped data is "confirmed" and only copy it once we > >> know > >> we're keeping it. > > > > yeah, that's what i meant by only ensuring that the first n characters > > are right. > > > >> So every kept byte is only copied once, and it can still be done in place > >> for > >> the shrinking case. (You'll have to realloc when you expand, but that's > >> probably > >> unavoidable...) > > Which I just did (not even trying in-place, everything gets copied once to a > new > string), and it's still way too slow. > > Time to dig out https://jvns.ca/perf-zine.pdf and see where it's spending its > time... Sigh. I forgot quite how much I hate perf, but: > > 68.60% sed sed[.] regexec0 > 29.93% sed [kernel.kallsyms] [k] nmi >1.47% sed libc-2.24.so [.] strlen >0.00% sed [kernel.kallsyms] [k] __acct_update_integrals > > It's regexec0(). H... ah, my NUL wrapper is essentially calling strlen() > on > the string before doing the regex each time (to figure out where the null to > skip to next is), that should go _after_ it tries to match this segment... > > Ok, and now that I've fixed that it's _still_ taking forever and strlen() is > dominating, but regexec0() is basically calling regexec() as the first thing > and > returning immediately when it hits? The only thing that could be calling > strlen() is libc's regexec()? Hmmm... > > @@ -473,11 +473,15 @@ static void sed_line(char **pline, long plen) > mlen, off, newlen; > >// Loop finding match in remaining line (up to remaining len) > - while (!regexec0(reg, rline, len-(rline-line), 10, match, mflags)) { > +// while (!regexec0(reg, rline, len-(rline-line), 10, match, mflags)) { > +while (!strncmp(rline, "x", 1)) { > + match[0].rm_eo = 1; > + match[1].rm_so = 0; > + > > And _that_ completes a megabyte input line basically instantly. It's regexec() > calling strlen() each time pulling (n^2)/2 bytes (half a terabyte?) through > the > l1 cache that's taking up all the time now. > > Grrr. So I either have to write my own regex plumbing (which I can do, I've > done > it before years ago, but I really didn't _want_ to here because LIBC HAS ONE), > or I need to make my wrapper special case literal matches, which means they > don't have... let's see... (For the honor of greyskull "man 7 > regex"...) > > ^|.[*+?(\{$ > > Sigh. If I _was_ going to write my own regex plumbing I could make it take a > length and properly match NUL bytes. On the one hand that's really a post-1.0 > todo item. On the other hand we're talking maybe 200 lines of code here, _and_ > it could probably speed up grep (I could go back to searching for multiple > regexes in parallel... What _is_ involved in implementing lex, anyway...) > > Rob ___ Toybox mailing list Toybox@lists.landley.net http://lists.landley.net/listinfo.cgi/toybox-landley.net
Re: [Toybox] pathological case in sed s///g
On 5/3/19 2:42 PM, enh wrote: > On Fri, May 3, 2019 at 11:59 AM Rob Landley wrote: >> >> On 5/3/19 1:56 PM, Rob Landley wrote: >>> On 5/3/19 1:05 PM, enh wrote: >>> But yeah, the new pessimal case after the change I'm making now would be a >>> megabyte of xyxyxyxy with 's/xy/x/g' _THAT_ would need the one output buffer >>> thing... >> >> And it can be avoided by an in-place copy that remembers "last gap" and >> doesn't >> copy trailing data until the _next_ hit (or we confirm there's no hit) so we >> know how much of the skipped data is "confirmed" and only copy it once we >> know >> we're keeping it. > > yeah, that's what i meant by only ensuring that the first n characters > are right. > >> So every kept byte is only copied once, and it can still be done in place for >> the shrinking case. (You'll have to realloc when you expand, but that's >> probably >> unavoidable...) Which I just did (not even trying in-place, everything gets copied once to a new string), and it's still way too slow. Time to dig out https://jvns.ca/perf-zine.pdf and see where it's spending its time... Sigh. I forgot quite how much I hate perf, but: 68.60% sed sed[.] regexec0 29.93% sed [kernel.kallsyms] [k] nmi 1.47% sed libc-2.24.so [.] strlen 0.00% sed [kernel.kallsyms] [k] __acct_update_integrals It's regexec0(). H... ah, my NUL wrapper is essentially calling strlen() on the string before doing the regex each time (to figure out where the null to skip to next is), that should go _after_ it tries to match this segment... Ok, and now that I've fixed that it's _still_ taking forever and strlen() is dominating, but regexec0() is basically calling regexec() as the first thing and returning immediately when it hits? The only thing that could be calling strlen() is libc's regexec()? Hmmm... @@ -473,11 +473,15 @@ static void sed_line(char **pline, long plen) mlen, off, newlen; // Loop finding match in remaining line (up to remaining len) - while (!regexec0(reg, rline, len-(rline-line), 10, match, mflags)) { +// while (!regexec0(reg, rline, len-(rline-line), 10, match, mflags)) { +while (!strncmp(rline, "x", 1)) { + match[0].rm_eo = 1; + match[1].rm_so = 0; + And _that_ completes a megabyte input line basically instantly. It's regexec() calling strlen() each time pulling (n^2)/2 bytes (half a terabyte?) through the l1 cache that's taking up all the time now. Grrr. So I either have to write my own regex plumbing (which I can do, I've done it before years ago, but I really didn't _want_ to here because LIBC HAS ONE), or I need to make my wrapper special case literal matches, which means they don't have... let's see... (For the honor of greyskull "man 7 regex"...) ^|.[*+?(\{$ Sigh. If I _was_ going to write my own regex plumbing I could make it take a length and properly match NUL bytes. On the one hand that's really a post-1.0 todo item. On the other hand we're talking maybe 200 lines of code here, _and_ it could probably speed up grep (I could go back to searching for multiple regexes in parallel... What _is_ involved in implementing lex, anyway...) Rob ___ Toybox mailing list Toybox@lists.landley.net http://lists.landley.net/listinfo.cgi/toybox-landley.net
Re: [Toybox] pathological case in sed s///g
On Fri, May 3, 2019 at 11:59 AM Rob Landley wrote: > > On 5/3/19 1:56 PM, Rob Landley wrote: > > On 5/3/19 1:05 PM, enh wrote: > > But yeah, the new pessimal case after the change I'm making now would be a > > megabyte of xyxyxyxy with 's/xy/x/g' _THAT_ would need the one output buffer > > thing... > > And it can be avoided by an in-place copy that remembers "last gap" and > doesn't > copy trailing data until the _next_ hit (or we confirm there's no hit) so we > know how much of the skipped data is "confirmed" and only copy it once we know > we're keeping it. yeah, that's what i meant by only ensuring that the first n characters are right. > So every kept byte is only copied once, and it can still be done in place for > the shrinking case. (You'll have to realloc when you expand, but that's > probably > unavoidable...) > > Rob ___ Toybox mailing list Toybox@lists.landley.net http://lists.landley.net/listinfo.cgi/toybox-landley.net
Re: [Toybox] pathological case in sed s///g
On 5/3/19 1:05 PM, enh wrote: > BSD seems to avoid all the copying? i think they have a "source" and > "destination" and only write to the latter (solving your issue), but > only move forwards (so i don't think they try to maintain it as "what > the whole result would look like if we only did this many > replacements", rather "here's what the first n characters of the > result will look like"). Oh I know it's solvable. Working on it. The s/x/y/g case A) isn't changing the length of the string, B) has no backrefs. So it doesn't need the copying, yes. Mine only moves forwards too: line and len are the start of the line and the whole line's length, but rline and rlen are the remaining line and remaining length. I just responded to the "don't overwrite backref data while still using it" problem by copying the whole line, when I didn't need to and it opens this pathological case. Hence last email on what I _should_ do instead. (Hmmm... the "one input buffer one output buffer" thing optimizes for the "replace with shorter string" case where you avoid repeated memmove() on the tail of the string. But it slightly pessimizes the cases where you _don't_ and realloc() a lot. It's also a bigger change to the code, lemme see how much this does first...) But yeah, the new pessimal case after the change I'm making now would be a megabyte of xyxyxyxy with 's/xy/x/g' _THAT_ would need the one output buffer thing... Rob ___ Toybox mailing list Toybox@lists.landley.net http://lists.landley.net/listinfo.cgi/toybox-landley.net
Re: [Toybox] pathological case in sed s///g
On 5/3/19 1:56 PM, Rob Landley wrote: > On 5/3/19 1:05 PM, enh wrote: > But yeah, the new pessimal case after the change I'm making now would be a > megabyte of xyxyxyxy with 's/xy/x/g' _THAT_ would need the one output buffer > thing... And it can be avoided by an in-place copy that remembers "last gap" and doesn't copy trailing data until the _next_ hit (or we confirm there's no hit) so we know how much of the skipped data is "confirmed" and only copy it once we know we're keeping it. So every kept byte is only copied once, and it can still be done in place for the shrinking case. (You'll have to realloc when you expand, but that's probably unavoidable...) Rob ___ Toybox mailing list Toybox@lists.landley.net http://lists.landley.net/listinfo.cgi/toybox-landley.net
Re: [Toybox] pathological case in sed s///g
BSD seems to avoid all the copying? i think they have a "source" and "destination" and only write to the latter (solving your issue), but only move forwards (so i don't think they try to maintain it as "what the whole result would look like if we only did this many replacements", rather "here's what the first n characters of the result will look like"). On Fri, May 3, 2019 at 11:02 AM Rob Landley wrote: > > On 5/3/19 12:40 PM, Rob Landley wrote: > > On 5/2/19 9:46 PM, enh via Toybox wrote: > >> i've known about this for a couple of days and haven't had time to > >> look at it properly yet, so i should mention it here... > >> > >> if you have a file with a 1MiB line of 'x'es and you sed 's/x/y/g', > >> BSD or GNU sed finishes immediately, but toybox takes forever. > > > > Because it allocates a replacement string and copies the > > before/replace/after > > parts into it for each replacement. I should switch the xmalloc() on line > > 515 to > > xrealloc() and fiddle with the surrounding code not to do unnecessary > > work... > > And luckily I left myself a comment about why I _didn't_ do that: > > // Allocate new size, copy start/end around match. (Can't extend in > // place because backrefs may refer to text after it's overwritten.) > > I can realloc() if there were no backrefs, or I can make a temp copy of _just_ > the removed text (since all the backrefs have to be within match[0])... > > Hmmm... it's easy to record whether there were backrefs since the code right > above there measures the new length including backrefs, so I can have it set a > flag. The embedded newline support does NOT include matching a NUL (because > that > would involve rewriting the regex library) so there can't be a NUL _within_ > the > match so I can just conditionally xstrndup()... > > It's a bit fiddly, but not too bad so far... > > Rob ___ Toybox mailing list Toybox@lists.landley.net http://lists.landley.net/listinfo.cgi/toybox-landley.net
Re: [Toybox] pathological case in sed s///g
On 5/3/19 12:40 PM, Rob Landley wrote: > On 5/2/19 9:46 PM, enh via Toybox wrote: >> i've known about this for a couple of days and haven't had time to >> look at it properly yet, so i should mention it here... >> >> if you have a file with a 1MiB line of 'x'es and you sed 's/x/y/g', >> BSD or GNU sed finishes immediately, but toybox takes forever. > > Because it allocates a replacement string and copies the before/replace/after > parts into it for each replacement. I should switch the xmalloc() on line 515 > to > xrealloc() and fiddle with the surrounding code not to do unnecessary work... And luckily I left myself a comment about why I _didn't_ do that: // Allocate new size, copy start/end around match. (Can't extend in // place because backrefs may refer to text after it's overwritten.) I can realloc() if there were no backrefs, or I can make a temp copy of _just_ the removed text (since all the backrefs have to be within match[0])... Hmmm... it's easy to record whether there were backrefs since the code right above there measures the new length including backrefs, so I can have it set a flag. The embedded newline support does NOT include matching a NUL (because that would involve rewriting the regex library) so there can't be a NUL _within_ the match so I can just conditionally xstrndup()... It's a bit fiddly, but not too bad so far... Rob ___ Toybox mailing list Toybox@lists.landley.net http://lists.landley.net/listinfo.cgi/toybox-landley.net
Re: [Toybox] pathological case in sed s///g
On 5/2/19 9:46 PM, enh via Toybox wrote: > i've known about this for a couple of days and haven't had time to > look at it properly yet, so i should mention it here... > > if you have a file with a 1MiB line of 'x'es and you sed 's/x/y/g', > BSD or GNU sed finishes immediately, but toybox takes forever. Because it allocates a replacement string and copies the before/replace/after parts into it for each replacement. I should switch the xmalloc() on line 515 to xrealloc() and fiddle with the surrounding code not to do unnecessary work... Rob ___ Toybox mailing list Toybox@lists.landley.net http://lists.landley.net/listinfo.cgi/toybox-landley.net
[Toybox] pathological case in sed s///g
i've known about this for a couple of days and haven't had time to look at it properly yet, so i should mention it here... if you have a file with a 1MiB line of 'x'es and you sed 's/x/y/g', BSD or GNU sed finishes immediately, but toybox takes forever. ___ Toybox mailing list Toybox@lists.landley.net http://lists.landley.net/listinfo.cgi/toybox-landley.net