(I suffered a system failure during the weekend and lost all my browser tabs and consequently lost track of all my work...)
Might be a stupid question, but what's so special about ". in first
position"? Doesn't ANY '.' in ANY position (sans <backslash><dot>) indicate
a regex (not fixed pattern)?

On Tue, Sep 27, 2022 at 4:45 AM enh <[email protected]> wrote:
>
> On Sat, Sep 24, 2022 at 2:31 PM Rob Landley <[email protected]> wrote:
>
>> On 9/15/22 20:42, Rob Landley wrote:
>> > On 9/15/22 16:32, enh wrote:
>> >> ah, but on another box with 2021-10-22 it's broken. so it looks like
>> >> "real" xxd had the same bug and fixed it recently?
>> >
>> > Easy enough to fix, I was just confused by "the one in debian fails
>> > the same way"...
>> >
>> >> > A simple optimization would be to concat all patterns into one
>> >> > huge regex pattern with '|'.
>> >>
>> >> Which you could do in your pattern creation just as easily as grep
>> >> could do so internally? Except when I tried it:
>> >>
>> >> $ time toybox grep -Ef <(tr '\n' '|' < test_large_10000 | head -c -1) \
>> >>   test_large_10000
>> >>
>> >> The result was even _slower_ on glibc (and gave me "out of memory"
>> >> on musl).
>> >>
>> >> fwiw, BSD grep which we used to use before had
>> >> https://android.googlesource.com/platform/system/core/+/kitkat-release/toolbox/grep/fastgrep.c#32
>> >> which (although i haven't tested) i always assumed was for this.
>> >
>> > Yeah, doing more ourselves is the obvious way to go...
>>
>> I punted on most of the utf8 stuff in there. Let me know if I should
>> revisit that. (Right now I'm treating ANY string that has a char > 127
>> in it as "hand off to the regex engine" instead of using the fast path
>> -F logic. In theory I only need to do that for -i, but see recent email
>> about reordering width zero unicode code points... if the regex engine
>> does it, it's not my fault when it's wrong. :)
>
> both users i've seen have wanted this for "filenames in a build", so they
> should be fine. (and, yes, "i18n is someone else's problem" is usually
> the right choice down at this level.)
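The "char > 127 hands off to the regex engine" rule above, plus the "." question, can be sketched as a tiny standalone classifier. This is a guess at the logic, not the actual toybox code (the function name is made up), and the comment about why the first position is special is speculation, not anything confirmed in this thread:

```c
#include <stdbool.h>
#include <string.h>

// HYPOTHETICAL sketch: decide whether a pattern can take the fixed-string
// fast path instead of going to regexec(). Rules as described in the
// thread: any byte > 127 hands off to the regex engine, backslash escapes
// reduce to literals, and a "." in the FIRST position forces the regex
// path -- plausibly (my guess) because the fast path buckets patterns by
// their literal first byte, so a wildcard first byte has no bucket.
// A "." in a later position is assumed to be handled by the fixed matcher
// as an "any byte" wildcard.
static bool is_fixed_pattern(const char *p)
{
  for (int i = 0; p[i]; i++) {
    unsigned char c = p[i];

    if (c > 127) return false;            // utf8 etc: let regexec do it
    if (c == '\\') {                      // escaped char: it's a literal
      if (!p[++i]) return false;          // trailing backslash: punt
      continue;
    }
    if (strchr("[](){}*?+|^$", c)) return false;  // real regex metachar
    if (c == '.' && !i) return false;     // wildcard first byte: no bucket
  }
  return true;
}
```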
>
>
>> >> I could also sort the patterns by first character and loop through
>> >> calling the appropriate regexes based on character at string
>> >> position... which is harder than you think because there could be a |
>> >> later in the pattern so that optimization only applies to a _subset_
>> >> of patterns. (First char being . or * or ( kinda throws that off too,
>> >> and that optimization would also have to be aware of case
>> >> insensitivity...)
>> >
>> > I was trying to avoid writing another regex implementation, but a fast
>> > subset one that does parallel cache-local searching for simple
>> > patterns and hands off to libc for anything complicated isn't that
>> > hard. And I can even cheat with a bitmask for starting byte:
>>
>> What I did was simpler: filter the incoming TT.e list (after it's been
>> split at \n and TT.f entries collated into it, which the code was
>> already doing) into fixed and regex categories (ala autodetect when I
>> can apply -F logic, with basic support for non-unicode case
>> insensitivity, backslash escapes, and "." not in the first position),
>> and then bucket sort the resulting -F list into an array of 256 linked
>> lists by starting character (punting for now that starting with "."
>> means it autodetects as regex not fixed: the actual -F flag overrides
>> this), and then use TT.fixed[c] being NULL to mean nothing can match
>> here. Further optimization: sorting each per-character list by length
>> in the setup phase means the actual search logic can stop at the first
>> hit because it's the longest one.
>>
>> No bitmask needed: https://github.com/landley/toybox/commit/a7e49c3c7860
>>
>> (Then it falls back to running the same old regex logic for any
>> non-fixed patterns, where regexec searches the whole string from start
>> to finish for each pattern.
>> If there aren't any, the loop naturally skips itself traversing zero
>> entries; if there are a small number, the performance hit shouldn't be
>> too bad.)
>>
>> It's still slower than the gnu/dammit grep (0.9 seconds vs 0.3 seconds
>> on my laptop), but hopefully good enough.
>
> heh, yeah, that sounds like plenty :-) i don't know of anyone who's
> actually timing this --- the complaint was just about the pathological
> behavior!
>
>> Try now?
>
> (update in progress. given that yochiang and i are in horribly
> incompatible timezones, we might be a while getting back to you...)
>
>> Rob
>>
>> P.S. I added a speed test to grep.test (5 second timeout, should still
>> pass on the original raspberry pi model A from 2012, not that I have
>> one), using base64 on a long seq to get the many unique test lines
>> instead of xxd /dev/urandom, because urandom is slow at producing
>> nontrivial amounts of data. Trying to feed seq into sha3sum or similar
>> was equally slow: running the test should not be an order of magnitude
>> faster than generating the test data.
>
> certainly fine on my rpi400...
>
> PASS: grep speed
>
> specifically it takes:
>
> real 0m0.609s
> user 0m0.581s
> sys 0m0.177s
>
> that seems like a good enough margin of error for slower rpis :-)
>
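The bucket sort Rob describes above (an array of 256 linked lists indexed by starting byte, each list kept longest-first so the search can stop at the first hit, and a NULL bucket meaning nothing can match here) could look roughly like this. A minimal standalone sketch, not the code from the toybox commit; all names are made up:

```c
#include <stdlib.h>
#include <string.h>

// HYPOTHETICAL sketch of the scheme described in the thread: bucket sort
// fixed patterns into 256 linked lists by first byte, inserting each list
// in descending length order so the first hit is guaranteed to be the
// longest match.
struct fixed_pat {
  struct fixed_pat *next;
  int len;
  const char *str;
};

// NULL bucket == no fixed pattern can match at a byte with this value.
static struct fixed_pat *fixed[256];

static void add_fixed(const char *pat)
{
  struct fixed_pat *new = malloc(sizeof(*new)), **pp;

  new->str = pat;
  new->len = strlen(pat);

  // Insertion sort: keep each per-byte list longest first.
  pp = &fixed[(unsigned char)*pat];
  while (*pp && (*pp)->len > new->len) pp = &(*pp)->next;
  new->next = *pp;
  *pp = new;
}

// Length of the longest fixed pattern matching at line+i, or 0 for none.
static int match_at(const char *line, int i)
{
  struct fixed_pat *fp;

  for (fp = fixed[(unsigned char)line[i]]; fp; fp = fp->next)
    if (!strncmp(line+i, fp->str, fp->len)) return fp->len;  // longest wins
  return 0;
}
```

Scanning a line is then one table lookup per byte, and any byte whose bucket is NULL is rejected without a single strncmp().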
_______________________________________________
Toybox mailing list
[email protected]
http://lists.landley.net/listinfo.cgi/toybox-landley.net
