On 9/15/22 20:42, Rob Landley wrote:
> On 9/15/22 16:32, enh wrote:
>> ah, but on another box with 2021-10-22 it's broken. so it looks like
>> "real" xxd had the same bug and fixed it recently?
>
> Easy enough to fix, I was just confused by "the one in debian fails the
> same way"...
>
>>> A simple optimization would be to concat all patterns into one huge
>>> regex pattern with '|'.
>>
>> Which you could do in your pattern creation just as easily as grep
>> could do so internally? Except when I tried it:
>>
>>   $ time toybox grep -Ef <(tr '\n' '|' < test_large_10000 | head -c -1) \
>>     test_large_10000
>>
>> The result was even _slower_ on glibc (and gave me "out of memory" on
>> musl).
>>
>> fwiw, BSD grep which we used to use before had
>> https://android.googlesource.com/platform/system/core/+/kitkat-release/toolbox/grep/fastgrep.c#32
>> which (although i haven't tested) i always assumed was for this.
>
> Yeah, doing more ourselves is the obvious way to go...
I punted on most of the utf8 stuff in there. Let me know if I should
revisit that. (Right now I'm treating ANY string that has a char > 127
in it as "hand off to the regex engine" instead of using the fast path
-F logic. In theory I only need to do that for -i, but see recent email
about reordering width zero unicode code points... if the regex engine
does it, it's not my fault when it's wrong. :)

>> I could also sort the patterns by first character and loop through
>> calling the appropriate regexes based on character at string
>> position... which is harder than you think because there could be a |
>> later in the pattern so that optimization only applies to a _subset_
>> of patterns. (First char being . or * or ( kinda throws that off too,
>> and that optimization would also have to be aware of case
>> insensitivity...)
>
> I was trying to avoid writing another regex implementation, but a fast
> subset one that does parallel cache-local searching for simple patterns
> and hands off to libc for anything complicated isn't that hard. And I
> can even cheat with a bitmask for starting byte:

What I did was simpler: filter the incoming TT.e list (after it's been
split at \n and the TT.f entries collated into it, which the code was
already doing) into fixed and regex categories (ala autodetect when I
can apply -F logic, with basic support for non-unicode case
insensitivity, backslash escapes, and "." not in the first position),
then bucket sort the resulting -F list into an array of 256 linked
lists by starting character (punting for now: a pattern starting with
"." autodetects as regex, not fixed, although the actual -F flag
overrides this), and then use TT.fixed[c] being NULL to mean nothing
can match here (sketched below).

Further optimization: sorting each per-character list by length in the
setup phase means the actual search logic can stop at the first hit,
because it's the longest one. No bitmask needed:

https://github.com/landley/toybox/commit/a7e49c3c7860

(Then it falls back to running the same old regex logic for any
non-fixed patterns, where regexec searches the whole string from start
to finish for each pattern. If there aren't any, the loop naturally
skips itself by traversing zero entries; if there are a small number,
the performance hit shouldn't be too bad.)

It's still slower than the gnu/dammit grep (0.9 seconds vs 0.3 seconds
on my laptop), but hopefully good enough.

Try now?

Rob

P.S. I added a speed test to grep.test (5 second timeout, which should
still pass on the original raspberry pi model A from 2012, not that I
have one), using base64 on a long seq to get the many unique test lines
instead of xxd /dev/urandom, because urandom is slow at producing
nontrivial amounts of data. Trying to feed seq into sha3sum or similar
was equally slow: running the test should not be an order of magnitude
faster than generating the test data.
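A minimal standalone sketch of that setup, with hypothetical names
rather than toybox's actual TT.* globals (see the commit above for the
real code): each pre-screened fixed pattern goes into a bucket indexed
by its first byte, inserted longest-first.

  #include <stdlib.h>
  #include <string.h>

  // One pre-screened fixed (-F style) pattern, in a per-first-byte list.
  struct fixed_pat {
    struct fixed_pat *next;
    char *str;
    size_t len;
  };

  // 256 buckets indexed by a pattern's first byte: a NULL entry means
  // no fixed pattern can start with that byte.
  static struct fixed_pat *fixed[256];

  // Add a pattern to its first-byte bucket, keeping each list sorted
  // longest-first so a search can stop at its first (longest) hit.
  static void add_fixed(char *str)
  {
    struct fixed_pat *new = malloc(sizeof(*new)), **pp;

    new->str = str;
    new->len = strlen(str);
    for (pp = fixed + *(unsigned char *)str; *pp; pp = &(*pp)->next)
      if ((*pp)->len <= new->len) break;
    new->next = *pp;
    *pp = new;
  }

Longest-first insertion costs a little extra work at setup time, but
it's paid once per pattern instead of once per search position.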
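And the matching side, continuing the same sketch (again hypothetical
names, with the -i and utf8 punting described above omitted):
fixed[c] being NULL skips positions where nothing can start, and
anything that didn't qualify as fixed still goes through the usual
regexec() loop.

  #include <regex.h>
  #include <string.h>

  // Longest fixed-pattern match starting at line+i, or 0 for no match.
  static size_t match_fixed(char *line, size_t len, size_t i)
  {
    struct fixed_pat *fp;

    // A NULL bucket means nothing can match at this position, and the
    // longest-first ordering means the first hit is the longest one.
    for (fp = fixed[*(unsigned char *)(line+i)]; fp; fp = fp->next)
      if (fp->len <= len-i && !memcmp(line+i, fp->str, fp->len))
        return fp->len;

    return 0;
  }

  // Per-line driver: fast path first, then the old regexec() loop for
  // any leftover non-fixed patterns.
  static int match_line(char *line, regex_t *regs, int nregs)
  {
    size_t i, len = strlen(line);
    int j;

    for (i = 0; i < len; i++) if (match_fixed(line, len, i)) return 1;

    // With zero leftover regexes this loop traverses nothing, which is
    // the "naturally skips itself" behavior described above.
    for (j = 0; j < nregs; j++)
      if (!regexec(regs+j, line, 0, 0, 0)) return 1;

    return 0;
  }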
