On 9/15/22 16:32, enh wrote: > ah, but on another box with 2021-10-22 it's broken. so it looks like "real" > xxd > had the same bug and fixed it recently?
Easy enough to fix, I was just confused by "the one in debian fails the same way"... > Huh. Oh well, tangent. I'll assume the test you sent me is the test you > meant to > send me. > > > A simple optimization would be to concat all patterns into one huge > regex > > pattern with '|'. > > Which you could do in your pattern creation just as easily as grep could > do so > internally? Except when I tried it: > > $ time toybox grep -Ef <(tr '\n' '|' < test_large_10000 | head -c -1) \ > test_large_10000 > > The result was even _slower_ on glibc (and gave me "out of memory" on > musl). > > fwiw, BSD grep which we used to use before > had > https://android.googlesource.com/platform/system/core/+/kitkat-release/toolbox/grep/fastgrep.c#32 > which (although i haven't tested) i always assumed was for this. Yeah, doing more ourselves is the obvious way to go... > I could also sort the patterns by first character and loop through > calling the > appropriate regexes based on character at string position... which is > harder > than you think because there could be a | later in the pattern so that > optimization only applies to a _subset_ of patterns. (First char being . > or * or > ( kinda throws that off to, and that optimization would also have to be > aware of > case insensitivity...) I was trying to avoid writing another regex implementation, but a fast subset one that does parallel cache-local searching for simple patterns and hands off to libc for anything complicated isn't that hard. And I can even cheat with a bitmask for starting byte: unsigned mask[8], len; char *str; for (len = 0; len<str; len++) { if (!mask[char>>5][char&31]) continue; // actual comparison } > Gnu grep has its own bespoke regex implementation internally, because > gnu. While > I _have_ historically written my own regex implementation from scratch > (using > glob syntax rather than regex but same general idea), I chose not to do > so here > because project philosophy. This has downsides. There are things that can > be > done to speed it up, but I'd prefer a more real world test case to > optimize > against if we go down that path? (This is why I usually wait until > somebody > complains before optimizing or elaborating too much: they bring test > cases. :) > > note that this is the second instance we've had of this problem. the previous > googler to mention this was on a different team, working on something entirely > different, but using grep in a similar way. they both boil down to "looking > for > one of a large number of files in a very large build log" though. > > yochiang's example here only looks artificial because it deliberately is --- > it's a way to reproduce without having to pay to do massive Android builds. Oh sure. Important use case, I'm on it. I just want to make sure I fix it properly. (I asked because I was mentally thumbing through hashing approaches, but 256 bytes is small enough to just have an array.) Plumbing-wise this is sort of an autodetected per-pattern -F logic, except how does that interact with -i... /me wanders off to pace and stare at the ceiling a bit... Rob _______________________________________________ Toybox mailing list [email protected] http://lists.landley.net/listinfo.cgi/toybox-landley.net
