heh... funny you should mention "." in the first position... $ echo "foo.jar" | old-toybox grep '\.jar' foo.jar $ echo "foo.jar" | new-toybox grep '\.jar' $
that a simplified case based on the real-world build break caused by https://cs.android.com/android/platform/superproject/+/master:prebuilts/cmdline-tools/Android.bp;l=94;drc=6fa24e3c9dce05b5440b68997b58df791dddd617 merging zero jar files into one instead of all the jar files :-) On Tue, Sep 27, 2022 at 1:03 AM Rob Landley <[email protected]> wrote: > On 9/27/22 01:47, Yi-yo Chiang wrote: > > (I suffered a system failure during the weekend and lost all my browser > tabs and > > consequently lost track of all my work...) > > > > Might be a stupid question, but what's so special about ". in first > position"? > > Because the new fast path iterates over each character in the input > string, and > at each position checks the patterns that start with that character: > > for (ss = start; ss-line<ulen; ss++) { > ii = FLAG(i) ? toupper(*ss) : *ss; > for (seek = TT.fixed[ii]; seek; seek = seek->next) { > > If I allow "." in first position then I need to repeat the inner loop > starting > with seek = TT.fixed['.'] at each position, which is doable but awkward > enough I > thought I'd wait for somebody to complain. :) > > > Doesn't ANY '.' in ANY position (sans <backslash><dot>) indicate a regex > (not > > fixed pattern)? > > The new fixed pattern search logic can (when FLAG(F) isn't set) handle some > simple regex logic: the "." single character wildcard character, ^ at > start and > $ at end, and \ escapes that make regex chars literal. (This is a subset > of what > the openbsd one was doing. They also did unicode stuff which I punted on: > any > byte > 127 puts the pattern into the slow path that falls back to the old > regcomp and have regexec scan each string from start to finish for each > pattern.) > > The new fast path logic has two main chunks, in parse_regex() after the > comment: > > // Convert to regex where appropriate > > It works out which patterns the fast path can handle. (Including some magic > tests like if what you've \ escaped is a ( and you're not in -E extended > regex > mode, then the escape is actually ACTIVATING it not deactivating it. Regex > escaping is generally weird, as I believe I complained here. The fast path > logic > only handles a subset of it, and punts to the rest to the slow path.) > > Once it's identified which patterns can be peeled out into the fast path, > it > then bucket sorts them by first character into a 256 entry array. (Since I > punt > anything with UTF-8 in it to the slow path the array would only need to be > 128 > bytes long... except -F can make ANY string take the fast path because it > forces > literal match with no special behavior of any character. So I need the top > half > of the array for that.) > > Then the logic that actually USES the -F patterns is in do_grep() after > the comment: > > // Handle "fixed" (literal) matches (if any) > > Which includes those two loops I pasted above. > > I can complicate the fast path a bit more if additional use cases show up, > but > in the absence of people giving me real world data... I implemented more > than > just the new test case, but less than openbsd was doing. > > Rob > > P.S. if you have a file of filenames you probably want both -F and -f > because > otherwise "potato.txt" will match "potatootxt" because regex. But > presumably > that was true of the gnu/dammit and openbsd vesions too... >
_______________________________________________ Toybox mailing list [email protected] http://lists.landley.net/listinfo.cgi/toybox-landley.net
