Re: , , and backtracking.
Jonathan Scott Duff wrote: How do C and C differ with respect to backtracking? For instance, foobar ~~ / [a..z]+ [ ... ] /; Both sides of the C happen in parallel, so I would guess that they both match foo then stop. Please correct me if that's wrong. As written, this match would fail, since '[a..z]+' would match foobar while '[ ... ]' would match foo. '' requires that both sides match the same start and end points. I suppose that it might be worth considering requiring only the start points to match, and having the conjoined match use the earliest end point; so that the above match would then match foo instead of failing. But it's entirely possible that there's something faulty with this approach. The difference between '' and '' is that '' evaluates the two sides in an arbitrary order, possibly even in parallel; '' always evaluates the left side first, and only bothers with the right side if the left side matched. -- Jonathan Dataweaver Lang
Re: , , and backtracking.
On Thu, Sep 06, 2007 at 01:25:12PM -0500, Patrick R. Michaud wrote: : Were we using the procedural conjunction: : : foobar ~~ / [a..z]+ [ ... ] /; : : I would guess that the LHS matches as much as it can (foobar), then : the RHS matches foo [...and then backtracks the LHS until a : conjunctional match is found...] : : Or it's much simpler than that and both of the regexes above just fail : because of the greediness of C+ and there is no intra-conjunction : backtracking. : : I think we definitely allow intra-conjunction backtracking. : PGE implements it that way. That's what I think. : On a somewhat similar question, what happens with a pattern : such as : : foobar ~~ / foo.+? | fooba / : : The LHS initially matches foob, but with backtracking could : eventually match foobar. Do the longest-token semantics : in this case cause the RHS to be dispatched first, even : though the token declaration of the LHS _could_ match a : longer token prefix? Yow. ICATBW. Non-greedy matching is somewhat antithetical to longest-token matching. But basically it boils down to this: Does the longest-token matcher ignore the ? and do foo.+ | fooba or is there an implicit ordering above and beyond the DFA engine of foob | fooba || fooba | fooba || foobar | fooba || I think longest-token semantics have to trump minimal matching here, and my argument is this. Most uses of *? have additional information on what terminates it, either implicitly in what it is matching, or explicitly in the next bit of regex. That is, you'd typically see either foo\w+? | fooba or foo.+? wb | fooba In either case, the clear intent is to match foobar over fooba. Therefore I think the DFA matcher just strips ? and does its ordinary character by character match, relying on that extra info to match the real extent of the quantifier. Larry
Re: , , and backtracking.
On Thu, 2007-09-06 at 12:37 -0700, Larry Wall wrote: Yow. ICATBW. The what now? -'f
Re: , , and backtracking.
On Thu, Sep 06, 2007 at 12:37:37PM -0700, Larry Wall wrote: On Thu, Sep 06, 2007 at 01:25:12PM -0500, Patrick R. Michaud wrote: : On a somewhat similar question, what happens with a pattern : such as : : foobar ~~ / foo.+? | fooba / : : The LHS initially matches foob, but with backtracking could : eventually match foobar. Do the longest-token semantics : in this case cause the RHS to be dispatched first, even : though the token declaration of the LHS _could_ match a : longer token prefix? Yow. ICATBW. Non-greedy matching is somewhat antithetical to longest-token matching. I agree. One thought I had was that perhaps non-greedy matching could also terminate the token prefix. [...] I think longest-token semantics have to trump minimal matching here, and my argument is this. Most uses of *? have additional information on what terminates it, either implicitly in what it is matching, or explicitly in the next bit of regex. That is, you'd typically see either foo\w+? | fooba or foo.+? wb | fooba In either case, the clear intent is to match foobar over fooba. Therefore I think the DFA matcher just strips ? and does its ordinary character by character match, relying on that extra info to match the real extent of the quantifier. Does this still hold true for a non-greedy quantifier in the middle of an expression... ? I.e., foobazbar deborah ~~ /foo .+? b.r | fooba | foobazb / matches foobazbar debor ? (I completely grant that the examples I'm coming up with here may be completely nonsensical in real application, but I'm just exploring the space a bit.) Pm
Re: , , and backtracking.
On Thu, Sep 06, 2007 at 03:49:42PM -0700, Larry Wall wrote: : On Thu, Sep 06, 2007 at 04:02:19PM -0500, Patrick R. Michaud wrote: : : I agree. One thought I had was that perhaps non-greedy matching : : could also terminate the token prefix. : : Well, that's more or less arguing it the other way. It kind of assumes : your fooba-ish arguments are smart enough to test for non-r after. The more I think about it, the more I like this approach. Most token should be matched in ratchet mode, and I think minimal matching is perhaps a form of cheating: Look for the nearest end of this without me actually bothering to parse the middle. And certainly any reasonable grammer is going to be checking that a keyword is properly terminated by a word boundary anyway. Larry
Re: , , and backtracking.
On Thu, Sep 06, 2007 at 04:02:19PM -0500, Patrick R. Michaud wrote: : I agree. One thought I had was that perhaps non-greedy matching : could also terminate the token prefix. Well, that's more or less arguing it the other way. It kind of assumes your fooba-ish arguments are smart enough to test for non-r after. : [...] : I think longest-token semantics have to trump minimal matching here, : and my argument is this. Most uses of *? have additional information : on what terminates it, either implicitly in what it is matching, or : explicitly in the next bit of regex. That is, you'd typically see : either : foo\w+? | fooba : or : foo.+? wb | fooba : : In either case, the clear intent is to match foobar over fooba. : Therefore I think the DFA matcher just strips ? and does its ordinary : character by character match, relying on that extra info to match : the real extent of the quantifier. : : Does this still hold true for a non-greedy quantifier in the : middle of an expression... ? I.e., : : foobazbar deborah ~~ /foo .+? b.r | fooba | foobazb / : : matches foobazbar debor ? I suppose one approach would be to limit the DFA to the longest known constant string, and then rerun any variable branches individually to see if they will actually match something that long. Alternately, if the DFA engine is smart enough to kick out all the matches of the foo .+? b.r branch then it could conceivably be taught to return the shortest one rather than the longest one. Course, the question after than is, what if you mix minimal with maximal quantifiers? Another consideration is the efficiency of matching that pattern against foobazbar ~ 'b' x 1. That would definitely run faster if you cut off the DFA after the longest known token, or as soon as the DFA reduces to one possibility, presuming you can know such a thing. Any way we work it, there will be a way to use it wrong. Larry