, , and backtracking.
How do C and C differ with respect to backtracking? For instance, foobar ~~ / [a..z]+ [ ... ] /; Both sides of the C happen in parallel, so I would guess that they both match foo then stop. Please correct me if that's wrong. Were we using the procedural conjunction: foobar ~~ / [a..z]+ [ ... ] /; I would guess that the LHS matches as much as it can (foobar), then the RHS matches foo. Since foobar ne foo, the regex engine backtracks on the LHS matching fooba and then the RHS matches foo again. Again, since fooba ne foo, the regex engine backtracks as before and this behavior continues until both sides match foo (or, in the general case, it can determine that the conjunctions can't match the same portion of the string) Or it's much simpler than that and both of the regexes above just fail because of the greediness of C+ and there is no intra-conjunction backtracking. So ... anyone care to enlighten me on how this is supposed to work? Thanks, -Scott -- Jonathan Scott Duff [EMAIL PROTECTED]
Re: , , and backtracking.
Jonathan Scott Duff wrote: How do C and C differ with respect to backtracking? For instance, foobar ~~ / [a..z]+ [ ... ] /; Both sides of the C happen in parallel, so I would guess that they both match foo then stop. Please correct me if that's wrong. As written, this match would fail, since '[a..z]+' would match foobar while '[ ... ]' would match foo. '' requires that both sides match the same start and end points. I suppose that it might be worth considering requiring only the start points to match, and having the conjoined match use the earliest end point; so that the above match would then match foo instead of failing. But it's entirely possible that there's something faulty with this approach. The difference between '' and '' is that '' evaluates the two sides in an arbitrary order, possibly even in parallel; '' always evaluates the left side first, and only bothers with the right side if the left side matched. -- Jonathan Dataweaver Lang
Re: , , and backtracking.
On Thu, Sep 06, 2007 at 01:25:12PM -0500, Patrick R. Michaud wrote: : Were we using the procedural conjunction: : : foobar ~~ / [a..z]+ [ ... ] /; : : I would guess that the LHS matches as much as it can (foobar), then : the RHS matches foo [...and then backtracks the LHS until a : conjunctional match is found...] : : Or it's much simpler than that and both of the regexes above just fail : because of the greediness of C+ and there is no intra-conjunction : backtracking. : : I think we definitely allow intra-conjunction backtracking. : PGE implements it that way. That's what I think. : On a somewhat similar question, what happens with a pattern : such as : : foobar ~~ / foo.+? | fooba / : : The LHS initially matches foob, but with backtracking could : eventually match foobar. Do the longest-token semantics : in this case cause the RHS to be dispatched first, even : though the token declaration of the LHS _could_ match a : longer token prefix? Yow. ICATBW. Non-greedy matching is somewhat antithetical to longest-token matching. But basically it boils down to this: Does the longest-token matcher ignore the ? and do foo.+ | fooba or is there an implicit ordering above and beyond the DFA engine of foob | fooba || fooba | fooba || foobar | fooba || I think longest-token semantics have to trump minimal matching here, and my argument is this. Most uses of *? have additional information on what terminates it, either implicitly in what it is matching, or explicitly in the next bit of regex. That is, you'd typically see either foo\w+? | fooba or foo.+? wb | fooba In either case, the clear intent is to match foobar over fooba. Therefore I think the DFA matcher just strips ? and does its ordinary character by character match, relying on that extra info to match the real extent of the quantifier. Larry
Re: , , and backtracking.
On Thu, 2007-09-06 at 12:37 -0700, Larry Wall wrote: Yow. ICATBW. The what now? -'f
Re: [svn:perl6-synopsis] r14447 - doc/trunk/design/syn
So 'orelse' is exactly like '//', except that the result of the left side gets passed to the right side as an error message. Is there a reason to make this exception, as opposed to altering '//' to behave exactly like 'orelse' does? -- Jonathan Dataweaver Lang
Re: , , and backtracking.
On Thu, Sep 06, 2007 at 12:37:37PM -0700, Larry Wall wrote: On Thu, Sep 06, 2007 at 01:25:12PM -0500, Patrick R. Michaud wrote: : On a somewhat similar question, what happens with a pattern : such as : : foobar ~~ / foo.+? | fooba / : : The LHS initially matches foob, but with backtracking could : eventually match foobar. Do the longest-token semantics : in this case cause the RHS to be dispatched first, even : though the token declaration of the LHS _could_ match a : longer token prefix? Yow. ICATBW. Non-greedy matching is somewhat antithetical to longest-token matching. I agree. One thought I had was that perhaps non-greedy matching could also terminate the token prefix. [...] I think longest-token semantics have to trump minimal matching here, and my argument is this. Most uses of *? have additional information on what terminates it, either implicitly in what it is matching, or explicitly in the next bit of regex. That is, you'd typically see either foo\w+? | fooba or foo.+? wb | fooba In either case, the clear intent is to match foobar over fooba. Therefore I think the DFA matcher just strips ? and does its ordinary character by character match, relying on that extra info to match the real extent of the quantifier. Does this still hold true for a non-greedy quantifier in the middle of an expression... ? I.e., foobazbar deborah ~~ /foo .+? b.r | fooba | foobazb / matches foobazbar debor ? (I completely grant that the examples I'm coming up with here may be completely nonsensical in real application, but I'm just exploring the space a bit.) Pm
Re: [svn:perl6-synopsis] r14447 - doc/trunk/design/syn
On Thu, Sep 06, 2007 at 01:40:20PM -0700, Jonathan Lang wrote: : So 'orelse' is exactly like '//', except that the result of the left : side gets passed to the right side as an error message. Is there a : reason to make this exception, as opposed to altering '//' to behave : exactly like 'orelse' does? How 'bout, it's convenient to have the simpler defaulting semantics when you don't care what kind of undefined value is on the left. Larry
Re: , , and backtracking.
On Thu, Sep 06, 2007 at 03:49:42PM -0700, Larry Wall wrote: : On Thu, Sep 06, 2007 at 04:02:19PM -0500, Patrick R. Michaud wrote: : : I agree. One thought I had was that perhaps non-greedy matching : : could also terminate the token prefix. : : Well, that's more or less arguing it the other way. It kind of assumes : your fooba-ish arguments are smart enough to test for non-r after. The more I think about it, the more I like this approach. Most token should be matched in ratchet mode, and I think minimal matching is perhaps a form of cheating: Look for the nearest end of this without me actually bothering to parse the middle. And certainly any reasonable grammer is going to be checking that a keyword is properly terminated by a word boundary anyway. Larry
Re: [svn:perl6-synopsis] r14447 - doc/trunk/design/syn
[EMAIL PROTECTED] wrote: Author: larry Date: Thu Sep 6 09:31:16 2007 New Revision: 14447 + +C infix:andthen , proceed on success + +test1() andthen test2() + +Returns the left argument if the left argument indicates failure +(that is, if the result is undefined). Otherwise it +evaluates and returns the right argument. + +If the right side is a block or pointy block, the result of the left +side is bound to any arguments of the block. If the right side is +not a block, a block scope is assumed around the right side, and the +result of the left side is implicitly bound to C$_ for the scope +of the right side. That is, + +test1() andthen test2() + +is equivalent to + +test1() andthen - $_ { test2() } + +There is no corresponding high-precedence version. + =back +C infix:orelse , proceed on failure -$value err $default +test1() orelse test2() -Returns the left argument if it's defined, otherwise evaluates and -returns the right argument. In list context forces a false return -to mean C(). See C// above for high-precedence version. +Returns the left argument if the left argument indicates success +(that is, if the result is defined). Otherwise it evaluates and +returns the right argument. + +If the right side is a block or pointy block, the result of the left +side is bound to any arguments of the block. If the right side is +not a block, a block scope is assumed around the right side, and the +result of the left side is implicitly bound to C$! for the scope +of the right side. That is, + +test1() orelse test2() + +is equivalent to + +test1() orelse - $! { test2() } + +(The low-precedence C// operator is similar, but does not set C$! or +treat blocks specially.) =back Do the results of andthen and orelse really bind to ANY arguments of the second block? If the second block has two parameters it makes more sense to me for the results to bind to the first parameter and nothing to bind to the second parameter. Joe Gottman
[svn:perl6-synopsis] r14449 - doc/trunk/design/syn
Author: larry Date: Thu Sep 6 17:12:02 2007 New Revision: 14449 Modified: doc/trunk/design/syn/S05.pod Log: old ?foo is now +foo to suppress capture new ?foo now is zero-width like !foo clarifications on backtracking and longest-token semantics minimal quantifiers are now considered to terminate a longest token Modified: doc/trunk/design/syn/S05.pod == --- doc/trunk/design/syn/S05.pod(original) +++ doc/trunk/design/syn/S05.podThu Sep 6 17:12:02 2007 @@ -14,9 +14,9 @@ Maintainer: Patrick Michaud [EMAIL PROTECTED] and Larry Wall [EMAIL PROTECTED] Date: 24 Jun 2002 - Last Modified: 17 Aug 2007 + Last Modified: 6 Sep 2007 Number: 5 - Version: 63 + Version: 64 This document summarizes Apocalypse 5, which is about the new regex syntax. We now try to call them Iregex rather than regular @@ -247,13 +247,13 @@ The new C:s (C:sigspace) modifier causes whitespace sequences to be considered significant; they are replaced by a whitespace -matching rule, C ?ws . That is, +matching rule, C +ws . That is, m:s/ next cmd = condition/ is the same as: - m/ ?ws next ?ws cmd ?ws = ?ws condition/ + m/ +ws next +ws cmd +ws = +ws condition/ which is effectively the same as: @@ -265,18 +265,18 @@ or equivalently, - m { (a|\*) ?ws (b|\+) } + m { (a|\*) +ws (b|\+) } -C ?ws can't decide what to do until it sees the data. -It still does the right thing. If not, define your own C ?ws +C +ws can't decide what to do until it sees the data. +It still does the right thing. If not, define your own C ws and C:sigspace will use that. In general you don't need to use C:sigspace within grammars because the parser rules automatically handle whitespace policy for you. In this context, whitespace often includes comments, depending on how the grammar chooses to define its whitespace rule. Although the -default C ?ws subrule recognizes no comment construct, any -grammar is free to override the rule. The C ?ws rule is not +default C +ws subrule recognizes no comment construct, any +grammar is free to override the rule. The C +ws rule is not intended to mean the same thing everywhere. It's also possible to pass an argument to C:sigspace specifying @@ -285,7 +285,7 @@ important to distinguish the significant whitespace in the pattern from the whitespace being matched, so we'll call the pattern's whitespace Isigspace, and generally reserve Iwhitespace to indicate whatever -C ?ws matches in the current grammar. The correspondence +C +ws matches in the current grammar. The correspondence between sigspace and whitespace is primarily metaphorical, which is why the correspondence is both useful and (potentially) confusing. @@ -336,16 +336,16 @@ If followed by an Cx, it means repetition. Use C:x(4) for the general form. So - s:4x [ (?ident) = (\N+) $$] [$0 = $1]; + s:4x [ (+ident) = (\N+) $$] [$0 = $1]; is the same as: - s:x(4) [ (?ident) = (\N+) $$] [$0 = $1]; + s:x(4) [ (+ident) = (\N+) $$] [$0 = $1]; which is almost the same as: $_.pos = 0; - s:c[ (?ident) = (\N+) $$] = $0 = $1 for 1..4; + s:c[ (+ident) = (\N+) $$] = $0 = $1 for 1..4; except that the string is unchanged unless all four matches are found. However, ranges are allowed, so you can say C:x(1..4) to change anywhere @@ -450,7 +450,7 @@ and these are equivalent to $string ~~ m/^ \d+: $/; -$string ~~ m/^ ?ws \d+: ?ws $/; +$string ~~ m/^ +ws \d+: +ws $/; =item * @@ -584,7 +584,8 @@ erroneous to assume either order happens consistently. The C form guarantees left-to-right order, and backtracking makes the right argument vary faster than the left. In other words, C and C|| establish -sequence points. +sequence points. The left side may be backtracked into when backtracking +is allowed into the construct as a whole. The C operator is list associative like C|, but has slightly tighter precedence. Likewise C has slightly tighter precedence @@ -1008,10 +1009,14 @@ To pass a string with leading whitespace, or to interpolate any values into the string, you must use the parenthesized form. -If the first character is a plus or minus, the initial identifier -is taken as a character class, so the first character after the -identifier doesn't matter in this case, and you can use whitespace -however you like. Therefore +If the first character is a plus or minus, the rest of the assertion +is parsed as a set of character classes (though the definition of +character class is intentionally vague, and may include any other rule +whether it matches characters exclusively or not). + +An initial identifier is taken as a character class, so the first +character after the identifier doesn't matter in this case, and you +can use whitespace however you like. Therefore foo+bar-baz @@ -1054,15 +1059,21 @@
Re: [svn:perl6-synopsis] r14447 - doc/trunk/design/syn
Larry Wall wrote: Jonathan Lang wrote: : So 'orelse' is exactly like '//', except that the result of the left : side gets passed to the right side as an error message. Is there a : reason to make this exception, as opposed to altering '//' to behave : exactly like 'orelse' does? How 'bout, it's convenient to have the simpler defaulting semantics when you don't care what kind of undefined value is on the left. If you don't care what kind of undefined value is on the left, you probably won't notice if it's passed to the right; and the optimizer may very well end up implementing the simpler semantics in this case, even if you use 'orelse'. And if you explicitly want to use the simpler semantics, you can come very close by blocking the right side in a block that takes no arguments. Conversely: if you do care about the undefined value, the subtle distinction between '//' and 'orelse' (beyond the implied difference in precedence) will be yet another language quirk that you'll have to remember. Very inconvenient. -- Jonathan Dataweaver Lang
[svn:perl6-synopsis] r14453 - doc/trunk/design/syn
Author: larry Date: Thu Sep 6 20:56:39 2007 New Revision: 14453 Modified: doc/trunk/design/syn/S03.pod Log: fixed it right this time... Modified: doc/trunk/design/syn/S03.pod == --- doc/trunk/design/syn/S03.pod(original) +++ doc/trunk/design/syn/S03.podThu Sep 6 20:56:39 2007 @@ -3632,7 +3632,7 @@ If you already have a capture variable, you can interpolate all of its bits at once using the C prefix:| operator: -my $capture := func(); +my (|$capture) := func(); push |$capture; =head1 Traversing arrays in parallel
[svn:perl6-synopsis] r14452 - doc/trunk/design/syn
Author: larry Date: Thu Sep 6 20:50:04 2007 New Revision: 14452 Modified: doc/trunk/design/syn/S03.pod Log: extra \ found by sunnavy++ Modified: doc/trunk/design/syn/S03.pod == --- doc/trunk/design/syn/S03.pod(original) +++ doc/trunk/design/syn/S03.podThu Sep 6 20:50:04 2007 @@ -3632,7 +3632,7 @@ If you already have a capture variable, you can interpolate all of its bits at once using the C prefix:| operator: -my \$capture := func(); +my $capture := func(); push |$capture; =head1 Traversing arrays in parallel
[svn:perl6-synopsis] r14451 - doc/trunk/design/syn
Author: larry Date: Thu Sep 6 20:36:26 2007 New Revision: 14451 Modified: doc/trunk/design/syn/S03.pod Log: typo spotted by sunnavy++ Modified: doc/trunk/design/syn/S03.pod == --- doc/trunk/design/syn/S03.pod(original) +++ doc/trunk/design/syn/S03.podThu Sep 6 20:36:26 2007 @@ -2830,7 +2830,7 @@ with C+, C~ and C? respectively to indicate whether the bitwise operation is done on a number, a string, or a single bit. But that's just a naming convention, and if you wanted to add a new -bitwise C¬ operator, you'd have the add the C+¬, C~¬, and C?¬ +bitwise C¬ operator, you'd have to add the C+¬, C~¬, and C?¬ operators yourself. Similarly, the carets that exclude the endpoints on ranges are there by convention only.
Re: [svn:perl6-synopsis] r14447 - doc/trunk/design/syn
On Thu, Sep 06, 2007 at 08:32:55PM -0400, Joe Gottman wrote: : Do the results of andthen and orelse really bind to ANY arguments of : the second block? If the second block has two parameters it makes more : sense to me for the results to bind to the first parameter and nothing : to bind to the second parameter. I suspect the left side is a Capture context, so test1() could return multiple arguments to be bound to the right side. In any case, if test1() calls fail() then the orelse should fail regardless of whether the orelse is in list or scalar context. Should probably make the orelse itself fail if the binding doesn't work too. Larry
[svn:perl6-synopsis] r14450 - doc/trunk/design/syn
Author: larry Date: Thu Sep 6 19:29:21 2007 New Revision: 14450 Modified: doc/trunk/design/syn/S05.pod Log: typo from [particle]++ Modified: doc/trunk/design/syn/S05.pod == --- doc/trunk/design/syn/S05.pod(original) +++ doc/trunk/design/syn/S05.podThu Sep 6 19:29:21 2007 @@ -1280,7 +1280,7 @@ alpha # match a letter and capture in $alpha +alpha# match a letter, don't capture -?alpha# much null before a letter, don't capture +?alpha# match null before a letter, don't capture =item *
Re: [svn:perl6-synopsis] r14449 - doc/trunk/design/syn
On 9/6/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: @@ -1254,6 +1273,17 @@ =item * +A leading C? indicates a positive zero-width assertion, and like C! +merely reparses the rest of the assertion recursively as if the C? +were not there. In addition to forcing zero-width, it also suppresses +any named capture: + +alpha # match a letter and capture in $alpha ++alpha# match a letter, don't capture +?alpha# much null before a letter, don't capture + +=item * + A leading C~~ indicates a recursive call back into some or all of the current rule. An optional argument indicates which subpattern to re-use, and if provided must resolve to a single subpattern. s/much/match/
Re: , , and backtracking.
On Thu, Sep 06, 2007 at 04:02:19PM -0500, Patrick R. Michaud wrote: : I agree. One thought I had was that perhaps non-greedy matching : could also terminate the token prefix. Well, that's more or less arguing it the other way. It kind of assumes your fooba-ish arguments are smart enough to test for non-r after. : [...] : I think longest-token semantics have to trump minimal matching here, : and my argument is this. Most uses of *? have additional information : on what terminates it, either implicitly in what it is matching, or : explicitly in the next bit of regex. That is, you'd typically see : either : foo\w+? | fooba : or : foo.+? wb | fooba : : In either case, the clear intent is to match foobar over fooba. : Therefore I think the DFA matcher just strips ? and does its ordinary : character by character match, relying on that extra info to match : the real extent of the quantifier. : : Does this still hold true for a non-greedy quantifier in the : middle of an expression... ? I.e., : : foobazbar deborah ~~ /foo .+? b.r | fooba | foobazb / : : matches foobazbar debor ? I suppose one approach would be to limit the DFA to the longest known constant string, and then rerun any variable branches individually to see if they will actually match something that long. Alternately, if the DFA engine is smart enough to kick out all the matches of the foo .+? b.r branch then it could conceivably be taught to return the shortest one rather than the longest one. Course, the question after than is, what if you mix minimal with maximal quantifiers? Another consideration is the efficiency of matching that pattern against foobazbar ~ 'b' x 1. That would definitely run faster if you cut off the DFA after the longest known token, or as soon as the DFA reduces to one possibility, presuming you can know such a thing. Any way we work it, there will be a way to use it wrong. Larry