, , and backtracking.

2007-09-06 Thread Jonathan Scott Duff
How do C and C differ with respect to backtracking?  For instance,

foobar ~~ / [a..z]+  [ ... ] /;

Both sides of the C happen in parallel, so I would guess that they
both match foo then stop. Please correct me if that's wrong.

Were we using the procedural conjunction:

foobar ~~ / [a..z]+  [ ... ] /;

I would guess that the LHS matches as much as it can (foobar), then
the RHS matches foo. Since foobar ne foo, the regex engine
backtracks on the LHS matching fooba and then the RHS matches foo
again. Again, since fooba ne foo, the regex engine backtracks as
before and this behavior continues until both sides match foo (or, in
the general case, it can determine that the conjunctions can't match the
same portion of the string)

Or it's much simpler than that and both of the regexes above just fail
because of the greediness of C+ and there is no intra-conjunction
backtracking.

So ... anyone care to enlighten me on how this is supposed to work?

Thanks,

-Scott
-- 
Jonathan Scott Duff
[EMAIL PROTECTED]


Re: , , and backtracking.

2007-09-06 Thread Jonathan Lang
Jonathan Scott Duff wrote:
 How do C and C differ with respect to backtracking?  For instance,

 foobar ~~ / [a..z]+  [ ... ] /;

 Both sides of the C happen in parallel, so I would guess that they
 both match foo then stop. Please correct me if that's wrong.

As written, this match would fail, since '[a..z]+' would match
foobar while '[ ... ]' would match foo.  '' requires that both
sides match the same start and end points.  I suppose that it might be
worth considering requiring only the start points to match, and having
the conjoined match use the earliest end point; so that the above
match would then match foo instead of failing.  But it's entirely
possible that there's something faulty with this approach.

The difference between '' and '' is that '' evaluates the two
sides in an arbitrary order, possibly even in parallel; '' always
evaluates the left side first, and only bothers with the right side if
the left side matched.

-- 
Jonathan Dataweaver Lang


Re: , , and backtracking.

2007-09-06 Thread Larry Wall
On Thu, Sep 06, 2007 at 01:25:12PM -0500, Patrick R. Michaud wrote:
:  Were we using the procedural conjunction:
:  
:  foobar ~~ / [a..z]+  [ ... ] /;
:  
:  I would guess that the LHS matches as much as it can (foobar), then
:  the RHS matches foo [...and then backtracks the LHS until a 
:  conjunctional match is found...]
: 
:  Or it's much simpler than that and both of the regexes above just fail
:  because of the greediness of C+ and there is no intra-conjunction
:  backtracking.
: 
: I think we definitely allow intra-conjunction backtracking.
: PGE implements it that way.

That's what I think.

: On a somewhat similar question, what happens with a pattern
: such as
: 
: foobar ~~ / foo.+? | fooba /
: 
: The LHS initially matches foob, but with backtracking could
: eventually match foobar.  Do the longest-token semantics
: in this case cause the RHS to be dispatched first, even
: though the token declaration of the LHS _could_ match a 
: longer token prefix?  

Yow.  ICATBW.  Non-greedy matching is somewhat antithetical to
longest-token matching.  But basically it boils down to this:
Does the longest-token matcher ignore the ?  and do

foo.+ | fooba

or is there an implicit ordering above and beyond the DFA engine of

foob | fooba ||
fooba | fooba ||
foobar | fooba ||

I think longest-token semantics have to trump minimal matching here,
and my argument is this.  Most uses of *? have additional information
on what terminates it, either implicitly in what it is matching, or
explicitly in the next bit of regex.  That is, you'd typically see
either

foo\w+? | fooba

or

foo.+? wb | fooba

In either case, the clear intent is to match foobar over fooba.
Therefore I think the DFA matcher just strips ? and does its ordinary
character by character match, relying on that extra info to match
the real extent of the quantifier.

Larry


Re: , , and backtracking.

2007-09-06 Thread Geoffrey Broadwell
On Thu, 2007-09-06 at 12:37 -0700, Larry Wall wrote:
 Yow.  ICATBW.

The what now?


-'f




Re: [svn:perl6-synopsis] r14447 - doc/trunk/design/syn

2007-09-06 Thread Jonathan Lang
So 'orelse' is exactly like '//', except that the result of the left
side gets passed to the right side as an error message.  Is there a
reason to make this exception, as opposed to altering '//' to behave
exactly like 'orelse' does?

-- 
Jonathan Dataweaver Lang


Re: , , and backtracking.

2007-09-06 Thread Patrick R. Michaud
On Thu, Sep 06, 2007 at 12:37:37PM -0700, Larry Wall wrote:
 On Thu, Sep 06, 2007 at 01:25:12PM -0500, Patrick R. Michaud wrote:
 : On a somewhat similar question, what happens with a pattern
 : such as
 : 
 : foobar ~~ / foo.+? | fooba /
 : 
 : The LHS initially matches foob, but with backtracking could
 : eventually match foobar.  Do the longest-token semantics
 : in this case cause the RHS to be dispatched first, even
 : though the token declaration of the LHS _could_ match a 
 : longer token prefix?  
 
 Yow.  ICATBW.  Non-greedy matching is somewhat antithetical to
 longest-token matching.  

I agree.  One thought I had was that perhaps non-greedy matching
could also terminate the token prefix.

 [...]
 I think longest-token semantics have to trump minimal matching here,
 and my argument is this.  Most uses of *? have additional information
 on what terminates it, either implicitly in what it is matching, or
 explicitly in the next bit of regex.  That is, you'd typically see
 either
 foo\w+? | fooba
 or
 foo.+? wb | fooba
 
 In either case, the clear intent is to match foobar over fooba.
 Therefore I think the DFA matcher just strips ? and does its ordinary
 character by character match, relying on that extra info to match
 the real extent of the quantifier.

Does this still hold true for a non-greedy quantifier in the
middle of an expression... ?  I.e.,

foobazbar deborah ~~ /foo .+? b.r | fooba | foobazb /

matches foobazbar debor ?

(I completely grant that the examples I'm coming up with here
may be completely nonsensical in real application, but I'm
just exploring the space a bit.)

Pm


Re: [svn:perl6-synopsis] r14447 - doc/trunk/design/syn

2007-09-06 Thread Larry Wall
On Thu, Sep 06, 2007 at 01:40:20PM -0700, Jonathan Lang wrote:
: So 'orelse' is exactly like '//', except that the result of the left
: side gets passed to the right side as an error message.  Is there a
: reason to make this exception, as opposed to altering '//' to behave
: exactly like 'orelse' does?

How 'bout, it's convenient to have the simpler defaulting semantics
when you don't care what kind of undefined value is on the left.

Larry


Re: , , and backtracking.

2007-09-06 Thread Larry Wall
On Thu, Sep 06, 2007 at 03:49:42PM -0700, Larry Wall wrote:
: On Thu, Sep 06, 2007 at 04:02:19PM -0500, Patrick R. Michaud wrote:
: : I agree.  One thought I had was that perhaps non-greedy matching
: : could also terminate the token prefix.
: 
: Well, that's more or less arguing it the other way.  It kind of assumes
: your fooba-ish arguments are smart enough to test for non-r after.

The more I think about it, the more I like this approach.  Most token
should be matched in ratchet mode, and I think minimal matching
is perhaps a form of cheating: Look for the nearest end of this
without me actually bothering to parse the middle.  And certainly
any reasonable grammer is going to be checking that a keyword is
properly terminated by a word boundary anyway.

Larry


Re: [svn:perl6-synopsis] r14447 - doc/trunk/design/syn

2007-09-06 Thread Joe Gottman

[EMAIL PROTECTED] wrote:

Author: larry
Date: Thu Sep  6 09:31:16 2007
New Revision: 14447

+
+C infix:andthen , proceed on success
+
+test1() andthen test2()
+
+Returns the left argument if the left argument indicates failure
+(that is, if the result is undefined).  Otherwise it
+evaluates and returns the right argument.
+
+If the right side is a block or pointy block, the result of the left
+side is bound to any arguments of the block.  If the right side is
+not a block, a block scope is assumed around the right side, and the
+result of the left side is implicitly bound to C$_ for the scope
+of the right side.  That is,
+
+test1() andthen test2()
+
+is equivalent to
+
+test1() andthen - $_ { test2() }
+
+There is no corresponding high-precedence version.
+
 =back
 


+C infix:orelse , proceed on failure
 
-$value err $default

+test1() orelse test2()
 
-Returns the left argument if it's defined, otherwise evaluates and

-returns the right argument.  In list context forces a false return
-to mean C().  See C// above for high-precedence version.
+Returns the left argument if the left argument indicates success
+(that is, if the result is defined).  Otherwise it evaluates and
+returns the right argument.
+
+If the right side is a block or pointy block, the result of the left
+side is bound to any arguments of the block.  If the right side is
+not a block, a block scope is assumed around the right side, and the
+result of the left side is implicitly bound to C$! for the scope
+of the right side.  That is,
+
+test1() orelse test2()
+
+is equivalent to
+
+test1() orelse - $! { test2() }
+
+(The low-precedence C// operator is similar, but does not set C$! or
+treat blocks specially.)
 
 =back
 

  
  Do the results of andthen and orelse really bind to ANY arguments of 
the second block?  If the second block has two parameters it makes more 
sense to me for the results to bind to the first parameter and nothing 
to bind to the second parameter.


Joe Gottman


[svn:perl6-synopsis] r14449 - doc/trunk/design/syn

2007-09-06 Thread larry
Author: larry
Date: Thu Sep  6 17:12:02 2007
New Revision: 14449

Modified:
   doc/trunk/design/syn/S05.pod

Log:
old ?foo is now +foo to suppress capture
new ?foo now is zero-width like !foo
clarifications on backtracking and longest-token semantics
minimal quantifiers are now considered to terminate a longest token


Modified: doc/trunk/design/syn/S05.pod
==
--- doc/trunk/design/syn/S05.pod(original)
+++ doc/trunk/design/syn/S05.podThu Sep  6 17:12:02 2007
@@ -14,9 +14,9 @@
Maintainer: Patrick Michaud [EMAIL PROTECTED] and
Larry Wall [EMAIL PROTECTED]
Date: 24 Jun 2002
-   Last Modified: 17 Aug 2007
+   Last Modified: 6 Sep 2007
Number: 5
-   Version: 63
+   Version: 64
 
 This document summarizes Apocalypse 5, which is about the new regex
 syntax.  We now try to call them Iregex rather than regular
@@ -247,13 +247,13 @@
 
 The new C:s (C:sigspace) modifier causes whitespace sequences
 to be considered significant; they are replaced by a whitespace
-matching rule, C ?ws .  That is,
+matching rule, C +ws .  That is,
 
  m:s/ next cmd =   condition/
 
 is the same as:
 
- m/ ?ws next ?ws cmd ?ws = ?ws condition/
+ m/ +ws next +ws cmd +ws = +ws condition/
 
 which is effectively the same as:
 
@@ -265,18 +265,18 @@
 
 or equivalently,
 
- m { (a|\*) ?ws (b|\+) }
+ m { (a|\*) +ws (b|\+) }
 
-C ?ws  can't decide what to do until it sees the data.
-It still does the right thing.  If not, define your own C ?ws 
+C +ws  can't decide what to do until it sees the data.
+It still does the right thing.  If not, define your own C ws 
 and C:sigspace will use that.
 
 In general you don't need to use C:sigspace within grammars because
 the parser rules automatically handle whitespace policy for you.
 In this context, whitespace often includes comments, depending on
 how the grammar chooses to define its whitespace rule.  Although the
-default C ?ws  subrule recognizes no comment construct, any
-grammar is free to override the rule.  The C ?ws  rule is not
+default C +ws  subrule recognizes no comment construct, any
+grammar is free to override the rule.  The C +ws  rule is not
 intended to mean the same thing everywhere.
 
 It's also possible to pass an argument to C:sigspace specifying
@@ -285,7 +285,7 @@
 important to distinguish the significant whitespace in the pattern from
 the whitespace being matched, so we'll call the pattern's whitespace
 Isigspace, and generally reserve Iwhitespace to indicate whatever
-C ?ws  matches in the current grammar. The correspondence
+C +ws  matches in the current grammar. The correspondence
 between sigspace and whitespace is primarily metaphorical, which is
 why the correspondence is both useful and (potentially) confusing.
 
@@ -336,16 +336,16 @@
 If followed by an Cx, it means repetition.  Use C:x(4) for the
 general form.  So
 
- s:4x [ (?ident) = (\N+) $$] [$0 = $1];
+ s:4x [ (+ident) = (\N+) $$] [$0 = $1];
 
 is the same as:
 
- s:x(4) [ (?ident) = (\N+) $$] [$0 = $1];
+ s:x(4) [ (+ident) = (\N+) $$] [$0 = $1];
 
 which is almost the same as:
 
  $_.pos = 0;
- s:c[ (?ident) = (\N+) $$] = $0 = $1 for 1..4;
+ s:c[ (+ident) = (\N+) $$] = $0 = $1 for 1..4;
 
 except that the string is unchanged unless all four matches are found.
 However, ranges are allowed, so you can say C:x(1..4) to change anywhere
@@ -450,7 +450,7 @@
 and these are equivalent to
 
 $string ~~ m/^ \d+: $/;
-$string ~~ m/^ ?ws \d+: ?ws $/;
+$string ~~ m/^ +ws \d+: +ws $/;
 
 =item *
 
@@ -584,7 +584,8 @@
 erroneous to assume either order happens consistently.  The C
 form guarantees left-to-right order, and backtracking makes the right
 argument vary faster than the left.  In other words, C and C|| establish
-sequence points.
+sequence points.  The left side may be backtracked into when backtracking
+is allowed into the construct as a whole.
 
 The C operator is list associative like C|, but has slightly
 tighter precedence.  Likewise C has slightly tighter precedence
@@ -1008,10 +1009,14 @@
 To pass a string with leading whitespace, or to interpolate any values
 into the string, you must use the parenthesized form.
 
-If the first character is a plus or minus, the initial identifier
-is taken as a character class, so the first character after the
-identifier doesn't matter in this case, and you can use whitespace
-however you like.  Therefore
+If the first character is a plus or minus, the rest of the assertion
+is parsed as a set of character classes (though the definition of
+character class is intentionally vague, and may include any other rule
+whether it matches characters exclusively or not).
+
+An initial identifier is taken as a character class, so the first
+character after the identifier doesn't matter in this case, and you
+can use whitespace however you like.  Therefore
 
 foo+bar-baz
 
@@ -1054,15 +1059,21 @@
 
 

Re: [svn:perl6-synopsis] r14447 - doc/trunk/design/syn

2007-09-06 Thread Jonathan Lang
Larry Wall wrote:
 Jonathan Lang wrote:
 : So 'orelse' is exactly like '//', except that the result of the left
 : side gets passed to the right side as an error message.  Is there a
 : reason to make this exception, as opposed to altering '//' to behave
 : exactly like 'orelse' does?

 How 'bout, it's convenient to have the simpler defaulting semantics
 when you don't care what kind of undefined value is on the left.

If you don't care what kind of undefined value is on the left, you
probably won't notice if it's passed to the right; and the optimizer
may very well end up implementing the simpler semantics in this case,
even if you use 'orelse'.  And if you explicitly want to use the
simpler semantics, you can come very close by blocking the right side
in a block that takes no arguments.

Conversely: if you do care about the undefined value, the subtle
distinction between '//' and 'orelse' (beyond the implied difference
in precedence) will be yet another language quirk that you'll have to
remember.  Very inconvenient.

--
Jonathan Dataweaver Lang


[svn:perl6-synopsis] r14453 - doc/trunk/design/syn

2007-09-06 Thread larry
Author: larry
Date: Thu Sep  6 20:56:39 2007
New Revision: 14453

Modified:
   doc/trunk/design/syn/S03.pod

Log:
fixed it right this time...


Modified: doc/trunk/design/syn/S03.pod
==
--- doc/trunk/design/syn/S03.pod(original)
+++ doc/trunk/design/syn/S03.podThu Sep  6 20:56:39 2007
@@ -3632,7 +3632,7 @@
 If you already have a capture variable, you can interpolate all of
 its bits at once using the C prefix:|  operator:
 
-my $capture := func();
+my (|$capture) := func();
 push |$capture;
 
 =head1 Traversing arrays in parallel


[svn:perl6-synopsis] r14452 - doc/trunk/design/syn

2007-09-06 Thread larry
Author: larry
Date: Thu Sep  6 20:50:04 2007
New Revision: 14452

Modified:
   doc/trunk/design/syn/S03.pod

Log:
extra \ found by sunnavy++


Modified: doc/trunk/design/syn/S03.pod
==
--- doc/trunk/design/syn/S03.pod(original)
+++ doc/trunk/design/syn/S03.podThu Sep  6 20:50:04 2007
@@ -3632,7 +3632,7 @@
 If you already have a capture variable, you can interpolate all of
 its bits at once using the C prefix:|  operator:
 
-my \$capture := func();
+my $capture := func();
 push |$capture;
 
 =head1 Traversing arrays in parallel


[svn:perl6-synopsis] r14451 - doc/trunk/design/syn

2007-09-06 Thread larry
Author: larry
Date: Thu Sep  6 20:36:26 2007
New Revision: 14451

Modified:
   doc/trunk/design/syn/S03.pod

Log:
typo spotted by sunnavy++


Modified: doc/trunk/design/syn/S03.pod
==
--- doc/trunk/design/syn/S03.pod(original)
+++ doc/trunk/design/syn/S03.podThu Sep  6 20:36:26 2007
@@ -2830,7 +2830,7 @@
 with C+, C~ and C? respectively to indicate whether the bitwise
 operation is done on a number, a string, or a single bit.
 But that's just a naming convention, and if you wanted to add a new
-bitwise C¬ operator, you'd have the add the C+¬, C~¬, and C?¬
+bitwise C¬ operator, you'd have to add the C+¬, C~¬, and C?¬
 operators yourself.  Similarly, the carets that exclude the endpoints
 on ranges are there by convention only.
 


Re: [svn:perl6-synopsis] r14447 - doc/trunk/design/syn

2007-09-06 Thread Larry Wall
On Thu, Sep 06, 2007 at 08:32:55PM -0400, Joe Gottman wrote:
:   Do the results of andthen and orelse really bind to ANY arguments of 
: the second block?  If the second block has two parameters it makes more 
: sense to me for the results to bind to the first parameter and nothing 
: to bind to the second parameter.

I suspect the left side is a Capture context, so test1() could return
multiple arguments to be bound to the right side.  In any case, if
test1() calls fail() then the orelse should fail regardless of whether
the orelse is in list or scalar context.  Should probably make the
orelse itself fail if the binding doesn't work too.

Larry


[svn:perl6-synopsis] r14450 - doc/trunk/design/syn

2007-09-06 Thread larry
Author: larry
Date: Thu Sep  6 19:29:21 2007
New Revision: 14450

Modified:
   doc/trunk/design/syn/S05.pod

Log:
typo from [particle]++


Modified: doc/trunk/design/syn/S05.pod
==
--- doc/trunk/design/syn/S05.pod(original)
+++ doc/trunk/design/syn/S05.podThu Sep  6 19:29:21 2007
@@ -1280,7 +1280,7 @@
 
 alpha # match a letter and capture in $alpha
 +alpha# match a letter, don't capture
-?alpha# much null before a letter, don't capture
+?alpha# match null before a letter, don't capture
 
 =item *
 


Re: [svn:perl6-synopsis] r14449 - doc/trunk/design/syn

2007-09-06 Thread jerry gay
On 9/6/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
 @@ -1254,6 +1273,17 @@

  =item *

 +A leading C? indicates a positive zero-width assertion, and like C!
 +merely reparses the rest of the assertion recursively as if the C?
 +were not there.  In addition to forcing zero-width, it also suppresses
 +any named capture:
 +
 +alpha # match a letter and capture in $alpha
 ++alpha# match a letter, don't capture
 +?alpha# much null before a letter, don't capture
 +
 +=item *
 +
  A leading C~~ indicates a recursive call back into some or all of
  the current rule.  An optional argument indicates which subpattern
  to re-use, and if provided must resolve to a single subpattern.

s/much/match/


Re: , , and backtracking.

2007-09-06 Thread Larry Wall
On Thu, Sep 06, 2007 at 04:02:19PM -0500, Patrick R. Michaud wrote:
: I agree.  One thought I had was that perhaps non-greedy matching
: could also terminate the token prefix.

Well, that's more or less arguing it the other way.  It kind of assumes
your fooba-ish arguments are smart enough to test for non-r after.

:  [...]
:  I think longest-token semantics have to trump minimal matching here,
:  and my argument is this.  Most uses of *? have additional information
:  on what terminates it, either implicitly in what it is matching, or
:  explicitly in the next bit of regex.  That is, you'd typically see
:  either
:  foo\w+? | fooba
:  or
:  foo.+? wb | fooba
:  
:  In either case, the clear intent is to match foobar over fooba.
:  Therefore I think the DFA matcher just strips ? and does its ordinary
:  character by character match, relying on that extra info to match
:  the real extent of the quantifier.
: 
: Does this still hold true for a non-greedy quantifier in the
: middle of an expression... ?  I.e.,
: 
: foobazbar deborah ~~ /foo .+? b.r | fooba | foobazb /
: 
: matches foobazbar debor ?

I suppose one approach would be to limit the DFA to the longest known
constant string, and then rerun any variable branches individually
to see if they will actually match something that long.  Alternately,
if the DFA engine is smart enough to kick out all the matches of the
foo .+? b.r branch then it could conceivably be taught to return
the shortest one rather than the longest one.  Course, the question
after than is, what if you mix minimal with maximal quantifiers?

Another consideration is the efficiency of matching that pattern
against foobazbar ~ 'b' x 1.  That would definitely run
faster if you cut off the DFA after the longest known token, or as
soon as the DFA reduces to one possibility, presuming you can know
such a thing.

Any way we work it, there will be a way to use it wrong.

Larry