Re: , , and backtracking.

2007-09-06 Thread Jonathan Lang
Jonathan Scott Duff wrote:
 How do C and C differ with respect to backtracking?  For instance,

 foobar ~~ / [a..z]+  [ ... ] /;

 Both sides of the C happen in parallel, so I would guess that they
 both match foo then stop. Please correct me if that's wrong.

As written, this match would fail, since '[a..z]+' would match
foobar while '[ ... ]' would match foo.  '' requires that both
sides match the same start and end points.  I suppose that it might be
worth considering requiring only the start points to match, and having
the conjoined match use the earliest end point; so that the above
match would then match foo instead of failing.  But it's entirely
possible that there's something faulty with this approach.

The difference between '' and '' is that '' evaluates the two
sides in an arbitrary order, possibly even in parallel; '' always
evaluates the left side first, and only bothers with the right side if
the left side matched.

-- 
Jonathan Dataweaver Lang


Re: , , and backtracking.

2007-09-06 Thread Larry Wall
On Thu, Sep 06, 2007 at 01:25:12PM -0500, Patrick R. Michaud wrote:
:  Were we using the procedural conjunction:
:  
:  foobar ~~ / [a..z]+  [ ... ] /;
:  
:  I would guess that the LHS matches as much as it can (foobar), then
:  the RHS matches foo [...and then backtracks the LHS until a 
:  conjunctional match is found...]
: 
:  Or it's much simpler than that and both of the regexes above just fail
:  because of the greediness of C+ and there is no intra-conjunction
:  backtracking.
: 
: I think we definitely allow intra-conjunction backtracking.
: PGE implements it that way.

That's what I think.

: On a somewhat similar question, what happens with a pattern
: such as
: 
: foobar ~~ / foo.+? | fooba /
: 
: The LHS initially matches foob, but with backtracking could
: eventually match foobar.  Do the longest-token semantics
: in this case cause the RHS to be dispatched first, even
: though the token declaration of the LHS _could_ match a 
: longer token prefix?  

Yow.  ICATBW.  Non-greedy matching is somewhat antithetical to
longest-token matching.  But basically it boils down to this:
Does the longest-token matcher ignore the ?  and do

foo.+ | fooba

or is there an implicit ordering above and beyond the DFA engine of

foob | fooba ||
fooba | fooba ||
foobar | fooba ||

I think longest-token semantics have to trump minimal matching here,
and my argument is this.  Most uses of *? have additional information
on what terminates it, either implicitly in what it is matching, or
explicitly in the next bit of regex.  That is, you'd typically see
either

foo\w+? | fooba

or

foo.+? wb | fooba

In either case, the clear intent is to match foobar over fooba.
Therefore I think the DFA matcher just strips ? and does its ordinary
character by character match, relying on that extra info to match
the real extent of the quantifier.

Larry


Re: , , and backtracking.

2007-09-06 Thread Geoffrey Broadwell
On Thu, 2007-09-06 at 12:37 -0700, Larry Wall wrote:
 Yow.  ICATBW.

The what now?


-'f




Re: , , and backtracking.

2007-09-06 Thread Patrick R. Michaud
On Thu, Sep 06, 2007 at 12:37:37PM -0700, Larry Wall wrote:
 On Thu, Sep 06, 2007 at 01:25:12PM -0500, Patrick R. Michaud wrote:
 : On a somewhat similar question, what happens with a pattern
 : such as
 : 
 : foobar ~~ / foo.+? | fooba /
 : 
 : The LHS initially matches foob, but with backtracking could
 : eventually match foobar.  Do the longest-token semantics
 : in this case cause the RHS to be dispatched first, even
 : though the token declaration of the LHS _could_ match a 
 : longer token prefix?  
 
 Yow.  ICATBW.  Non-greedy matching is somewhat antithetical to
 longest-token matching.  

I agree.  One thought I had was that perhaps non-greedy matching
could also terminate the token prefix.

 [...]
 I think longest-token semantics have to trump minimal matching here,
 and my argument is this.  Most uses of *? have additional information
 on what terminates it, either implicitly in what it is matching, or
 explicitly in the next bit of regex.  That is, you'd typically see
 either
 foo\w+? | fooba
 or
 foo.+? wb | fooba
 
 In either case, the clear intent is to match foobar over fooba.
 Therefore I think the DFA matcher just strips ? and does its ordinary
 character by character match, relying on that extra info to match
 the real extent of the quantifier.

Does this still hold true for a non-greedy quantifier in the
middle of an expression... ?  I.e.,

foobazbar deborah ~~ /foo .+? b.r | fooba | foobazb /

matches foobazbar debor ?

(I completely grant that the examples I'm coming up with here
may be completely nonsensical in real application, but I'm
just exploring the space a bit.)

Pm


Re: , , and backtracking.

2007-09-06 Thread Larry Wall
On Thu, Sep 06, 2007 at 03:49:42PM -0700, Larry Wall wrote:
: On Thu, Sep 06, 2007 at 04:02:19PM -0500, Patrick R. Michaud wrote:
: : I agree.  One thought I had was that perhaps non-greedy matching
: : could also terminate the token prefix.
: 
: Well, that's more or less arguing it the other way.  It kind of assumes
: your fooba-ish arguments are smart enough to test for non-r after.

The more I think about it, the more I like this approach.  Most token
should be matched in ratchet mode, and I think minimal matching
is perhaps a form of cheating: Look for the nearest end of this
without me actually bothering to parse the middle.  And certainly
any reasonable grammer is going to be checking that a keyword is
properly terminated by a word boundary anyway.

Larry


Re: , , and backtracking.

2007-09-06 Thread Larry Wall
On Thu, Sep 06, 2007 at 04:02:19PM -0500, Patrick R. Michaud wrote:
: I agree.  One thought I had was that perhaps non-greedy matching
: could also terminate the token prefix.

Well, that's more or less arguing it the other way.  It kind of assumes
your fooba-ish arguments are smart enough to test for non-r after.

:  [...]
:  I think longest-token semantics have to trump minimal matching here,
:  and my argument is this.  Most uses of *? have additional information
:  on what terminates it, either implicitly in what it is matching, or
:  explicitly in the next bit of regex.  That is, you'd typically see
:  either
:  foo\w+? | fooba
:  or
:  foo.+? wb | fooba
:  
:  In either case, the clear intent is to match foobar over fooba.
:  Therefore I think the DFA matcher just strips ? and does its ordinary
:  character by character match, relying on that extra info to match
:  the real extent of the quantifier.
: 
: Does this still hold true for a non-greedy quantifier in the
: middle of an expression... ?  I.e.,
: 
: foobazbar deborah ~~ /foo .+? b.r | fooba | foobazb /
: 
: matches foobazbar debor ?

I suppose one approach would be to limit the DFA to the longest known
constant string, and then rerun any variable branches individually
to see if they will actually match something that long.  Alternately,
if the DFA engine is smart enough to kick out all the matches of the
foo .+? b.r branch then it could conceivably be taught to return
the shortest one rather than the longest one.  Course, the question
after than is, what if you mix minimal with maximal quantifiers?

Another consideration is the efficiency of matching that pattern
against foobazbar ~ 'b' x 1.  That would definitely run
faster if you cut off the DFA after the longest known token, or as
soon as the DFA reduces to one possibility, presuming you can know
such a thing.

Any way we work it, there will be a way to use it wrong.

Larry