Re: Does ::: constrain the pattern engine implementation?

2002-08-29 Thread Jerome Vouillon

On Wed, Aug 28, 2002 at 10:36:54AM -0700, Larry Wall wrote:
 That is a worthy consideration, but expressiveness takes precedence
 over it in this case.  DFAs are really only good for telling you
 *whether* and *where* a pattern matches as a whole.  They are
 relatively useless for telling you *how* a pattern matches.

Actually, DFAs can also tell you *how* a pattern matches matches.  For
instance the RE library (http://www.sourceforge.net/projects/libre/)
is DFA-based and support a lot of Perl 5 regular expression features:
- submatches
- left-most matching semantics
- greedy and non-greedy operators (*, *?, ?, ??)
- zero-length assertions such as ^, $, \b.

I don't know how to implement back-references and embedded Perl code
using a DFA.  Look-ahead and look-behind assertion can probably be
implemented, though this looks tricky.  But, these features are not
used that often.  So, I believe than most real-life Perl 5 regular
expressions can be handled by a DFA.

 Add to that the fact that most real-life patterns don't generally do
 much backtracking, because they're written to succeed, not to fail.
 This pattern never backtracks, for instance:
 
 my ($num) = /^Items: (\d+)/;

Another advantage of DFAs over NFAs is that the core of a DFA-based
pattern matching implementation is a small simple loop which is
executed very efficiently by modern CPUs.  On the other hand, the core
of a NFA-based implementation is much larger and much more complex,
and I'm not sure it is executed as efficiently.  In particular, there
is probably a lot more branch mispredictions.

As an example of what performance improvement one can sometimes
achieve using a DFA-based rather than a NFA-based implementation,
here are the results I get on my computer for the regular expression
benchmark from the Great Computer Language Shootout.
(http://www.bagley.org/~doug/shootout/bench/regexmatch/)

Perl 5  3.49s
OCaml with PCRE (NFA-based) 3.17s
OCaml with RE (DFA-based)   0.34s

-- Jerome



Does ::: constrain the pattern engine implementation?

2002-08-28 Thread Deven T. Corzine

I have no objection to pattern operators like ::: in principle, but I do
have a potential concern about them.

Given that the operators are actually defined in terms of backtracking 
within the RE engine, does this constrain the implementation such that it 
MUST be a backtracking implementation to behave correctly?

If these operators are purely effeciency optimization hints, that would be 
one thing, but I get the sense that ignoring the hints might lead to 
incorrect behavior.  (The cut operator might be a special concern.)

Suppose, for the sake of argument, that someone wanted to make a pattern 
engine implementation, compatible with Perl 6 patterns, which was highly 
optimized for speed at the expense of memory, by RE-NFA-DFA construction 
for simultaneous evaluation of multiple alternatives without backtracking.

This might be extremely expensive in memory, but there may be some niche 
applications where run-time speed is paramount, and a pattern is used so 
heavily in such a critical way that the user might be willing to expend 
hundreds of megabytes of RAM to make the patterns execute several times 
faster than normal.  (Obviously, such a tradeoff would be unacceptable in
the general case!)

Would it be _possible_ to create a non-backtracking implementation of a 
Perl 6 pattern engine, or does the existence of backtracking-related 
operators preclude this possibility in advance?

I hope we're not constraining the implementation options by the language 
design, but I'm worried that this might be the case with these operators.

Shouldn't it be an implementation decision whether to use backtracking?

Deven




Re: Does ::: constrain the pattern engine implementation?

2002-08-28 Thread Simon Cozens

[EMAIL PROTECTED] (Deven T. Corzine) writes:
 Would it be _possible_ to create a non-backtracking implementation of a 
 Perl 6 pattern engine

I don't believe that it is, but not just because of : and friends.
Why does it matter?

-- 
Life sucks, but it's better than the alternative.
-- Peter da Silva



Re: Does ::: constrain the pattern engine implementation?

2002-08-28 Thread Dan Sugalski

At 10:57 AM -0400 8/28/02, Deven T. Corzine wrote:
Would it be _possible_ to create a non-backtracking implementation of a
Perl 6 pattern engine, or does the existence of backtracking-related
operators preclude this possibility in advance?

In general, no of course it's not possible to create a 
non-backtracking perl regex engine. Far too much of perl's regexes 
requires backtracking.

That doesn't mean you can't write one for a specific subset of perl's 
regexes, though. A medium-term goal for the regex engine is to note 
where a DFA would give correct behaviour and use one, rather than 
going through the more expensive generalized regex engine we'd 
otherwise use.

If you want to head over to [EMAIL PROTECTED] and pitch in on 
the regex implementation (it's being worked on now) that'd be great.
-- 
 Dan

--it's like this---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
   teddy bears get drunk



Re: Does ::: constrain the pattern engine implementation?

2002-08-28 Thread Deven T. Corzine

On 28 Aug 2002, Simon Cozens wrote:

 [EMAIL PROTECTED] (Deven T. Corzine) writes:
  Would it be _possible_ to create a non-backtracking implementation of a 
  Perl 6 pattern engine
 
 I don't believe that it is, but not just because of : and friends.
 Why does it matter?

I'm not saying we should dump the operators -- if we get more power by 
assuming a backtracking implementation, maybe that's a worthwhile tradeoff.

On the other hand, if we can keep the implementation possibilities more 
open, that's always a worthwhile goal, even if we're not sure if or when 
we'll ever take advantage of those possibilities, or if we even could...

It seems like backtracking is a Bad Thing, in that it leads to reprocessing 
data that we've already looked at.  On the other hand, it seems to be a 
Necessary Evil because of the memory costs of avoiding backtracking, and 
because we might have to give up valuable features without backtracking.

It may be that backreferences already demand backtracking.  Or some other 
feature might.  I don't know; I haven't thought it through.

If we must have backtracking, so be it.  But if that's a tradeoff we're 
making for more expressive and powerful patterns, we should still at least 
make that tradeoff with our eyes open.  And if the tradeoff can be avoided, 
that's even better.

Deven





Re: Does ::: constrain the pattern engine implementation?

2002-08-28 Thread Deven T. Corzine

On Wed, 28 Aug 2002, Dan Sugalski wrote:

 At 10:57 AM -0400 8/28/02, Deven T. Corzine wrote:
 Would it be _possible_ to create a non-backtracking implementation of a
 Perl 6 pattern engine, or does the existence of backtracking-related
 operators preclude this possibility in advance?
 
 In general, no of course it's not possible to create a 
 non-backtracking perl regex engine. Far too much of perl's regexes 
 requires backtracking.

Given that Perl 5 regex's are no longer regular (much less Perl 6), I'm 
sure this is probably true.  There may be a regular subset which could be 
implemented without backtracking if problematic features are avoided, but 
surely a complete non-backtracking implementation is beyond reach.

On the other hand, :, ::, ::: and commit don't necessarily need to be a 
problem if they can be treated as hints that can be ignored.  If allowing 
the normal engine to backtrack despite the hints would change the results, 
that might be a problem.  I don't know; cut may pose special problems.

Even if the new operators can't work without backtracking, maybe it doesn't 
matter, since there's surely a few others inherited from Perl 5 as well...

 That doesn't mean you can't write one for a specific subset of perl's 
 regexes, though. A medium-term goal for the regex engine is to note 
 where a DFA would give correct behaviour and use one, rather than 
 going through the more expensive generalized regex engine we'd 
 otherwise use.

I think this is a more realistic goal, and more or less what I had in mind.

I believe there are many subpatterns which might be beneficial to compile 
to a DFA (or DFA-like) form, where runtime performance is important.  For 
example, if a pattern is matching dates, a (Jan|Feb|Mar|Apr|...) subpattern
would be more efficient to implement as a DFA than with backtracking.  With 
a large amount of data to process, that represent significant savings...

 If you want to head over to [EMAIL PROTECTED] and pitch in on 
 the regex implementation (it's being worked on now) that'd be great.

I'd like to do that, if I can find the time.  It would be interesting to 
make a small experimental prototype to see if DFA construction could really 
improve performance over backtracking, but it would probably need to be a 
very restricted subset of regex operations to test the idea...

However, while I'm still on perl6-language, I have two language issues to 
discuss first:

(1) Can we have a :study modifier in Perl 6 for patterns?

It could be a no-op if necessary, but it could take the place of Perl 5's 
study operator and indicate that the programmer WANTS the pattern 
optimized for maximum runtime speed, even at the cost of compile time or 
memory.  (Hey, how about a :cram modifier for extreme optimization? :-)

(2) Would simple alternation impede DFA-style optimizations?

Currently, a pattern like (Jun|June) would never match June because the 
leftmost match Jun would always take precedence, despite the normal 
longest-match behavior of regexes in general.  This example could be 
implemented in a DFA; would that always be the case?

Would it be better for the matching of (Jun|June) to be undefined and 
implementation-dependent?  Or is it best to require leftmost semantics?

Deven




Re: Does ::: constrain the pattern engine implementation?

2002-08-28 Thread Sean O'Rourke

On Wed, 28 Aug 2002, Deven T. Corzine wrote:
 Would it be better for the matching of (Jun|June) to be undefined and
 implementation-dependent?  Or is it best to require leftmost semantics?

For an alternation spelled out explicitly in the pattern, it seems like
undefined matching would be confusing.  I regularly order the branches of
regexes assuming they are tried left-to-right.

On the other hand, and on a related note of constrained implementation, do
we require leftmost matching for interpolated arrays of literals (e.g.
/x/)?  If, as with hyper-operators, we said the order of evaluation is
undefined, we could use a fast algorithm (Aho-Corasick?) that doesn't
preserve order.

/s




Re: Does ::: constrain the pattern engine implementation?

2002-08-28 Thread Larry Wall

On Wed, 28 Aug 2002, Deven T. Corzine wrote:
: I'm not saying we should dump the operators -- if we get more power by 
: assuming a backtracking implementation, maybe that's a worthwhile tradeoff.
: 
: On the other hand, if we can keep the implementation possibilities more 
: open, that's always a worthwhile goal, even if we're not sure if or when 
: we'll ever take advantage of those possibilities, or if we even could...

That is a worthy consideration, but expressiveness takes precedence
over it in this case.  DFAs are really only good for telling you
*whether* and *where* a pattern matches as a whole.  They are
relatively useless for telling you *how* a pattern matches.
For instance, a DFA can tell you that you have a valid computer
program, but can't hand you back the syntax tree, because it has
no way to decide between shifting and reducing.  It has to do both
simultaneously.

: It seems like backtracking is a Bad Thing, in that it leads to reprocessing 
: data that we've already looked at.  On the other hand, it seems to be a 
: Necessary Evil because of the memory costs of avoiding backtracking, and 
: because we might have to give up valuable features without backtracking.
: 
: It may be that backreferences already demand backtracking.  Or some other 
: feature might.  I don't know; I haven't thought it through.

I believe you are correct that backrefs require backtracking.  Maybe some smart
person will find a way to trace back through the states by which a DFA matched
to retrieve backref info, but that's probably worth a PhD or two.

Minimal matching is also difficult in a DFA, I suspect.

: If we must have backtracking, so be it.  But if that's a tradeoff we're 
: making for more expressive and powerful patterns, we should still at least 
: make that tradeoff with our eyes open.  And if the tradeoff can be avoided, 
: that's even better.

I refer you to http:://history.org/grandmother/teach/egg/suck.  :-)

That's a tradeoff I knowingly made in Perl 1.  I saw that awk had a
DFA implementation, and suffered for it as a useful tool.

And it's not just the backrefs.  An NFA can easily incorporate
strategies such as Boyer-Moore, which actually skips looking at many of
the characters, or the scream algorithm used by study, which can skip
even more.  All the DFAs I've seen have to look at every character,
albeit only once.  I suppose it's theoretically possible to compile
to a Boyer-Moore-ish state machine, but I've never seen it done.

Add to that the fact that most real-life patterns don't generally do
much backtracking, because they're written to succeed, not to fail.
This pattern never backtracks, for instance:

my ($num) = /^Items: (\d+)/;

I'm not against applying a DFA implementation where it's useful
and practical, but just because it's the best in some limited
theoretical framework doesn't cut it for me.  Humans do a great
deal of backtracking in real life, once the limits of their parallel
pattern matching circuits are exceeded.  Even in language we often
have to reparse sentences that are garden pathological.  Why should
computers be exempt? :-)

Larry




Re: Does ::: constrain the pattern engine implementation?

2002-08-28 Thread Larry Wall

On Wed, 28 Aug 2002, Deven T. Corzine wrote:
: I'd like to do that, if I can find the time.  It would be interesting to 
: make a small experimental prototype to see if DFA construction could really 
: improve performance over backtracking, but it would probably need to be a 
: very restricted subset of regex operations to test the idea...

That'd be cool.

: However, while I'm still on perl6-language, I have two language issues to 
: discuss first:
: 
: (1) Can we have a :study modifier in Perl 6 for patterns?
: 
: It could be a no-op if necessary, but it could take the place of Perl 5's 
: study operator and indicate that the programmer WANTS the pattern 
: optimized for maximum runtime speed, even at the cost of compile time or 
: memory.  (Hey, how about a :cram modifier for extreme optimization? :-)

Well, studied isn't really a property of a pattern--it's a property of a
string that knows it will have multiple patterns matched against it.  One
could put a :study on the first pattern, but that's somewhat deceptive.

: (2) Would simple alternation impede DFA-style optimizations?
: 
: Currently, a pattern like (Jun|June) would never match June because the 
: leftmost match Jun would always take precedence, despite the normal 
: longest-match behavior of regexes in general.  This example could be 
: implemented in a DFA; would that always be the case?

Well, June can match if what follows fails to match after Jun.

: Would it be better for the matching of (Jun|June) to be undefined and 
: implementation-dependent?  Or is it best to require leftmost semantics?

Well, the semantics shouldn't generally wobble around like that, but it'd
be pretty easy to let them wobble on purpose via pragma (or via :modifier,
which are really just pragmas that scope to regex groups).

Larry




Re: Does ::: constrain the pattern engine implementation?

2002-08-28 Thread Dan Sugalski

At 10:36 AM -0700 8/28/02, Larry Wall wrote:
On Wed, 28 Aug 2002, Deven T. Corzine wrote:
: I'm not saying we should dump the operators -- if we get more power by
: assuming a backtracking implementation, maybe that's a worthwhile tradeoff.
:
: On the other hand, if we can keep the implementation possibilities more
: open, that's always a worthwhile goal, even if we're not sure if or when
: we'll ever take advantage of those possibilities, or if we even could...

That is a worthy consideration, but expressiveness takes precedence
over it in this case.  DFAs are really only good for telling you
*whether* and *where* a pattern matches as a whole.  They are
relatively useless for telling you *how* a pattern matches.
For instance, a DFA can tell you that you have a valid computer
program, but can't hand you back the syntax tree, because it has
no way to decide between shifting and reducing.  It has to do both
simultaneously.

While true, there are reasonably common cases where you don't care 
about what or where, just whether. For a set of mushed-together 
examples:

while () {
last if /END_OF_DATA/;
$line .= $_ if /=$/;
next unless /$user_entered_string/;
}

Sure, it's a restricted subset of the stuff people do, and that's 
cool. I'd not even want to put in DFA-detecting code in the main 
regex compilation grammar. But in those cases where it is useful, a 
:dfa switch for regexes would be nifty.


(Though *please* don't yet--we've not gotten the current grammar 
fully implemented :)
-- 
 Dan

--it's like this---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
   teddy bears get drunk



Re: Does ::: constrain the pattern engine implementation?

2002-08-28 Thread Deven T. Corzine

On Wed, 28 Aug 2002, Larry Wall wrote:

 : (1) Can we have a :study modifier in Perl 6 for patterns?
 : 
 : It could be a no-op if necessary, but it could take the place of Perl 5's 
 : study operator and indicate that the programmer WANTS the pattern 
 : optimized for maximum runtime speed, even at the cost of compile time or 
 : memory.  (Hey, how about a :cram modifier for extreme optimization? :-)
 
 Well, studied isn't really a property of a pattern--it's a property of a
 string that knows it will have multiple patterns matched against it.  One
 could put a :study on the first pattern, but that's somewhat deceptive.

Oh yeah.  I forgot it applied to the string, not the pattern!  I forgot 
since I never use it! :-)  Still, it could be considered a parallel...

Perhaps a better approach would be to allow the optimization priorities to 
be specified, perhaps even as numerical ranges for relative importance?  
The three obvious dimensions to quantify would be compile time, run-time 
speed, and memory usage.  There's often tradeoffs between these three, and 
allowing the ability for a programmer to specify his/her preferences could 
allow for aggressive optimizations that are normally inappropriate...

Of course, these would be useful not only as modifiers for compiling any 
regexes, but as general pragmas controlling optimizing behavior of the 
entire Perl 6 compiler/optimizer...

I'm not sure if it's good enough to just say optimize for run-time speed 
at the expense of compile time and memory (or variations for only having 
one of the two sacrificed) -- or it it's better to have a scale (say, 0-9) 
for how important each dimension is.

For the extreme case where long compile time and large memory usage is 
irrelevant, but extreme run-time speed is a must, the programmer might 
specify optimization priorities of compile=0, memory=0, speed=9.  I'm not 
sure what sort of syntax would be appropriate for such specifications...

 : (2) Would simple alternation impede DFA-style optimizations?
 : 
 : Currently, a pattern like (Jun|June) would never match June because the 
 : leftmost match Jun would always take precedence, despite the normal 
 : longest-match behavior of regexes in general.  This example could be 
 : implemented in a DFA; would that always be the case?
 
 Well, June can match if what follows fails to match after Jun.

True enough.  Couldn't that still be implemented in a DFA?  (Possibly at 
the cost of doubling the size of the DFA for the later part of the regex!)

 : Would it be better for the matching of (Jun|June) to be undefined and 
 : implementation-dependent?  Or is it best to require leftmost semantics?
 
 Well, the semantics shouldn't generally wobble around like that, but it'd
 be pretty easy to let them wobble on purpose via pragma (or via :modifier,
 which are really just pragmas that scope to regex groups).

Yeah, it's probably safer not to have that much room for undefined behavior 
since people will just try it and assume that their implementation is the 
universal behavior...

Would there be a good way to say don't care about the longest-vs-leftmost 
matching semantics?  Would it be worthwhile to have longest-trumps-leftmost 
as an optional modifier?  (This might be very expensive if implemented in a 
backtracking engine, since it could no longer shortcut alternations...)

Dan suggested :dfa for DFA semantics -- is that the best answer, or would 
it be better to define the modifiers in terms of visible behavior rather 
than implementation, if possible?

Deven




Re: Does ::: constrain the pattern engine implementation?

2002-08-28 Thread Steve Fink

On Wed, Aug 28, 2002 at 12:55:44PM -0400, Deven T. Corzine wrote:
 On Wed, 28 Aug 2002, Dan Sugalski wrote:
  At 10:57 AM -0400 8/28/02, Deven T. Corzine wrote:
 
 On the other hand, :, ::, ::: and commit don't necessarily need to be a 
 problem if they can be treated as hints that can be ignored.  If allowing 
 the normal engine to backtrack despite the hints would change the results, 
 that might be a problem.  I don't know; cut may pose special problems.

They do change the semantics.

 (June|Jun) ebug

matches the string Junebug, but

 (June|Jun): ebug

does not. Similarly,

 (June|Jun) ebug
 (June|Jun) :: ebug
 (June|Jun) ::: ebug
 (June|Jun) commit ebug

all behave differently when embedded in a larger grammar.

However, it is very possible that in many (the majority?) of actual
uses, they may be intended purely as optimizations and so any
differing behavior is unintentional. It may be worth having a flag
noting that (maybe combined with a warning you said this :: isn't
supposed to change what can match, but it does.)

  That doesn't mean you can't write one for a specific subset of perl's 
  regexes, though. A medium-term goal for the regex engine is to note 
  where a DFA would give correct behaviour and use one, rather than 
  going through the more expensive generalized regex engine we'd 
  otherwise use.
 
 I think this is a more realistic goal, and more or less what I had in mind.
 
 I believe there are many subpatterns which might be beneficial to compile 
 to a DFA (or DFA-like) form, where runtime performance is important.  For 
 example, if a pattern is matching dates, a (Jan|Feb|Mar|Apr|...) subpattern
 would be more efficient to implement as a DFA than with backtracking.  With 
 a large amount of data to process, that represent significant savings...

I agree. I will reply to this in perl6-internals, though.

 (2) Would simple alternation impede DFA-style optimizations?
 
 Currently, a pattern like (Jun|June) would never match June because the 
 leftmost match Jun would always take precedence, despite the normal 
 longest-match behavior of regexes in general.  This example could be 
 implemented in a DFA; would that always be the case?

You should read Friedl's Mastering Regular Expressions, if you haven't
already. A POSIX NFA would be required to find the longest match (it
has to work as if it were a DFA). A traditional NFA produces what
would result from the straightforward backtracking implementation,
which often gives an answer closer to what the user expects. IMO,
these are often different, and the DFA would surprise users fairly
often.

 Would it be better for the matching of (Jun|June) to be undefined and 
 implementation-dependent?  Or is it best to require leftmost semantics?

I'm voting leftmost, because I've frequently seen people depend on
it. I'm not so sure that Larry's suggestion of adding a :dfa flag is
always the right approach, because I think this might actually be
something you'd want to set for subsets of a grammar or a single
expression. I don't think it's useful enough to go as far as proposing
that || mean alternate without defining the order of preference, but
perhaps some angle-bracketed thing would work. (Or can you embed
flags in expressions, like perl5's (?imsx:R) thing? Then the :dfa flag
is of course adequate!)



Re: Does ::: constrain the pattern engine implementation?

2002-08-28 Thread Deven T. Corzine

On Wed, 28 Aug 2002, Larry Wall wrote:

 That is a worthy consideration, but expressiveness takes precedence
 over it in this case.

I see nothing wrong with expressiveness taking precedence -- I'm only 
saying that it would be best to be cognizant of any restrictions we're 
hardcoding into the design (in case there's a less restrictive means to the 
same ends) and make that design tradeoff knowingly rather than by default.

If we can find general solutions that don't demand a particular style of 
implementation, that's probably an improvement.  There may be unavoidable 
cases, in which case we decide to accept the limitation for expressiveness, 
and that's a perfectly reasonable design choice.

I'd just hate to ignore the issue now, and have someone later say here's a 
great way it could have been done that would have allowed this improvement 
in the implementation...

 DFAs are really only good for telling you *whether* and *where* a pattern 
 matches as a whole.  They are relatively useless for telling you *how* a 
 pattern matches.  For instance, a DFA can tell you that you have a valid 
 computer program, but can't hand you back the syntax tree, because it has
 no way to decide between shifting and reducing.  It has to do both
 simultaneously.

Yes and no.  You're right, but see below.

 : It may be that backreferences already demand backtracking.  Or some other 
 : feature might.  I don't know; I haven't thought it through.
 
 I believe you are correct that backrefs require backtracking.  Maybe some 
 smart person will find a way to trace back through the states by which a 
 DFA matched to retrieve backref info, but that's probably worth a PhD or 
 two.

Well, there are certainly PhD students out there doing new research all the 
time.  Who knows what one will come up with one day?  It would suck if one 
gets a PhD for a super-clever pattern-matching algorithm, and we find that 
we can't use it because of hardcoded assumptions in the language design...

As for backtracing states of the DFA, see below.

 Minimal matching is also difficult in a DFA, I suspect.

Is it?  I'm not sure.  Since the DFA effectively follows all branches of 
the NFA at once, perhaps minimal matching is no more dificult than maximal?

Then again, maybe not. :-)

 : If we must have backtracking, so be it.  But if that's a tradeoff we're 
 : making for more expressive and powerful patterns, we should still at least 
 : make that tradeoff with our eyes open.  And if the tradeoff can be avoided, 
 : that's even better.
 
 I refer you to http:://history.org/grandmother/teach/egg/suck.  :-)

Um, huh?

 That's a tradeoff I knowingly made in Perl 1.  I saw that awk had a
 DFA implementation, and suffered for it as a useful tool.

I suspect it's not practical to have an all-DFA implementation with nearly 
the power and expressiveness of Perl 4 or Perl 5 regexes, much less Perl 6.

On the other hand, many patterns have subpatterns which might benefit from 
using a DFA as an optimization.  You don't lose the expressiveness here,
if the backtracking NFA is available as well.

I'm just asking that we consider the impact on such optimizations, and see 
if we can leave the door open to reap the benefits without compromising the 
power and expressiveness we all want.  Maybe this just amounts to adding a 
few modifiers to allow semantic variants (like longest-trumps-leftmost), to 
enable optimizations that would otherwise impinge on correctness...

 And it's not just the backrefs.  An NFA can easily incorporate
 strategies such as Boyer-Moore, which actually skips looking at many of
 the characters, or the scream algorithm used by study, which can skip
 even more.  All the DFAs I've seen have to look at every character,
 albeit only once.  I suppose it's theoretically possible to compile
 to a Boyer-Moore-ish state machine, but I've never seen it done.

Okay, I confess that I've been saying DFA when I don't necessarily mean 
precisely that.  What I really mean is a non-backtracking state machine 
of some sort, but I'm calling it a DFA because it would be similar to one 
(to the degree possible) and people know what a DFA is.  I could say NBSM, 
but that seems confusing. :-)

Your objections to the limitations of a DFA are quite correct, of course.  
Modifications would be required to overcome the limits, and then it's no 
longer really a DFA, just like Perl 5's regular expressions are no longer 
really regular expressions in the mathematical sense.  I'm envisioning a 
state machine of some sort, which has a lot in common with a DFA but isn't 
strictly a DFA anymore.  If you prefer, I'll call it an NBSM, or I'm open 
to better suggestions for a name!

Anyway, to respond to your objections to a DFA:

* While you couldn't hand back a syntax tree from a true DFA, it should be
  possible to create an NBSM from a DFA recognizer, modified to record 
  whatever extra information is needed to execute the code that constructs 
  the syntax tree.  The 

Re: Does ::: constrain the pattern engine implementation?

2002-08-28 Thread Deven T. Corzine

On Wed, 28 Aug 2002, Steve Fink wrote:

 On Wed, Aug 28, 2002 at 12:55:44PM -0400, Deven T. Corzine wrote:
  On Wed, 28 Aug 2002, Dan Sugalski wrote:
   At 10:57 AM -0400 8/28/02, Deven T. Corzine wrote:
  
  On the other hand, :, ::, ::: and commit don't necessarily need to be a 
  problem if they can be treated as hints that can be ignored.  If allowing 
  the normal engine to backtrack despite the hints would change the results, 
  that might be a problem.  I don't know; cut may pose special problems.
 
 They do change the semantics.
 
  (June|Jun) ebug
 
 matches the string Junebug, but
 
  (June|Jun): ebug
 
 does not. Similarly,
 
  (June|Jun) ebug
  (June|Jun) :: ebug
  (June|Jun) ::: ebug
  (June|Jun) commit ebug
 
 all behave differently when embedded in a larger grammar.

Good point.  Okay, they definitely change the semantics.  Still, could such 
semantics be implemented in a non-backtracking state machine, whether or 
not it's a strict DFA?

 However, it is very possible that in many (the majority?) of actual
 uses, they may be intended purely as optimizations and so any
 differing behavior is unintentional. It may be worth having a flag
 noting that (maybe combined with a warning you said this :: isn't
 supposed to change what can match, but it does.)

I think this is like the leftmost matching semantics -- it may exist for 
the sake of implementation efficiency, yet it has semantic consequences as 
well.  In many cases, those semantic differences may be immaterial, yet 
some code relies on it.  Allowing flags to specify that such differences in
semantics are immaterial to your pattern would be helpful.  (Would it make 
sense for one flag to say don't care to the semantic differences for BOTH 
leftmost matching and the :/::/:::/etc. operators?)

  I believe there are many subpatterns which might be beneficial to compile 
  to a DFA (or DFA-like) form, where runtime performance is important.  For 
  example, if a pattern is matching dates, a (Jan|Feb|Mar|Apr|...) subpattern
  would be more efficient to implement as a DFA than with backtracking.  With 
  a large amount of data to process, that represent significant savings...
 
 I agree. I will reply to this in perl6-internals, though.

Yes, the discussion of details belongs there, when it's not infringing on 
issues of language design, as the semantic consequences do...

  (2) Would simple alternation impede DFA-style optimizations?
  
  Currently, a pattern like (Jun|June) would never match June because the 
  leftmost match Jun would always take precedence, despite the normal 
  longest-match behavior of regexes in general.  This example could be 
  implemented in a DFA; would that always be the case?
 
 You should read Friedl's Mastering Regular Expressions, if you haven't
 already. A POSIX NFA would be required to find the longest match (it
 has to work as if it were a DFA). A traditional NFA produces what
 would result from the straightforward backtracking implementation,
 which often gives an answer closer to what the user expects. IMO,
 these are often different, and the DFA would surprise users fairly
 often.

Rather than following a traditional approach to NFA/DFA construction, would 
it be possible to use a modified approach that preserves leftmost matching?
(If so, would it be more expensive or just different?)

  Would it be better for the matching of (Jun|June) to be undefined and 
  implementation-dependent?  Or is it best to require leftmost semantics?
 
 I'm voting leftmost, because I've frequently seen people depend on it.

I was going to agree, until I read your next paragraph.

However, it would be useful to be able to say don't care to the semantic 
distinction -- it might even be useful to be able to demand longest-match 
take precedence over leftmost matching, but that would incur an extra cost 
in the normal regex engine...

 I'm not so sure that Larry's suggestion of adding a :dfa flag is
 always the right approach, because I think this might actually be
 something you'd want to set for subsets of a grammar or a single
 expression. I don't think it's useful enough to go as far as proposing
 that || mean alternate without defining the order of preference, but
 perhaps some angle-bracketed thing would work. (Or can you embed
 flags in expressions, like perl5's (?imsx:R) thing? Then the :dfa flag
 is of course adequate!)

You know, when you bring up an idea like ||, I start thinking that maybe 
the default should be NOT to have a preference (since it normally doesn't 
matter) and to only guarantee the leftmost short-circuit behavior with || 
instead of |.  That would allow for more implementation flexibility, and 
provide a beautiful parallel with C -- in C, only || short-circuits and 
the | operator still evaluates all parts.  (Granted, that's because it's 
bitwise, but there's still a nice parallel there.)

For the few cases where someone wants to rely on leftmost matching of any 
alternation, they could simply use ||