Does ::: constrain the pattern engine implementation?

2002-08-28 Thread Deven T. Corzine

I have no objection to pattern operators like ::: in principle, but I do
have a potential concern about them.

Given that the operators are actually defined in terms of backtracking 
within the RE engine, does this constrain the implementation such that it 
MUST be a backtracking implementation to behave correctly?

If these operators are purely efficiency-optimization hints, that would be 
one thing, but I get the sense that ignoring the hints might lead to 
incorrect behavior.  (The cut operator might be a special concern.)

Suppose, for the sake of argument, that someone wanted to make a pattern 
engine implementation, compatible with Perl 6 patterns, which was highly 
optimized for speed at the expense of memory, by RE-NFA-DFA construction 
for simultaneous evaluation of multiple alternatives without backtracking.

This might be extremely expensive in memory, but there may be some niche 
applications where run-time speed is paramount, and a pattern is used so 
heavily in such a critical way that the user might be willing to expend 
hundreds of megabytes of RAM to make the patterns execute several times 
faster than normal.  (Obviously, such a tradeoff would be unacceptable in
the general case!)
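The "simultaneous evaluation of multiple alternatives without backtracking" idea can be sketched in a few lines of Python (a toy illustration of the strategy, not anything from an actual Perl engine): a set of live (alternative, position) states is advanced across the input in lockstep, so no character is ever re-read.

```python
def match_any_prefix(alternatives, text):
    """Return True if some alternative matches a prefix of text.

    All alternatives advance together as an NFA-style state set, so
    each character of `text` is examined exactly once -- trading the
    memory for the live-state set against any need to backtrack.
    """
    live = {(alt, 0) for alt in alternatives}        # every branch starts live
    for ch in text:
        if any(pos == len(alt) for alt, pos in live):
            return True                              # some branch finished
        live = {(alt, pos + 1) for alt, pos in live
                if pos < len(alt) and alt[pos] == ch}
        if not live:
            return False                             # every branch died
    return any(pos == len(alt) for alt, pos in live)
```

For example, `match_any_prefix(['Jan', 'Feb', 'Mar'], 'February')` walks "Feb" and the dead branches in a single left-to-right pass.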

Would it be _possible_ to create a non-backtracking implementation of a 
Perl 6 pattern engine, or does the existence of backtracking-related 
operators preclude this possibility in advance?

I hope we're not constraining the implementation options by the language 
design, but I'm worried that this might be the case with these operators.

Shouldn't it be an implementation decision whether to use backtracking?

Deven




Re: Does ::: constrain the pattern engine implementation?

2002-08-28 Thread Deven T. Corzine

On 28 Aug 2002, Simon Cozens wrote:

 [EMAIL PROTECTED] (Deven T. Corzine) writes:
  Would it be _possible_ to create a non-backtracking implementation of a 
  Perl 6 pattern engine
 
 I don't believe that it is, but not just because of : and friends.
 Why does it matter?

I'm not saying we should dump the operators -- if we get more power by 
assuming a backtracking implementation, maybe that's a worthwhile tradeoff.

On the other hand, if we can keep the implementation possibilities more 
open, that's always a worthwhile goal, even if we're not sure if or when 
we'll ever take advantage of those possibilities, or if we even could...

It seems like backtracking is a Bad Thing, in that it leads to reprocessing 
data that we've already looked at.  On the other hand, it seems to be a 
Necessary Evil because of the memory costs of avoiding backtracking, and 
because we might have to give up valuable features without backtracking.

It may be that backreferences already demand backtracking.  Or some other 
feature might.  I don't know; I haven't thought it through.

If we must have backtracking, so be it.  But if that's a tradeoff we're 
making for more expressive and powerful patterns, we should still at least 
make that tradeoff with our eyes open.  And if the tradeoff can be avoided, 
that's even better.

Deven





Re: Does ::: constrain the pattern engine implementation?

2002-08-28 Thread Deven T. Corzine

On Wed, 28 Aug 2002, Dan Sugalski wrote:

 At 10:57 AM -0400 8/28/02, Deven T. Corzine wrote:
 Would it be _possible_ to create a non-backtracking implementation of a
 Perl 6 pattern engine, or does the existence of backtracking-related
 operators preclude this possibility in advance?
 
 In general, no of course it's not possible to create a 
 non-backtracking perl regex engine. Far too much of perl's regexes 
 requires backtracking.

Given that Perl 5 regexes are no longer regular (much less Perl 6's), I'm 
sure this is probably true.  There may be a regular subset which could be 
implemented without backtracking if problematic features are avoided, but 
surely a complete non-backtracking implementation is beyond reach.

On the other hand, :, ::, ::: and commit don't necessarily need to be a 
problem if they can be treated as hints that can be ignored.  If allowing 
the normal engine to backtrack despite the hints would change the results, 
that might be a problem.  I don't know; cut may pose special problems.

Even if the new operators can't work without backtracking, maybe it doesn't 
matter, since there are surely a few others inherited from Perl 5 as well...

 That doesn't mean you can't write one for a specific subset of perl's 
 regexes, though. A medium-term goal for the regex engine is to note 
 where a DFA would give correct behaviour and use one, rather than 
 going through the more expensive generalized regex engine we'd 
 otherwise use.

I think this is a more realistic goal, and more or less what I had in mind.

I believe there are many subpatterns which it might be beneficial to compile 
to a DFA (or DFA-like) form, where runtime performance is important.  For 
example, if a pattern is matching dates, a (Jan|Feb|Mar|Apr|...) subpattern
would be more efficient to implement as a DFA than with backtracking.  With 
a large amount of data to process, that could represent significant savings...
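A fixed alternation like the month example compiles naturally into a trie walked character by character, which is exactly a small DFA. Here is a hypothetical Python sketch (names are mine, not from any Perl source):

```python
def build_trie(words):
    """Compile a fixed alternation like (Jan|Feb|Mar|...) into a trie,
    i.e. a small DFA walked one character at a time."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node['$'] = True                  # accepting-state marker
    return root

def longest_prefix_match(trie, s):
    """Longest alternative matching a prefix of s: one left-to-right
    pass, each character examined once, no backtracking."""
    node, best, i = trie, None, 0
    while True:
        if '$' in node:
            best = s[:i]                  # remember the last accepting state
        if i >= len(s) or s[i] not in node:
            return best
        node = node[s[i]]
        i += 1
```

Note the DFA naturally reports the longest alternative, e.g. "June" rather than "Jun" for input "Junebug" -- which is precisely where its semantics diverge from leftmost-alternation backtracking.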

 If you want to head over to [EMAIL PROTECTED] and pitch in on 
 the regex implementation (it's being worked on now) that'd be great.

I'd like to do that, if I can find the time.  It would be interesting to 
make a small experimental prototype to see if DFA construction could really 
improve performance over backtracking, but it would probably need to be a 
very restricted subset of regex operations to test the idea...

However, while I'm still on perl6-language, I have two language issues to 
discuss first:

(1) Can we have a :study modifier in Perl 6 for patterns?

It could be a no-op if necessary, but it could take the place of Perl 5's 
study operator and indicate that the programmer WANTS the pattern 
optimized for maximum runtime speed, even at the cost of compile time or 
memory.  (Hey, how about a :cram modifier for extreme optimization? :-)

(2) Would simple alternation impede DFA-style optimizations?

Currently, a pattern like (Jun|June) would never match "June" because the 
leftmost match "Jun" would always take precedence, despite the normal 
longest-match behavior of regexes in general.  This example could be 
implemented in a DFA; would that always be the case?

Would it be better for the matching of (Jun|June) to be undefined and 
implementation-dependent?  Or is it best to require leftmost semantics?
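The leftmost semantics in question are easy to see in any backtracking engine; Python's re module is used below purely as a stand-in for illustration:

```python
import re

# The leftmost alternative that lets the whole pattern succeed wins,
# even though "June" would be the longer match at this position:
assert re.match(r'(Jun|June)', 'June').group(1) == 'Jun'

# A POSIX-style (DFA-like) engine would instead report the longest
# match, "June" -- the semantic difference at issue here.
```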

Deven




Re: Does ::: constrain the pattern engine implementation?

2002-08-28 Thread Deven T. Corzine

On Wed, 28 Aug 2002, Larry Wall wrote:

 : (1) Can we have a :study modifier in Perl 6 for patterns?
 : 
 : It could be a no-op if necessary, but it could take the place of Perl 5's 
 : study operator and indicate that the programmer WANTS the pattern 
 : optimized for maximum runtime speed, even at the cost of compile time or 
 : memory.  (Hey, how about a :cram modifier for extreme optimization? :-)
 
 Well, studied isn't really a property of a pattern--it's a property of a
 string that knows it will have multiple patterns matched against it.  One
 could put a :study on the first pattern, but that's somewhat deceptive.

Oh yeah.  I forgot it applied to the string, not the pattern!  I forgot 
since I never use it! :-)  Still, it could be considered a parallel...

Perhaps a better approach would be to allow the optimization priorities to 
be specified, perhaps even as numerical ranges for relative importance?  
The three obvious dimensions to quantify would be compile time, run-time 
speed, and memory usage.  There are often tradeoffs between these three, and 
allowing a programmer to specify his/her preferences could enable 
aggressive optimizations that are normally inappropriate...

Of course, these would be useful not only as modifiers for compiling any 
regexes, but as general pragmas controlling optimizing behavior of the 
entire Perl 6 compiler/optimizer...

I'm not sure if it's good enough to just say "optimize for run-time speed 
at the expense of compile time and memory" (or variations sacrificing only 
one of the two) -- or if it's better to have a scale (say, 0-9) 
for how important each dimension is.

For the extreme case where long compile time and large memory usage is 
irrelevant, but extreme run-time speed is a must, the programmer might 
specify optimization priorities of compile=0, memory=0, speed=9.  I'm not 
sure what sort of syntax would be appropriate for such specifications...

 : (2) Would simple alternation impede DFA-style optimizations?
 : 
 : Currently, a pattern like (Jun|June) would never match June because the 
 : leftmost match Jun would always take precedence, despite the normal 
 : longest-match behavior of regexes in general.  This example could be 
 : implemented in a DFA; would that always be the case?
 
 Well, June can match if what follows fails to match after Jun.

True enough.  Couldn't that still be implemented in a DFA?  (Possibly at 
the cost of doubling the size of the DFA for the later part of the regex!)

 : Would it be better for the matching of (Jun|June) to be undefined and 
 : implementation-dependent?  Or is it best to require leftmost semantics?
 
 Well, the semantics shouldn't generally wobble around like that, but it'd
 be pretty easy to let them wobble on purpose via pragma (or via :modifier,
 which are really just pragmas that scope to regex groups).

Yeah, it's probably safer not to leave that much room for undefined behavior, 
since people will just try it and assume that their implementation's behavior 
is universal...

Would there be a good way to say "don't care" about the longest-vs-leftmost 
matching semantics?  Would it be worthwhile to have longest-trumps-leftmost 
as an optional modifier?  (This might be very expensive if implemented in a 
backtracking engine, since it could no longer shortcut alternations...)

Dan suggested :dfa for DFA semantics -- is that the best answer, or would 
it be better to define the modifiers in terms of visible behavior rather 
than implementation, if possible?

Deven




Re: Does ::: constrain the pattern engine implementation?

2002-08-28 Thread Deven T. Corzine

On Wed, 28 Aug 2002, Larry Wall wrote:

 That is a worthy consideration, but expressiveness takes precedence
 over it in this case.

I see nothing wrong with expressiveness taking precedence -- I'm only 
saying that it would be best to be cognizant of any restrictions we're 
hardcoding into the design (in case there's a less restrictive means to the 
same ends) and make that design tradeoff knowingly rather than by default.

If we can find general solutions that don't demand a particular style of 
implementation, that's probably an improvement.  There may be unavoidable 
cases, in which case we decide to accept the limitation for expressiveness, 
and that's a perfectly reasonable design choice.

I'd just hate to ignore the issue now, and have someone later say "here's a 
great way it could have been done that would have allowed this improvement 
in the implementation"...

 DFAs are really only good for telling you *whether* and *where* a pattern 
 matches as a whole.  They are relatively useless for telling you *how* a 
 pattern matches.  For instance, a DFA can tell you that you have a valid 
 computer program, but can't hand you back the syntax tree, because it has
 no way to decide between shifting and reducing.  It has to do both
 simultaneously.

Yes and no.  You're right, but see below.

 : It may be that backreferences already demand backtracking.  Or some other 
 : feature might.  I don't know; I haven't thought it through.
 
 I believe you are correct that backrefs require backtracking.  Maybe some 
 smart person will find a way to trace back through the states by which a 
 DFA matched to retrieve backref info, but that's probably worth a PhD or 
 two.

Well, there are certainly PhD students out there doing new research all the 
time.  Who knows what one will come up with one day?  It would suck if one 
gets a PhD for a super-clever pattern-matching algorithm, and we find that 
we can't use it because of hardcoded assumptions in the language design...

As for backtracing states of the DFA, see below.

 Minimal matching is also difficult in a DFA, I suspect.

Is it?  I'm not sure.  Since the DFA effectively follows all branches of 
the NFA at once, perhaps minimal matching is no more difficult than maximal?

Then again, maybe not. :-)

 : If we must have backtracking, so be it.  But if that's a tradeoff we're 
 : making for more expressive and powerful patterns, we should still at least 
 : make that tradeoff with our eyes open.  And if the tradeoff can be avoided, 
 : that's even better.
 
 I refer you to http:://history.org/grandmother/teach/egg/suck.  :-)

Um, huh?

 That's a tradeoff I knowingly made in Perl 1.  I saw that awk had a
 DFA implementation, and suffered for it as a useful tool.

I suspect it's not practical to have an all-DFA implementation with nearly 
the power and expressiveness of Perl 4 or Perl 5 regexes, much less Perl 6.

On the other hand, many patterns have subpatterns which might benefit from 
using a DFA as an optimization.  You don't lose the expressiveness here,
if the backtracking NFA is available as well.

I'm just asking that we consider the impact on such optimizations, and see 
if we can leave the door open to reap the benefits without compromising the 
power and expressiveness we all want.  Maybe this just amounts to adding a 
few modifiers to allow semantic variants (like longest-trumps-leftmost), to 
enable optimizations that would otherwise impinge on correctness...

 And it's not just the backrefs.  An NFA can easily incorporate
 strategies such as Boyer-Moore, which actually skips looking at many of
 the characters, or the scream algorithm used by study, which can skip
 even more.  All the DFAs I've seen have to look at every character,
 albeit only once.  I suppose it's theoretically possible to compile
 to a Boyer-Moore-ish state machine, but I've never seen it done.

Okay, I confess that I've been saying DFA when I don't necessarily mean 
precisely that.  What I really mean is a non-backtracking state machine 
of some sort, but I'm calling it a DFA because it would be similar to one 
(to the degree possible) and people know what a DFA is.  I could say NBSM, 
but that seems confusing. :-)

Your objections to the limitations of a DFA are quite correct, of course.  
Modifications would be required to overcome the limits, and then it's no 
longer really a DFA, just like Perl 5's regular expressions are no longer 
really regular expressions in the mathematical sense.  I'm envisioning a 
state machine of some sort, which has a lot in common with a DFA but isn't 
strictly a DFA anymore.  If you prefer, I'll call it an NBSM, or I'm open 
to better suggestions for a name!

Anyway, to respond to your objections to a DFA:

* While you couldn't hand back a syntax tree from a true DFA, it should be
  possible to create an NBSM from a DFA recognizer, modified to record 
  whatever extra information is needed to execute the code that constructs 
  the syntax tree.  The 

Re: Does ::: constrain the pattern engine implementation?

2002-08-28 Thread Deven T. Corzine

On Wed, 28 Aug 2002, Steve Fink wrote:

 On Wed, Aug 28, 2002 at 12:55:44PM -0400, Deven T. Corzine wrote:
  On Wed, 28 Aug 2002, Dan Sugalski wrote:
   At 10:57 AM -0400 8/28/02, Deven T. Corzine wrote:
  
  On the other hand, :, ::, ::: and commit don't necessarily need to be a 
  problem if they can be treated as hints that can be ignored.  If allowing 
  the normal engine to backtrack despite the hints would change the results, 
  that might be a problem.  I don't know; cut may pose special problems.
 
 They do change the semantics.
 
  (June|Jun) ebug
 
 matches the string "Junebug", but
 
  (June|Jun): ebug
 
 does not. Similarly,
 
  (June|Jun) ebug
  (June|Jun) :: ebug
  (June|Jun) ::: ebug
  (June|Jun) commit ebug
 
 all behave differently when embedded in a larger grammar.

Good point.  Okay, they definitely change the semantics.  Still, could such 
semantics be implemented in a non-backtracking state machine, whether or 
not it's a strict DFA?

 However, it is very possible that in many (the majority?) of actual
 uses, they may be intended purely as optimizations and so any
 differing behavior is unintentional. It may be worth having a flag
 noting that (maybe combined with a warning: "you said this :: isn't
 supposed to change what can match, but it does.")

I think this is like the leftmost matching semantics -- it may exist for 
the sake of implementation efficiency, yet it has semantic consequences as 
well.  In many cases, those semantic differences may be immaterial, yet 
some code relies on it.  Allowing flags to specify that such differences in
semantics are immaterial to your pattern would be helpful.  (Would it make 
sense for one flag to say "don't care" to the semantic differences for BOTH 
leftmost matching and the :/::/:::/etc. operators?)

  I believe there are many subpatterns which might be beneficial to compile 
  to a DFA (or DFA-like) form, where runtime performance is important.  For 
  example, if a pattern is matching dates, a (Jan|Feb|Mar|Apr|...) subpattern
  would be more efficient to implement as a DFA than with backtracking.  With 
  a large amount of data to process, that represent significant savings...
 
 I agree. I will reply to this in perl6-internals, though.

Yes, the discussion of details belongs there, when it's not infringing on 
issues of language design, as the semantic consequences do...

  (2) Would simple alternation impede DFA-style optimizations?
  
  Currently, a pattern like (Jun|June) would never match June because the 
  leftmost match Jun would always take precedence, despite the normal 
  longest-match behavior of regexes in general.  This example could be 
  implemented in a DFA; would that always be the case?
 
 You should read Friedl's Mastering Regular Expressions, if you haven't
 already. A POSIX NFA would be required to find the longest match (it
 has to work as if it were a DFA). A traditional NFA produces what
 would result from the straightforward backtracking implementation,
 which often gives an answer closer to what the user expects. IMO,
 these are often different, and the DFA would surprise users fairly
 often.

Rather than following a traditional approach to NFA/DFA construction, would 
it be possible to use a modified approach that preserves leftmost matching?
(If so, would it be more expensive or just different?)

  Would it be better for the matching of (Jun|June) to be undefined and 
  implementation-dependent?  Or is it best to require leftmost semantics?
 
 I'm voting leftmost, because I've frequently seen people depend on it.

I was going to agree, until I read your next paragraph.

However, it would be useful to be able to say "don't care" to the semantic 
distinction -- it might even be useful to be able to demand longest-match 
take precedence over leftmost matching, but that would incur an extra cost 
in the normal regex engine...

 I'm not so sure that Larry's suggestion of adding a :dfa flag is
 always the right approach, because I think this might actually be
 something you'd want to set for subsets of a grammar or a single
 expression. I don't think it's useful enough to go as far as proposing
 that || mean alternate without defining the order of preference, but
 perhaps some angle-bracketed thing would work. (Or can you embed
 flags in expressions, like perl5's (?imsx:R) thing? Then the :dfa flag
 is of course adequate!)

You know, when you bring up an idea like ||, I start thinking that maybe 
the default should be NOT to have a preference (since it normally doesn't 
matter) and to only guarantee the leftmost short-circuit behavior with || 
instead of |.  That would allow for more implementation flexibility, and 
provide a beautiful parallel with C -- in C, only || short-circuits and 
the | operator still evaluates all parts.  (Granted, that's because it's 
bitwise, but there's still a nice parallel there.)

For the few cases where someone wants to rely on leftmost matching of any 
alternation, they could simply use

Copy-restore on parameters? (was Re: Autovivi)

2002-08-15 Thread Deven T. Corzine

On Wed, 14 Aug 2002, Luke Palmer wrote:

 Why?  We could make arglists exactly equivilent to the way they're done in 
 Perl 5, which is a good way.
 
   sub foo($a, $b, *@c) {...}
 
 Would be exactly equivilent to Perl 5's
 
   sub foo { my ($a, $b, @c) = @_; ... }

I've got another idea.  How about using a copy-restore technique?

First, assume all arguments are pass-by-reference, whether constant or not.

Second, allow "is ro" as a modifier to force constant semantics -- in this 
case, don't make a copy, just use the reference and refuse to modify it.

Third, allow "is rw" as a modifier to enable copy-restore, and to make the 
reference modifiable.

Fourth, copy all named parameters into local variables from the passed 
reference (except for "is ro" parameters, which should be aliased) -- and 
then keep @_ with the actual references, as in Perl 5.  (Unlike Perl 5, the 
"is ro" parameters would not be modifiable even via @_.)

Finally, for "is rw" parameters, copy the modified local variable back to 
the referenced parameter (the original one, even if @_ was shifted!), but 
ONLY for a normal subroutine exit.

This would allow pass-by-value semantics by default for convenience, easily 
forced read-only parameters, pass-by-reference semantics when desired, but 
with a twist -- if the function aborts (throws an exception), no parameters 
have been changed, which offers hassle-free transaction-like atomicity.

If a function really wants to modify its parameter, despite throwing an 
exception, it could do so via @_ directly.  Or, we could also allow another 
modifier "is ref" to use simple (modifiable) pass-by-reference semantics, 
but default to copy-restore for "is rw"...

This seems like it could be the best of all worlds -- what do people think?
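The copy-restore discipline proposed above can be modeled in a short hypothetical Python sketch (the 'box' dict stands in for a passed reference; nothing here is actual Perl machinery): the working copy is written back only on a normal exit.

```python
import copy

def call_with_copy_restore(func, box):
    """Model of the proposed "is rw" copy-restore semantics.

    The callee works on a deep copy of the caller's value; the copy
    is written back only if the callee returns normally.  If it
    raises, the caller's value is untouched -- the transaction-like
    atomicity described above.
    """
    local = copy.deepcopy(box['value'])
    result = func(local)           # an exception here skips the write-back
    box['value'] = local           # the "restore" step, normal exit only
    return result
```

So a callee that mutates its copy and then throws leaves the caller's data exactly as it was before the call.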

Deven




Re: Copy-restore on parameters? (was Re: Autovivi)

2002-08-15 Thread Deven T. Corzine

On Thu, 15 Aug 2002, Larry Wall wrote:

 On Thu, 15 Aug 2002, Deven T. Corzine wrote:
 : I've got another idea.  How about using a copy-restore technique?
 
 I suspect that would make Perl 6's sub calls even slower than Perl 5's.

Yes and no.

For the normal case (pass-by-value semantics), it should be about the same 
speed as Perl 5, since it would be doing the same thing: making a copy of a 
value that was passed by reference.

For the "is ro" or "is ref" case, it should be FASTER, since it could avoid 
the copy by aliasing the reference.

Only the "is rw" case would be slower, since it would involve two copies 
rather than one.  However, it would offer the benefit of rolling back any 
changes to the parameters in case of an abnormal exit from the function.  
This seems like a valuable feature that could often be useful -- and anyone 
who wants to avoid the speed penalty could stick with "is ref", assuming 
that option is also allowed...

Deven





Re: Autovivi

2002-08-13 Thread Deven T. Corzine

On Mon, 5 Aug 2002, Dan Sugalski wrote:

 At 1:30 PM +1000 8/6/02, Damian Conway wrote:
 Luke Palmer asked:
 
 Does:
 
  print %foo{bar}{baz};
 
 still create %foo{bar} as a hashref if it doesn't exist?
 
 It is my very strong hope that it will not.
 
 Unless Larry declares otherwise, it won't.

I brought up this question on 12/22/00, and it didn't get much response, 
except for Graham Barr, who pointed out that there might be some trouble 
with constructs like func($x{1}{2}{3}) that may or may not need to 
autovivify the argument if the function assigns to $_[0].  (This seems a 
bit like the difficulty of implementing hypothetical values in regexes; 
perhaps a similar solution could apply to both?)

Subsequent to that mini-discussion, I had occasion to talk to Larry in 
person, and asked him about the autovivification behavior -- his response 
was that he considered Perl 5's behavior of autovivifying read-only values 
to be a misfeature, and that he hopes Perl 6 won't do so.

Now, since Larry, Dan and Damian all seem to be in agreement on this issue, 
I sincerely hope Perl 6 will indeed break from Perl 5's misbehavior here...

However, will the func($x{1}{2}{3}) case cause an implementation problem?

Deven




Re: Autovivi

2002-08-13 Thread Deven T. Corzine

On Tue, 13 Aug 2002, Larry Wall wrote:

 On Tue, 13 Aug 2002, Deven T. Corzine wrote:
 : However, will the func($x{1}{2}{3}) case cause an implementation problem?
 
 This is why the new function type signatures will assume that
 parameters are constant.  If you want a modifiable parameter, you
 have to say "is rw".  Then it's considered an lvalue.  In Perl 5,
 all function args had to be considered potential lvalues.

That will certainly help.  What about the "is rw" case, though?  Will we be 
able to avoid autovivifying the parameter if it's never actually assigned, 
or do we decide this edge case isn't worth the hassle, and let it 
follow Perl 5's behavior for such variables?

 In essence, all Perl 5 functions have a signature of (*@_ is rw).
 Perhaps the translator can turn some of those into (*@_).  What'd
 really be cool is if it could pick up an initial
 
  my ($a, $b, $c) = @_;
 
 and turn it into an appropriate signature.

That would be pretty cool.  It would also be nice to handle the common 
alternative form of using shift as well...

 Of course, there are issues here if the code modifies those variables, 
 since the issue of whether a variable is rw is really distinct from 
 whether it represents a pass by value or reference.  Slapping a 
 constant on it is a bald-faced attempt to get the speed of 
 pass-by-reference with the guarantees of pass-by-value.

The only accurate way to know if the code modifies the variables is to do 
some sort of dataflow analysis, and it can't be 100% accurate even then.  
(Suppose a shift may or may not happen, depending on the parameters, and 
then $_[2] is modified?)  Of course, it should often be possible (at least 
in principle) to determine that it's impossible for a particular parameter 
to be modified, and play it safe by assuming "is rw" for undeterminable cases.

 Perhaps there should be a way to declare a parameter to be pass-by-value, 
 producing a modifiable variable that does not affect the caller's value.  
 But I'm not sure saving one assignment in the body is worth the extra 
 mental baggage.

I'd argue that it's worthwhile -- not because of the cost of that single 
assignment, but because you'd now have two variables to keep track of, the 
original parameter and the modifiable copy.  This seems like greater mental 
baggage (to me) than remembering another modifier.

How about "is copied" as a pass-by-value modifier?  (I was inclined to say 
"is a copy" but I didn't think a 3-word phrase would be desirable here...)

Deven




Re: Autovivi

2002-08-13 Thread Deven T. Corzine

On Tue, 13 Aug 2002, Nicholas Clark wrote:

 On Tue, Aug 13, 2002 at 03:06:40PM -0400, Deven T. Corzine wrote:
 
  The only accurate way to know if the code modifies the variables is to do 
  some sort of dataflow analysis, and it can't be 100% accurate even then.  
  (Suppose a shift may or may not happen, depending on the parameters, then 
  $_[2] is modified?)  Of course, it should often be possible (at least in 
  principle) to determine that it's impossible for a particular parameter to 
  be modified, and play it safe by assuming is rw for undeterminable cases.
 
 Well, perl5 does already manage to avoid auto-vivifying hash keys when they
 are used as subroutine arguments. It uses magic, rather than dataflow
 analysis:

Yes, magic obviously can work, but it's a greater run-time penalty -- if 
you do some sort of dataflow analysis, you pay the penalty at compile time 
instead.  I'm not sure which is better, ultimately.  It probably varies.

Anyway, the Perl 5 magic you speak of presumably only works for a single 
level of hash referencing -- we were specifically discussing a multilevel 
reference, which immediately creates missing substructures in Perl 5, even 
when used in a read-only rvalue context in a simple Perl 5 expression...

Deven





Re: Autovivification behavior

2000-12-23 Thread Deven T. Corzine


On Sat, 23 Dec 2000, Graham Barr wrote:

 This has been discussed on p5p many many times. And many times
 I have agreed with what you wrote. However one thing you did not mention,
 but does need to be considered is
 
   func($x{1}{2}{3})
 
 at this point you do not know if this is a read or write access as
 the sub could do $_[0] = 'fred'. If this can be handled in someway
 so that the autoviv does not happen until the assignment then OK,
 but otherwise I think we would have to stick with perl5 semantics

This is a good point.  Similar arguments apply to grep, map, foreach, etc.
Could we have some sort of "lazy evaluation" mode for lvalues where a
reference to an undefined substructure sets a flag and saves the expression
for the lvalue, returning undef whenever evaluated in a read-only context
and autovivifying when necessary to write to the lvalue?

 Also we would probably need a "use autoviv" pragma so that perl5
 scripts which are ported, and rely on the autovivification of hashes,
 will still work as before.

This makes sense, if any programs rely on the current behavior.  (Do you
know of any such programs?)

Deven




Autovivification behavior

2000-12-22 Thread Deven T. Corzine

Can the autovivification behavior be changed slightly for Perl 6?

Specifically, can we suppress autovivification in read-only circumstances, and 
evaluate the expression as "undef" immediately instead of autovivifying
empty data structures that didn't exist before?

The current behavior in Perl 5 is inconsistent.  Attempting to reference a
hash or array element doesn't normally cause that element to spring into
existence, but attempting to reference substructures DOES suddenly cause
that substructure to spring into existence.  This behavior is desirable
when writing to that substructure, but undesirable when reading from it,
and inconsistent with the normal behavior of single-level perl structures.

Referencing a nonexistent entry in a hash or array doesn't create that
entry UNLESS it's autovivified as a reference.  "%x = (); $x{1};" does NOT
have the same effect as "%x = (1 => undef);"; why does "%x = (); $x{1}{1};"
have the same effect as "%x = (1 => {});"?  If the programmer has never
stored any values into %x, it would be reasonable to expect it to remain
empty; why violate this reasonable expectation just because of an attempted
reference to a nonexistent structure?

Here is an example (in the Perl debugger) demonstrating this inconsistency:

  DB<1> x \%x             # %x starts empty
0  HASH(0x82136d0)
     empty hash
  DB<2> x $x{1}           # access a nonexistent entry
0  undef
  DB<3> x \%x             # %x remains empty
0  HASH(0x82136d0)
     empty hash
  DB<4> x $x{1}{2}{3}{4}  # access nonexistent substructures
0  undef
  DB<5> x \%x             # %x no longer empty!
0  HASH(0x82136d0)
   1 => HASH(0x829a33c)
      2 => HASH(0x829a378)
         3 => HASH(0x829a408)
              empty hash
  DB<6> x \@x             # @x also starts empty
0  ARRAY(0x829a420)
     empty array
  DB<7> x $x[3]           # access a nonexistent entry
0  undef
  DB<8> x \@x             # @x remains empty
0  ARRAY(0x829a420)
     empty array
  DB<9> x $x[3][4]        # access nonexistent substructure
0  undef
  DB<10> x \@x            # @x no longer empty!
0  ARRAY(0x829a420)
   0  undef
   1  undef
   2  undef
   3  ARRAY(0x829a554)
        empty array

Now, I understand why this happens, but is it really a good thing that it
happens in the first place?  We're not storing into the nonexistent substructures
(in which case you DO want autovivification, of course), only attempting to
access a possible value in a substructure that may not exist.  This access
attempt is creating the intervening substructures, despite appearing to be
a "read-only" operation on the data structure it modifies.  (This would be
closely analogous to having Unix create intervening directories for a
nonexistent pathname before reporting a "No such file or directory" error 
for the full pathname...)

Currently, it's sometimes necessary to do "defined" tests on intermediate
levels to work around this behavior before testing inside substructures that
may not exist.
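The same read-triggers-creation surprise can be reproduced outside Perl.  A
minimal sketch in Python (used here only as a stand-in; the tree() helper is
hypothetical): a nested collections.defaultdict exhibits exactly the behavior
complained about, while a plain dict behaves like a single-level Perl hash.

```python
from collections import defaultdict

def tree():
    # Every missing key springs into existence as another tree --
    # the same autovivification rule Perl applies to nested lookups.
    return defaultdict(tree)

x = tree()

# A pure read of a nested, nonexistent path...
_ = x[1][2][3]

# ...has silently created the intervening levels, just like the
# $x{1}{2}{3}{4} lookup in the debugger session above.
print(sorted(x.keys()))      # [1]
print(sorted(x[1].keys()))   # [2]

# A plain dict, like a single-level Perl hash, simply refuses:
y = {}
try:
    _ = y[1][2]
except KeyError:
    print("plain dict stays empty:", y == {})   # True
```

The point of the analogy: the "vivify on read" behavior is a property of the
container chosen, which is why it reads as a fixable design choice rather
than a law of nested data structures.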

Couldn't we change this for Perl 6?

Deven




Re: Perl 5's non-greedy matching can be TOO greedy!

2000-12-15 Thread Deven T. Corzine


On Thu, 14 Dec 2000, Nathan Torkington wrote:

 Deven T. Corzine writes:
  I haven't even SEEN an example where the current behavior is actually
  preferable to my proposed behavior, have you?  (And I'd expect at least a
  FEW, though I suspect there are probably more counterexamples.)
 
 I think the biggest problem with your idea is that it requires the
 engine to keep looking even after it finds a match, to see if there's
 another shorter match.  This would make *every* match much much
 slower, potentially heatdeathoftheuniverse slower.
 
 I like the current semantics because it's very easy to visualize the
 engine acting on your instructions and stopping as soon as it finds a
 match.  I am a programmer, and I prefer programs to descriptions.

Not at all.  I don't want it to keep looking after it finds the first
match.  I want it to make sure that match isn't unnecessarily long, if
non-greedy matching was in use.  Conceptually (I don't think this would be
a good implementation), you find the first match as the current engine
does, then search for the smallest possible match WITHIN that first match.
Since it will already be as short as possible from the starting point, this
amounts to advancing the starting point as far as possible without changing
the ending point, as long as it still matches.
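That two-pass idea is concrete enough to sketch.  Below is a Python sketch
(Python's re engine is used as a stand-in for Perl's, since it follows the
same leftmost rule; the name shorten_start is hypothetical): take the
engine's normal leftmost non-greedy match, then advance the starting point
as far as possible without moving the ending point.

```python
import re

def shorten_start(pattern, s):
    """Find the engine's leftmost non-greedy match, then advance the
    starting point as far as possible without changing the ending
    point, as long as the trimmed slice still fully matches."""
    m = re.search(pattern, s)
    if m is None:
        return None
    start, end = m.span()
    for new_start in range(start + 1, end):
        if re.fullmatch(pattern, s[new_start:end]):
            start = new_start
    return s[start:end]

s = "aaabbbcccdddeee"
print(re.search(r"b.*?d", s).group())   # engine's answer: "bbbcccd"
print(shorten_start(r"b.*?d", s))       # proposed answer: "bcccd"
```

This is a conceptual model only, as the message says; a real engine would
need to build the behavior into the match itself rather than re-scan.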

Deven




Re: Perl 5's non-greedy matching can be TOO greedy!

2000-12-15 Thread Deven T. Corzine


On 14 Dec 2000, Randal L. Schwartz wrote:

  "Deven" == Deven T Corzine [EMAIL PROTECTED] writes:
 
 Deven I haven't even SEEN an example where the current behavior is
 Deven actually preferable to my proposed behavior, have you?  (And
 Deven I'd expect at least a FEW, though I suspect there are probably
 Deven more counterexamples.)
 
 If I want the leftmost B that is followed by the shortest number
 of characters to get to a D, I write /B.*?D/.  The "leftmost" part
 keeps getting left out of regex discussions, and maybe that's why
 you're thrown off.  But it's a pretty consistent overriding factor.

That's not an example, that's a hypothetical.  I meant that I've never seen
a concrete, realistic example where the current behavior is more beneficial
to the programmer than my proposed behavior.  (I imagine in most cases, it
will be a moot point, since the match will usually be the same.)

 If you want something odd like "not necessarily the leftmost", then
 you'll need to speak more.  But "leftmost" is fundamental to the
 design of regex.  Don't mess with that.  Or don't call it a regex any
 more.

Strange argument.  Greedy matching was once considered fundamental to the
design of regex, and the "leftmost" behavior is 100% consistent with greedy
matching.  Yet Perl 5 added non-greedy modifiers, changing a fundamental
aspect of every preceding regex system, and still called it a regex...

Extending the design of the regex system to allow for non-greedy matching
was an excellent extension, and made Perl 5's regular expressions far more
powerful than anyone else's.  But they're certainly different than before.

I wish I had been involved in the discussions when the idea of non-greedy
matching was proposed and the details ironed out.  I don't think my ideas
would have had the same kind of concerted opposition back then as now...

Deven




Re: Perl 5's non-greedy matching can be TOO greedy!

2000-12-15 Thread Deven T. Corzine


On 15 Dec 2000, Randal L. Schwartz wrote:

  "Deven" == Deven T Corzine [EMAIL PROTECTED] writes:
 
 Deven As for special-case rules, I believe that my proposed modification would
 Deven REMOVE a special-case semantic rule, at the cost of added complexity at the
 Deven implementation level.  (The cost decision of whether that added complexity
 Deven is worthwhile is a separate consideration.)
 
 No, it would break a much higher overriding rule of "left most match
 wins".  That's at the top of the chart.  You're *adding* an exception
 to that.  Tell me how you can do that without breaking much existing
 code.

Can you give a concrete, real-life example of code that my proposed change
would actually break, not a contrived hypothetical case designed to break?

Adding the requirement for "\@" in double-quoted strings broke a LOT of
Perl 4 code.  I doubt this would break much at all.

 Deven And I'd really appreciate it if everyone would refrain from
 Deven suggesting that I don't understand the behavior.  I understand
 Deven it fine; I just don't agree with it.  In the language of the
 Deven Supreme Court, "I respectfully dissent."  Just because I don't
 Deven perfectly agree with the semantics that were chosen doesn't
 Deven mean I don't understand them.
 
 You don't understand the motivation, apparently.  That's what I'm
 referencing.

I don't understand the motivation behind taking leftmost matching to such
extremes.  I understand the importance of leftmost matching in general;
this is a particular situation where I believe it's being taken a little
too far for the good of the whole.

However, some of the comments that have been made seem to imply that I'm
simply ignorant about regular expressions and I need to learn them better.

My questioning the behavior arises out of philosophy, not ignorance.

Deven




Re: Perl 5's non-greedy matching can be TOO greedy!

2000-12-15 Thread Deven T. Corzine


On Thu, 14 Dec 2000, Jeff Pinyan wrote:

 You could use my sexeger technique to get this behavior (possibly):
 
   $string = "aaabbbcccdddeee";
   # regex to be reversed: /b(.*?)d/
   $revstr = reverse $string;
   ($match) = $revstr =~ /d(.*?)b/;
 
 No, that doesn't quite work.  It works when you unroll the .*? loop:
 
   # regex to be reversed: /b([^bd]*)d/
   ($match) = $revstr =~ /d([^bd]*)b/;
 
 But then why would you use sexeger at all, when you can unroll the loop in
 the forward direction?
 
   ($match) = $string =~ /b([^bd]*)d/;
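The reversed-string ("sexeger") trick quoted above, and the unrolled fix,
can be replayed mechanically.  A sketch using Python's re engine as a
stand-in for Perl (same leftmost-then-minimal semantics for these patterns):

```python
import re

s = "aaabbbcccdddeee"
rev = s[::-1]   # "eeedddcccbbbaaa"

# Reversing /b.*?d/ into /d.*?b/ anchors the match at the *last* d,
# but .*? still swallows the other d's on its way to a b:
m = re.search(r"d.*?b", rev)
print(m.group()[::-1])    # "bcccddd" -- still not the shortest match

# Unrolling the loop as /d([^bd]*)b/ forbids crossing another b or d,
# so the reversed trick now does yield the shortest match:
m = re.search(r"d([^bd]*)b", rev)
print(m.group()[::-1])    # "bcccd"

# But, as noted, the unrolled pattern already works forwards:
print(re.search(r"b([^bd]*)d", s).group())   # "bcccd"
```

Which is exactly the objection: once the loop is unrolled, the reversal buys
nothing for this case.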

I'm not asking for help with workarounds.  If I ever need to work around the
current behavior, I won't have any trouble doing so.  It's just that (on a
philosophical note) I don't believe this is something that should require a
workaround.  Obviously, others disagree.

Deven




Re: Perl 5's non-greedy matching can be TOO greedy!

2000-12-15 Thread Deven T. Corzine


On Thu, 14 Dec 2000, James Mastros wrote:

 On Thu, Dec 14, 2000 at 04:10:12PM -0500, Deven T. Corzine wrote:
  The crux of the problem is that non-greedy qualifiers don't affect the
  "earliest match" behavior, which makes the matches more greedy than they
  really ought to be.
 Right.  We've got a terminology issue.  There's two axes here:

Not the only terminology issue, but I've addressed that in other messages.

 Greed: Greedy by default.  Will try to match as many chars after its
 starting point as possible.  Non-greedy will try to match as few chars after
 its starting point as possible, and is denoted by putting a '?' after a
 quantifier.
 
 Laziness: Eager by default.  An eager assertion will try to match as early
 as possible.  It isn't possible to use lazy assertions at present.  (Not
 simply, anyway.  It's probably possible to combine other (?) features to get
 the same effect.)
 
 You seem to want the choice between greedy, eager assertions, and
 non-greedy, lazy assertions.  If you came up with a good way to specify
 along both axes, I think we'd have a winner.

I want the maximum level of OVERALL consistency for regular expressions as
a whole, rather than immutable adherence to the "leftmost trumps nongreedy"
rule currently in place.  Most of the time, I agree with the precedence of
leftmost over nongreedy.  The example I gave is a case where I believe the
strict adherence to the leftmost rule actually introduces complexity and
makes the regular expression system less self-consistent.

I don't know if anyone else will ever believe me or not.

Deven





Re: Perl 5's non-greedy matching can be TOO greedy!

2000-12-15 Thread Deven T. Corzine


On Fri, 15 Dec 2000, Deven T. Corzine wrote:

 Not at all.  I don't want it to keep looking after it finds the first
 match.  I want it to make sure that match isn't unnecessarily long, if
 non-greedy matching was in use.  Conceptually (I don't think this would be
 a good implementation), you find the first match as the current engine
 does, then search for the smallest possible match WITHIN that first match.
 Since it will already be as short as possible from the starting point, this
 amounts to advancing the starting point as far as possible without changing
 the ending point, as long as it still matches.

Actually, I'm not sure -- it's conceivable that the ending point would ALSO
move inward for a different starting point within the original match.  But
the ending point should NEVER be advanced further -- that's where the
"leftmost over nongreedy" rule should apply instead...

Deven




Re: Perl 5's non-greedy matching can be TOO greedy!

2000-12-15 Thread Deven T. Corzine


On 15 Dec 2000, Randal L. Schwartz wrote:

  "Deven" == Deven T Corzine [EMAIL PROTECTED] writes:
 
 Deven What surprised me was how vigorously people would defend the
 Deven status quo, and insist on the correctness of the current
 Deven behavior without thinking it through.
 
 No, I thought it through quite completely.  As have others.

You and others spent plenty of time thinking this through a long time ago.
I wish I had been involved with that process, to raise these points then.

Have you thought it through NOW, on a purely semantic level (in isolation
from implementation issues and historical precedent), or are you rejecting
the idea out-of-hand because you believe you've already given the matter
enough consideration?

 Deven Given how invested people are in the exact current behavior, I
 Deven now believe it was a poor choice of words to describe it as a
 Deven "flaw", simply because it presumed an implied consensus on the
 Deven higher-level semantics that obviously wasn't there.
 
 Quite the opposite.  You seem to be one of the very few who expects it
 to act other than as documented.

Really?  I haven't taken a survey, but I did ask one co-worker for his
first impression of what the regexp (from my example) would match.  Not
being an experienced Perl programmer, but being familiar with regular
expressions, he believed he understood the idea of non-greedy matching.
His expectation?  That it would match "bcccd", not "bbbcccd".

Is it that few people expect this, or that they get smacked down so fast
when they suggest it that they never dare raise the question again?

 Deven It seems to have been interpreted as a value judgement on my
 Deven part, which it wasn't.  It merely occurred to me that Perl 6
 Deven might provide an opportunity to eliminate a minor quirk in the
 Deven regular expression system.  I didn't mean to imply that the
 Deven current behavior is BAD, simply that it's not quite right (at
 Deven least in my mind) -- since there's serious disagreement about
 Deven this, I'd like to make a shift in terminology and start
 Deven referring to this behavior as a "semantic anomaly" rather than
 Deven a "flaw" or a "bug", and hope that will be a more neutral term.
 
 It's not an anomaly at all.  It comes out completely accurate with
 some very simple rules for how regex works.

It's an anomaly in my mind.  Whether it's a real anomaly or an imaginary
one, I left that one up in the air for the moment.  Personally, I believe
the anomaly is real, but quite dependent on the perspective from which you
view the situation.

Yes, the existing rules are accurate, predictable and simple.  Are they the
BEST possible rules, or just what made the most sense at the time?

 Admit it... it bit you, and you are just refusing to believe that you
 don't have a complete and accurate model for how regex work.  Please,
 admit this, and we can MOVE ON.

Actually, it's never bit me.  I only discovered it by experimenting with
contrived examples to see exactly how the new (at that time) non-greedy
matching feature worked.  I thought the behavior was a little odd, but the
explanation made enough sense that it didn't seem worth arguing about.

Now that Perl 6 is on the horizon, it seems time to consider the idea.

I have yet to run into a real-life situation where this anomaly really
matters much.  Which might be an argument not to worry about it.  But the
fact that I was already extremely familiar with greedy regexps, and (at
first) surprised by this behavior, makes me wonder what's the best way.

 Deven Hopefully, we can have a rational discussion about whether this
 Deven semantic anomaly is real or imagined, what impact "fixing" it
 Deven would have on the implementation (if it's deemed real), and
 Deven whether it's worth "fixing".
 
 You can't fix what isn't broken.

I put "fixing" in quotes because it depends on whether it's broken or not.

I did the same thing that everyone else is doing, skipping the first part
of the discussion on the assumption that the answer was OBVIOUS.  Well, it
clearly isn't as obvious as we all believe, since what seems obvious to me
and what seems obvious to you are in conflict.  Everyone keeps trying to
frame it as my presumed "misunderstanding" of how it should work, when it's
actually a question of perspective.

Since nobody seems willing to contemplate this from the perspective that
I'm presenting, it's difficult to conclude whether or not it's actually
broken.  Everyone simply assumes I'm wrong and doesn't care beyond that.

 Deven If the final decision is not to change the current behavior,
 Deven for whichever reason, I'd like to see this documented in an RFC
 Deven that says "here's what was requested and why it isn't going to
 Deven be done".  I'll volunteer to help with that (even if I remain
 Deven in the minority), whether by summarizing or cutting an

Re: Perl 5's non-greedy matching can be TOO greedy!

2000-12-15 Thread Deven T. Corzine

[I delayed responding to this message because it was the longest.]

On Thu, 14 Dec 2000, Tom Christiansen wrote:

 No question that's how it's been implemented.  But WHY would anyone want
 such behavior?  When is it beneficial?
 
 It is beneficial because this is how it's always been, because it
 is faster, because it is more expressive, because it is more powerful,
 because it is more intuitive, and because it is more perlian.

Nice summary, but I'm not buying what you're selling in the elaboration.

 In elaboration:
 
 0) All NFAs before POSIX acted this way.  It is historically 
consistent and perfectly expected.

First, all those older regular expression systems were inherently greedy
matching algorithms.  There is absolutely no conflict between earliest
match and greedy matching.  The conflict only arises when non-greedy
matching is introduced into the equation.

Second, by referring to NFA's, you're already dropping down to a lower
level, and as I said in a prior message, you've lost the gestalt by then.
Once you drop to that level (and stop referencing the semantics of the
whole), my point is moot, because it no longer makes sense at that level.

Finally, historical precedent is always a justification not to change
anything.  Of course, that argument would also have weighed against the
wonderful regex extensions that Perl 5 added, since they went against the
way "it's always been"...

 1) It is obviously faster to come to an answer earlier on in the
execution than it would be to come to an answer later.  It's
like an expression whose evaluation short-circuits.  Also, when
the matching semantics permit back tracking and back references,
the combinatoric possibilities can easily explode into virtual
unsolvability as the 2**N algorithm loses its race to the heat
death of the universe.  Yes, if Perl did overall-longest or
overall-shortest, this would produce a more predictable time;
however, as we see with DFAs and POSIX NFAs, this prediction
plays out as guaranteed *WORST-CASE* time.  It is not acceptable
to make everyone pay the worst-case time.  Never penalize
the whole world for the needs or desires or the few.

I don't think it should be implemented unless the cost is small or applied
only to those situations where it's wanted despite the cost.

However, I don't believe it would NECESSARILY be slower to execute -- yes,
there's more complexity.  The price for that complexity must be paid, but
it may be payable in memory by making a larger (but equally fast) NFA for
the regular expression.  Or maybe it can't.  I don't know, because I have
not investigated it.  It's premature to simply assume it MUST be slower.

 2) Consider the simple case, /A|B/.  In your overall longest/shortest,
guaranteed worst-case time, both submatch A and submatch B must
be calculated, and then the lengths of their matches both be compared.
Perl, fortunately, does not do that.  Rather, the first one in that
sequence wins.  That means that under the current scheme, the 
patterns /A|B/ and /B|A/ have different semantics.  Under your 
worst-case scheme, they do not.  Because /A|B/ and /B|A/ mean
something different, more expressivity is provided.  This is the
same scenario, albeit expressed slightly differently, as your
situation.  The issues manifest in both are equivalent.

Then that is another semantic anomaly, because the alternation is supposed
to mean that one or the other pattern matches.  Logically, "or" should be
commutative.  If they're not quite, that's another disconnect between the
high-level model and the implementation.  Maybe /A|B/ and /B|A/ _do_ mean
different things to the engine, but they do NOT mean different things in
the high-level semantics -- except when you force the high-level semantics
to change by fiat, to match the implementation.

 3) This leads to increased power.  It's like the difference between
a short-circuiting "or" and one that blindly plods ahead trying
to figure something out even when all is for naught.  Compare A&&B
with A&B, for example.  If A is 0, then B need not be computed, 
yet in the second version, one runs subexpression B nevertheless.
If according to the rules of one particular system, patX and
patY mean different things, whereas in a second system, they are
completely interchangeable, then the first system can express
nuances that the second one cannot.  When you have more nuances,
more expressivity, then you have more power, because you can say
things you could not otherwise say.  Why do C and its derivatives
such as Perl have short-circuiting Boolean operators?  Because
in older languages, such as Fortran and Pascal, where you did
not have them, one quickly found that this was cumbersome and
annoying.

It's a question of speed vs. correctness.  Correctness is important, but
occasionally a little incorrectness is worthwhile for increased speed.

Short-circuiting in C 

Perl 5's non-greedy matching can be TOO greedy!

2000-12-14 Thread Deven T. Corzine

Non-greedy matching is a very valuable Perl 5 regular expression feature
that simplifies many regular expressions.  However, early on I discovered
what seems to be a failure of the mechanism -- matches were MORE greedy
than necessary.  I'm not sure if other people noticed this, or just failed
to care about it.  Maybe they would disagree with my interpretation or want
to maintain backward-compatibility now that it's implemented in Perl 5 this
way.  I always meant to raise it for discussion long ago, but I never found
the time to participate in the p5p list...

The crux of the problem is that non-greedy qualifiers don't affect the
"earliest match" behavior, which makes the matches more greedy than they
really ought to be.

Here is a simple example: (tested with perl 5.005_03)

 $_ = "aaabbbcccdddeee";
 ($greedy) = /(b.*d)/;      # "bbbcccddd" (correct)
 ($non_greedy) = /(b.*?d)/; # "bbbcccd" (should be "bcccd"!)
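[For reference, this behavior isn't Perl-specific; Python's re engine, used
here as a stand-in, applies the same leftmost-then-minimal rule to the
string "aaabbbcccdddeee", so the example is easy to experiment with:]

```python
import re

s = "aaabbbcccdddeee"

greedy     = re.search(r"(b.*d)",  s).group(1)
non_greedy = re.search(r"(b.*?d)", s).group(1)

print(greedy)       # "bbbcccddd" -- the longest overall match
print(non_greedy)   # "bbbcccd"  -- leftmost start wins over "bcccd"
```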

I'm sure this will complicate the NFA for the regexp, but it seems like
this really ought to be fixed, at least in Perl 6.  (There's a good case to
be made for fixing it in Perl 5, but people have ignored/missed it for this
long already...)

Does anyone disagree with the premise, and believe that "bbbcccd" is the
CORRECT match for the non-greedy regexp above?

For what it's worth, here's a quote from a Perl 5.005_03 "perlre" manpage:

 By default, a quantified subpattern is "greedy", that is, it will
 match as many times as possible (given a particular starting
 location) while still allowing the rest of the pattern to match.
 If you want it to match the minimum number of times possible,
 follow the quantifier with a "?".  Note that the meanings don't 
 change, just the "greediness":

I don't believe that ".*?" matching "bbccc" above qualifies as "to match
the minimum number of times possible", when it is possible to match only
"ccc" and still match the full regexp.  Since the documentation makes
no mention of earliest-match in this paragraph, I can only assume this is
unintended behavior, but I'm asking to check my assumptions.  Any devil's
advocates out there who want to argue for the current behavior?

Deven

P.S. Meta question: Is this the right way to propose this, or should I
write it up as an RFC and submit it first?  (I'm not clear if the RFC is
intended to be submitted out of the blue, or after initial discussion...)




Re: Perl 5's non-greedy matching can be TOO greedy!

2000-12-14 Thread Deven T. Corzine


On Thu, 14 Dec 2000, Tom Christiansen wrote:

 Does anyone disagree with the premise, and believe that "bbbcccd" is the
 CORRECT match for the non-greedy regexp above?
 
 Yes.  The Camel's regex chapter reads:
 
 You might say that eagerness holds priority over greed (or thrift).

No question that's how it's been implemented.  But WHY would anyone want
such behavior?  When is it beneficial?

 For what it's worth, here's a quote from a Perl 5.005_03 "perlre" manpage:
 
  By default, a quantified subpattern is "greedy", that is, it will
  match as many times as possible (given a particular starting
  location) while still allowing the rest of the pattern to match.
  If you want it to match the minimum number of times possible,
  follow the quantifier with a "?".  Note that the meanings don't 
  change, just the "greediness":
 
 I don't believe that ".*?" matching "bbccc" above qualifies as "to match
 the minimum number of times possible", when it is possible to match only
 "ccc" and still match the full regexp.  Since the documentation makes
 no mention of earliest-match in this paragraph, I can only assume this is
 unintended behavior, but I'm asking to check my assumptions.  Any devil's
 advocates out there who want to argue for the current behavior?
 
 The simple story is this:
 
 Rule 1: Given two matches at *different* starting points, the
   one that occurs earlier wins.
 
 *OTHERWISE*
 
 Rule 2: Given two matches at the *same* starting points, the
   one that is longer wins.

 Or, more lengthly:
 
[...]
   "exasperate" =~ /e(.*?)e/   #  $1 now "xasp"
[...]

I didn't need the long-winded explanation, and I don't need help with
understanding how that regexp matches what it does.  I understand it
perfectly well already.  I'm no neophyte with regular expressions, even if
Perl 5 does offer some regexp features I've never bothered to exploit...

My point is that the current behavior, while reasonable, isn't quite right.
It's very close, but not quite.

I thought the point of Perl 6 was to improve the language through feedback
and involvement from the Perl community?  That's what I'm trying to do,
offer feedback on what I consider to be a subtle (and yes, fairly minor)
design flaw in Perl 5's regular expressions.  I was hoping we could have a
rational debate about it, rather than propagating the same design flaw into
Perl 6 through inertia or for legacy reasons.  (Is there some reason that
the Perl 5 regular expressions should be exempt from refinement?)

The example I gave has nine possible matches: "bbbcccddd", "bbbcccdd",
"bbbcccd", "bbcccddd", "bbcccdd", "bbcccd", "bcccddd", "bcccdd" and
"bcccd".  _All_ of these possible matches are OVERLAPPING matches.
Obviously greedy matching should (and does) match the longest match,
"bbbcccddd".  My contention is that non-greedy matching should match the
shortest match, "bcccd".

Now, the "exasperate" example quoted above is a good one.  It's also an
example of overlapping matches.  And I AGREE that "xasp" is what $1 should
match in the quoted example, not the shorter, later "rat" match.

Basically, I agree that eagerness SHOULD take priority, but not at the
expense of making a "non-greedy" gratuitously more greedy than necessary.

 [...] It's only when multiple possible matches start at the same point
 that you use minimal or maximal matching to break the tie.  If the
 starting points differ, there's no tie to break.

This is the point I take issue with.  In my original example, "bbbcccd"
had an earlier starting point than "bcccd", and I believe that WAS a tie
to break with non-greedy matching, to prefer "bcccd" over "bbbcccd".
After considering the interesting example of matching "exasperate", I've
got a more precise formulation to propose:  "tie-breaking" should be done
when potential matches have the same starting OR ending points.
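That tie-breaking formulation can be made concrete by brute force.  A sketch
in Python (stand-in for Perl; exhaustive search, nothing like a real
implementation, and the name proposed_nongreedy is hypothetical): among the
candidates sharing the engine's ending point, the shortest should win.

```python
import re

def proposed_nongreedy(pattern, s):
    """Among all substrings that fully match the pattern and share the
    ending point of the engine's leftmost match, return the shortest --
    the 'tie-breaking' rule proposed above."""
    m = re.search(pattern, s)
    if m is None:
        return None
    end = m.end()
    candidates = [s[i:end] for i in range(end)
                  if re.fullmatch(pattern, s[i:end])]
    return min(candidates, key=len)

s = "aaabbbcccdddeee"
print(re.search(r"b.*?d", s).group())    # current behavior: "bbbcccd"
print(proposed_nongreedy(r"b.*?d", s))   # proposed behavior: "bcccd"
```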

Since the "rat" match in "exasperate" would require pushing the ending
point back further, I'm in agreement that "xasp" is the correct match.
However, "bcccd" had the SAME ending point as "bbbcccd", so considering
"bbbcccd" a better match is really VERY arbitrary, and I believe flawed.

I might also note that this is probably a subtle effect of the fact that
Perl and other regular expression systems are based on English, therefore
making leftmost matching seem more "natural".  This is a cultural bias,
which would be different if English were a right-to-left language like
Hebrew or Arabic.  If it were, you can bet that we'd be arguing over
whether "bcccddd" or "bcccd" was the proper match in my example!

In fact, this brings to mind another possibility to consider for the Perl 6
regular expression engine.  It would probably be helpful for international
purposes to support rightmost matching (as a modifier), since it would
probably be helpful 

Re: Perl 5's non-greedy matching can be TOO greedy!

2000-12-14 Thread Deven T. Corzine


On Thu, 14 Dec 2000, Jeff Pinyan wrote:

 On Dec 14, Deven T. Corzine said:
 
  You're asking for something like
  
  /(?&lt;!b)(b.*?d)/
  
  which is an "optimization" you'll have to incorporate on your own.
 
 Thanks for the example.  Unfortunately, your attempted workaround doesn't
 even work for the example string; the "a" preceding "bbbcccd" isn't a
 "b", so the regexp engine is still perfectly happy with the same match.
 Even if it worked as you intended, it would have failed with something like
 "bbbabbbcccddd", since the ".*?" would happily match "bbabbbccc"...
 
 Sorry, I was thinking backwards (when I was supposed to think forwards).
 
 Using:
 
   /(b(?!b).*?d)/
 
 is what I'd meant to say.

True, this one works, and I should have thought of it.  Unlike the wrong
example of /(b[^b].*?d)/, using the zero-width lookahead assertion fixes
the problem for this regexp, even for matches of "bd".  (Again, not all
regexps will be as simple.)
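Both workarounds from this exchange, and the failure mode just described,
can be replayed.  A sketch with Python's re as a stand-in (both lookaround
forms exist there too; the strings are the ones from the thread):

```python
import re

s = "aaabbbcccdddeee"

# The corrected workaround: a b NOT followed by another b, i.e. the
# last b of a run, gives the shortest match for this string:
print(re.search(r"b(?!b).*?d", s).group())    # "bcccd"

# A lookbehind variant (a b not *preceded* by b, the first of a run)
# does NOT help -- the "a" before the run isn't a "b", so the long
# match survives untouched:
print(re.search(r"(?<!b)b.*?d", s).group())   # "bbbcccd"

# And even the lookahead trick is pattern-specific; with another run
# of b's in the middle it again overshoots the shortest match:
t = "bbbabbbcccddd"
print(re.search(r"b(?!b).*?d", t).group())    # "babbbcccd", not "bcccd"
```

Which illustrates the complaint: each workaround has to be hand-fitted to
the particular pattern, rather than falling out of the semantics.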

 I wouldn't call the current behavior a design flaw, though.

I do.  From an implementation standpoint, it may have been more convenient
than getting the semantics EXACTLY right, and the differences are subtle
enough that it's debatable how much effort is worth expending on fixing the
problem.  That's why I said it was "understandable", but it remains a flaw.

I haven't even SEEN an example where the current behavior is actually
preferable to my proposed behavior, have you?  (And I'd expect at least a
FEW, though I suspect there are probably more counterexamples.)

This is certainly a minor issue, but I believe the design is flawed in the
current Perl 5 implementation; why not fix the design properly for Perl 6?

Deven