Does ::: constrain the pattern engine implementation?
I have no objection to pattern operators like ::: in principle, but I do have a potential concern about them. Given that the operators are actually defined in terms of backtracking within the RE engine, does this constrain the implementation such that it MUST be a backtracking implementation to behave correctly? If these operators are purely efficiency optimization hints, that would be one thing, but I get the sense that ignoring the hints might lead to incorrect behavior. (The cut operator might be a special concern.)

Suppose, for the sake of argument, that someone wanted to make a pattern engine implementation, compatible with Perl 6 patterns, which was highly optimized for speed at the expense of memory, by RE -> NFA -> DFA construction for simultaneous evaluation of multiple alternatives without backtracking. This might be extremely expensive in memory, but there may be some niche applications where run-time speed is paramount, and a pattern is used so heavily in such a critical way that the user might be willing to expend hundreds of megabytes of RAM to make the patterns execute several times faster than normal. (Obviously, such a tradeoff would be unacceptable in the general case!)

Would it be _possible_ to create a non-backtracking implementation of a Perl 6 pattern engine, or does the existence of backtracking-related operators preclude this possibility in advance? I hope we're not constraining the implementation options by the language design, but I'm worried that this might be the case with these operators. Shouldn't it be an implementation decision whether to use backtracking?

Deven
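For what it's worth, the "simultaneous evaluation of multiple alternatives" idea is the classic lockstep NFA simulation: every live alternative is advanced in parallel, so each input character is examined once and nothing is ever re-read. Since there's no Perl 6 engine to test against yet, here's a minimal illustrative sketch in Python; the hand-built transition table for (a|b)*abb is just an example of my own, not anything from the design documents:

```python
# Minimal sketch of non-backtracking NFA simulation: all alternatives are
# advanced in lockstep, so each input character is examined exactly once.
# The transition table below is a hand-built NFA for the pattern (a|b)*abb.
NFA = {
    (0, 'a'): {0, 1},   # state 0 loops on a/b, and may start matching "abb"
    (0, 'b'): {0},
    (1, 'b'): {2},
    (2, 'b'): {3},
}
ACCEPT = {3}

def matches(nfa, accept, text):
    states = {0}
    for ch in text:
        # follow every possible transition simultaneously -- no backtracking
        states = set().union(*(nfa.get((s, ch), set()) for s in states))
        if not states:
            return False
    return bool(states & accept)

print(matches(NFA, ACCEPT, 'ababb'))   # True
print(matches(NFA, ACCEPT, 'abab'))    # False
```

The memory cost shows up when you go one step further and precompute the set-of-states transitions into a DFA, which is exactly the speed-for-memory tradeoff described above.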
Re: Does ::: constrain the pattern engine implementation?
On 28 Aug 2002, Simon Cozens wrote:

> [EMAIL PROTECTED] (Deven T. Corzine) writes:
> > Would it be _possible_ to create a non-backtracking implementation of a
> > Perl 6 pattern engine
>
> I don't believe that it is, but not just because of : and friends. Why
> does it matter?

I'm not saying we should dump the operators -- if we get more power by assuming a backtracking implementation, maybe that's a worthwhile tradeoff. On the other hand, if we can keep the implementation possibilities more open, that's always a worthwhile goal, even if we're not sure if or when we'll ever take advantage of those possibilities, or if we even could...

It seems like backtracking is a Bad Thing, in that it leads to reprocessing data that we've already looked at. On the other hand, it seems to be a Necessary Evil because of the memory costs of avoiding backtracking, and because we might have to give up valuable features without backtracking. It may be that backreferences already demand backtracking. Or some other feature might. I don't know; I haven't thought it through.

If we must have backtracking, so be it. But if that's a tradeoff we're making for more expressive and powerful patterns, we should still at least make that tradeoff with our eyes open. And if the tradeoff can be avoided, that's even better.

Deven
Re: Does ::: constrain the pattern engine implementation?
On Wed, 28 Aug 2002, Dan Sugalski wrote:

> At 10:57 AM -0400 8/28/02, Deven T. Corzine wrote:
> > Would it be _possible_ to create a non-backtracking implementation of a
> > Perl 6 pattern engine, or does the existence of backtracking-related
> > operators preclude this possibility in advance?
>
> In general, no of course it's not possible to create a non-backtracking
> perl regex engine. Far too much of perl's regexes requires backtracking.

Given that Perl 5 regexes are no longer regular (much less Perl 6), I'm sure this is probably true. There may be a regular subset which could be implemented without backtracking if problematic features are avoided, but surely a complete non-backtracking implementation is beyond reach.

On the other hand, :, ::, ::: and commit don't necessarily need to be a problem if they can be treated as hints that can be ignored. If allowing the normal engine to backtrack despite the hints would change the results, that might be a problem. I don't know; cut may pose special problems. Even if the new operators can't work without backtracking, maybe it doesn't matter, since there's surely a few others inherited from Perl 5 as well...

> That doesn't mean you can't write one for a specific subset of perl's
> regexes, though. A medium-term goal for the regex engine is to note where
> a DFA would give correct behaviour and use one, rather than going through
> the more expensive generalized regex engine we'd otherwise use.

I think this is a more realistic goal, and more or less what I had in mind. I believe there are many subpatterns which it might be beneficial to compile to a DFA (or DFA-like) form, where runtime performance is important. For example, if a pattern is matching dates, a (Jan|Feb|Mar|Apr|...) subpattern would be more efficient to implement as a DFA than with backtracking. With a large amount of data to process, that represents significant savings...

> If you want to head over to [EMAIL PROTECTED] and pitch in on the regex
> implementation (it's being worked on now) that'd be great.

I'd like to do that, if I can find the time. It would be interesting to make a small experimental prototype to see if DFA construction could really improve performance over backtracking, but it would probably need to be a very restricted subset of regex operations to test the idea...

However, while I'm still on perl6-language, I have two language issues to discuss first:

(1) Can we have a :study modifier in Perl 6 for patterns?

It could be a no-op if necessary, but it could take the place of Perl 5's study operator and indicate that the programmer WANTS the pattern optimized for maximum runtime speed, even at the cost of compile time or memory. (Hey, how about a :cram modifier for extreme optimization? :-)

(2) Would simple alternation impede DFA-style optimizations?

Currently, a pattern like (Jun|June) would never match "June" because the leftmost match "Jun" would always take precedence, despite the normal longest-match behavior of regexes in general. This example could be implemented in a DFA; would that always be the case? Would it be better for the matching of (Jun|June) to be undefined and implementation-dependent? Or is it best to require leftmost semantics?

Deven
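To make the (Jan|Feb|Mar|Apr|...) example concrete: an alternation of plain literals compiles naturally into a trie, which is just a small DFA over the alternatives, and matching then looks at each character once with no backtracking at all. A rough illustrative sketch in Python (the function names are mine, not any proposed API):

```python
# Illustrative sketch: an alternation of literals like (Jan|Feb|Mar|Apr|...)
# compiles into a trie, i.e. a small DFA over the alternatives. Matching
# then inspects each character once, with no backtracking at all.
MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[''] = w          # mark an accepting state with the word matched
    return root

def match_at(trie, text, i):
    """Return the (longest) alternative matching at position i, or None."""
    node, found = trie, None
    while True:
        if '' in node:
            found = node['']
        if i >= len(text) or text[i] not in node:
            return found
        node = node[text[i]]
        i += 1

trie = build_trie(MONTHS)
print(match_at(trie, 'Due in Mar or Apr', 7))   # 'Mar'
print(match_at(trie, 'Due in Mar or Apr', 0))   # None
```

A backtracking engine would instead try "Jan", fail, rewind, try "Feb", fail, rewind, and so on -- rescanning the same input up to twelve times at every position.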
Re: Does ::: constrain the pattern engine implementation?
On Wed, 28 Aug 2002, Larry Wall wrote:

> : (1) Can we have a :study modifier in Perl 6 for patterns?
> :
> : It could be a no-op if necessary, but it could take the place of Perl 5's
> : study operator and indicate that the programmer WANTS the pattern
> : optimized for maximum runtime speed, even at the cost of compile time or
> : memory. (Hey, how about a :cram modifier for extreme optimization? :-)
>
> Well, studied isn't really a property of a pattern--it's a property of a
> string that knows it will have multiple patterns matched against it. One
> could put a :study on the first pattern, but that's somewhat deceptive.

Oh yeah. I forgot it applied to the string, not the pattern! (I forgot since I never use it! :-) Still, it could be considered a parallel...

Perhaps a better approach would be to allow the optimization priorities to be specified, perhaps even as numerical ranges for relative importance? The three obvious dimensions to quantify would be compile time, run-time speed, and memory usage. There are often tradeoffs between these three, and allowing the programmer to specify his/her preferences could allow for aggressive optimizations that are normally inappropriate...

Of course, these would be useful not only as modifiers for compiling any regexes, but as general pragmas controlling the optimizing behavior of the entire Perl 6 compiler/optimizer...

I'm not sure if it's good enough to just say "optimize for run-time speed at the expense of compile time and memory" (or variations with only one of the two sacrificed) -- or if it's better to have a scale (say, 0-9) for how important each dimension is. For the extreme case where long compile time and large memory usage are irrelevant, but extreme run-time speed is a must, the programmer might specify optimization priorities of compile=0, memory=0, speed=9. I'm not sure what sort of syntax would be appropriate for such specifications...

> : (2) Would simple alternation impede DFA-style optimizations?
> :
> : Currently, a pattern like (Jun|June) would never match "June" because the
> : leftmost match "Jun" would always take precedence, despite the normal
> : longest-match behavior of regexes in general. This example could be
> : implemented in a DFA; would that always be the case?
>
> Well, "June" can match if what follows fails to match after "Jun".

True enough. Couldn't that still be implemented in a DFA? (Possibly at the cost of doubling the size of the DFA for the later part of the regex!)

> : Would it be better for the matching of (Jun|June) to be undefined and
> : implementation-dependent? Or is it best to require leftmost semantics?
>
> Well, the semantics shouldn't generally wobble around like that, but it'd
> be pretty easy to let them wobble on purpose via pragma (or via :modifier,
> which are really just pragmas that scope to regex groups).

Yeah, it's probably safer not to have that much room for undefined behavior, since people will just try it and assume that their implementation is the universal behavior...

Would there be a good way to say "don't care" about the longest-vs-leftmost matching semantics? Would it be worthwhile to have longest-trumps-leftmost as an optional modifier? (This might be very expensive if implemented in a backtracking engine, since it could no longer shortcut alternations...)

Dan suggested :dfa for DFA semantics -- is that the best answer, or would it be better to define the modifiers in terms of visible behavior rather than implementation, if possible?

Deven
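Larry's point that "June" can match if what follows fails after "Jun" is exactly what a traditional backtracking NFA does. Perl 6 doesn't exist to test against, but Python's re engine has the same traditional-NFA alternation semantics as Perl 5, so it illustrates both halves of the behavior:

```python
import re

# Leftmost alternative wins when either could match here:
assert re.match(r'Jun|June', 'June').group() == 'Jun'

# ...but "June" CAN match when what follows fails after "Jun": the $
# anchor fails at the trailing "e", the engine backtracks into the
# alternation, and the longer alternative succeeds.
m = re.match(r'(Jun|June)$', 'June')
print(m.group(1))   # 'June'
```

So "leftmost wins" really means "leftmost alternative is *tried* first" -- the later alternatives stay reachable through backtracking, which is the part a naive DFA construction would have to emulate.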
Re: Does ::: constrain the pattern engine implementation?
On Wed, 28 Aug 2002, Larry Wall wrote:

> That is a worthy consideration, but expressiveness takes precedence over
> it in this case.

I see nothing wrong with expressiveness taking precedence -- I'm only saying that it would be best to be cognizant of any restrictions we're hardcoding into the design (in case there's a less restrictive means to the same ends) and make that design tradeoff knowingly rather than by default.

If we can find general solutions that don't demand a particular style of implementation, that's probably an improvement. There may be unavoidable cases, in which case we decide to accept the limitation for expressiveness, and that's a perfectly reasonable design choice. I'd just hate to ignore the issue now, and have someone later say "here's a great way it could have been done that would have allowed this improvement in the implementation"...

> DFAs are really only good for telling you *whether* and *where* a pattern
> matches as a whole. They are relatively useless for telling you *how* a
> pattern matches. For instance, a DFA can tell you that you have a valid
> computer program, but can't hand you back the syntax tree, because it has
> no way to decide between shifting and reducing. It has to do both
> simultaneously.

Yes and no. You're right, but see below.

> : It may be that backreferences already demand backtracking. Or some other
> : feature might. I don't know; I haven't thought it through.
>
> I believe you are correct that backrefs require backtracking. Maybe some
> smart person will find a way to trace back through the states by which a
> DFA matched to retrieve backref info, but that's probably worth a PhD or
> two.

Well, there are certainly PhD students out there doing new research all the time. Who knows what one will come up with one day? It would suck if one gets a PhD for a super-clever pattern-matching algorithm, and we find that we can't use it because of hardcoded assumptions in the language design... As for backtracing the states of the DFA, see below.

> Minimal matching is also difficult in a DFA, I suspect.

Is it? I'm not sure. Since the DFA effectively follows all branches of the NFA at once, perhaps minimal matching is no more difficult than maximal? Then again, maybe not. :-)

> : If we must have backtracking, so be it. But if that's a tradeoff we're
> : making for more expressive and powerful patterns, we should still at
> : least make that tradeoff with our eyes open. And if the tradeoff can be
> : avoided, that's even better.
>
> I refer you to http:://history.org/grandmother/teach/egg/suck. :-)

Um, huh?

> That's a tradeoff I knowingly made in Perl 1. I saw that awk had a DFA
> implementation, and suffered for it as a useful tool.

I suspect it's not practical to have an all-DFA implementation with nearly the power and expressiveness of Perl 4 or Perl 5 regexes, much less Perl 6. On the other hand, many patterns have subpatterns which might benefit from using a DFA as an optimization. You don't lose the expressiveness here, if the backtracking NFA is available as well.

I'm just asking that we consider the impact on such optimizations, and see if we can leave the door open to reap the benefits without compromising the power and expressiveness we all want. Maybe this just amounts to adding a few modifiers to allow semantic variants (like longest-trumps-leftmost), to enable optimizations that would otherwise impinge on correctness...

> And it's not just the backrefs. An NFA can easily incorporate strategies
> such as Boyer-Moore, which actually skips looking at many of the
> characters, or the "scream" algorithm used by study, which can skip even
> more. All the DFAs I've seen have to look at every character, albeit only
> once. I suppose it's theoretically possible to compile to a
> Boyer-Moore-ish state machine, but I've never seen it done.

Okay, I confess that I've been saying "DFA" when I don't necessarily mean precisely that. What I really mean is a non-backtracking state machine of some sort, but I'm calling it a DFA because it would be similar to one (to the degree possible) and people know what a DFA is. I could say "NBSM", but that seems confusing. :-)

Your objections to the limitations of a DFA are quite correct, of course. Modifications would be required to overcome the limits, and then it's no longer really a DFA, just like Perl 5's regular expressions are no longer really regular expressions in the mathematical sense. I'm envisioning a state machine of some sort, which has a lot in common with a DFA but isn't strictly a DFA anymore. If you prefer, I'll call it an NBSM, or I'm open to better suggestions for a name!

Anyway, to respond to your objections to a DFA:

* While you couldn't hand back a syntax tree from a true DFA, it should be possible to create an NBSM from a DFA recognizer, modified to record whatever extra information is needed to execute the code that constructs the syntax tree. The
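On the backreference point: there's a formal footing for "backrefs require backtracking." A pattern with a backreference can describe a non-regular language (the set of doubled words, essentially { w sep w }), so no pure DFA can recognize it; a backtracking engine handles it directly. A quick illustration using Python's backtracking engine, since it shares this machinery with Perl 5:

```python
import re

# A backreference means "the same text again" -- here, a doubled word.
# The language of doubled words is not regular, so no pure DFA can
# implement \1; a backtracking engine captures the group and re-reads it.
m = re.search(r'\b(\w+) \1\b', 'it was was nice')
print(m.group(1))   # 'was'

# No doubled word, no match:
print(re.search(r'\b(\w+) \1\b', 'it was nice'))   # None
```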
Re: Does ::: constrain the pattern engine implementation?
On Wed, 28 Aug 2002, Steve Fink wrote:

> On Wed, Aug 28, 2002 at 12:55:44PM -0400, Deven T. Corzine wrote:
> > On Wed, 28 Aug 2002, Dan Sugalski wrote:
> > > At 10:57 AM -0400 8/28/02, Deven T. Corzine wrote:
> >
> > On the other hand, :, ::, ::: and commit don't necessarily need to be a
> > problem if they can be treated as hints that can be ignored. If allowing
> > the normal engine to backtrack despite the hints would change the
> > results, that might be a problem. I don't know; cut may pose special
> > problems.
>
> They do change the semantics. "(June|Jun) ebug" matches the string
> "Junebug", but "(June|Jun): ebug" does not. Similarly,
>
>     (June|Jun) ebug
>     (June|Jun) :: ebug
>     (June|Jun) ::: ebug
>     (June|Jun) commit ebug
>
> all behave differently when embedded in a larger grammar.

Good point. Okay, they definitely change the semantics. Still, could such semantics be implemented in a non-backtracking state machine, whether or not it's a strict DFA?

> However, it is very possible that in many (the majority?) of actual uses,
> they may be intended purely as optimizations, and so any differing
> behavior is unintentional. It may be worth having a flag noting that
> (maybe combined with a warning: "you said this :: isn't supposed to change
> what can match, but it does.")

I think this is like the leftmost matching semantics -- it may exist for the sake of implementation efficiency, yet it has semantic consequences as well. In many cases, those semantic differences may be immaterial, yet some code relies on it. Allowing flags to specify that such differences in semantics are immaterial to your pattern would be helpful. (Would it make sense for one flag to say "don't care" to the semantic differences for BOTH leftmost matching and the :/::/:::/etc. operators?)

> > I believe there are many subpatterns which might be beneficial to
> > compile to a DFA (or DFA-like) form, where runtime performance is
> > important. For example, if a pattern is matching dates, a
> > (Jan|Feb|Mar|Apr|...) subpattern would be more efficient to implement as
> > a DFA than with backtracking. With a large amount of data to process,
> > that represents significant savings...
>
> I agree. I will reply to this in perl6-internals, though.

Yes, the discussion of details belongs there, when it's not infringing on issues of language design, as the semantic consequences do...

> > (2) Would simple alternation impede DFA-style optimizations?
> >
> > Currently, a pattern like (Jun|June) would never match "June" because
> > the leftmost match "Jun" would always take precedence, despite the
> > normal longest-match behavior of regexes in general. This example could
> > be implemented in a DFA; would that always be the case?
>
> You should read Friedl's Mastering Regular Expressions, if you haven't
> already. A POSIX NFA would be required to find the longest match (it has
> to work as if it were a DFA). A traditional NFA produces what would result
> from the straightforward backtracking implementation, which often gives an
> answer closer to what the user expects. IMO, these are often different,
> and the DFA would surprise users fairly often.

Rather than following a traditional approach to NFA/DFA construction, would it be possible to use a modified approach that preserves leftmost matching? (If so, would it be more expensive, or just different?)

> > Would it be better for the matching of (Jun|June) to be undefined and
> > implementation-dependent? Or is it best to require leftmost semantics?
>
> I'm voting leftmost, because I've frequently seen people depend on it.

I was going to agree, until I read your next paragraph. However, it would be useful to be able to say "don't care" to the semantic distinction -- it might even be useful to be able to demand that longest-match take precedence over leftmost matching, but that would incur an extra cost in the normal regex engine...

> I'm not so sure that Larry's suggestion of adding a :dfa flag is always
> the right approach, because I think this might actually be something
> you'd want to set for subsets of a grammar or a single expression. I
> don't think it's useful enough to go as far as proposing that || mean
> "alternate without defining the order of preference", but perhaps some
> angle-bracketed thing would work. (Or can you embed flags in expressions,
> like perl5's (?imsx:R) thing? Then the :dfa flag is of course adequate!)

You know, when you bring up an idea like ||, I start thinking that maybe the default should be NOT to have a preference (since it normally doesn't matter) and to only guarantee the leftmost short-circuit behavior with || instead of |. That would allow for more implementation flexibility, and provide a beautiful parallel with C -- in C, only || short-circuits, and the | operator still evaluates all parts. (Granted, that's because it's bitwise, but there's still a nice parallel there.) For the few cases where someone wants to rely on leftmost matching of any alternation, they could simply use
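Steve's (June|Jun)ebug example is easy to make concrete. The Perl 6 `:` operator can't be run yet, so this Python sketch fakes the cut by hand: match the alternation once, commit to that choice, and refuse to revisit it (the `committed` helper is my own illustration, not any real API):

```python
import re

# Ordinary backtracking: (June|Jun)ebug matches "Junebug". The engine
# first tries "June", fails to find "ebug" after it, backtracks into the
# alternation, and succeeds with "Jun" + "ebug".
assert re.match(r'(June|Jun)ebug', 'Junebug').group(1) == 'Jun'

def committed(alternatives, rest, text):
    """Fake the ':' cut: take the first alternative that matches,
    then refuse to revisit that choice."""
    for alt in alternatives:
        if text.startswith(alt):
            return text[len(alt):].startswith(rest)   # no backtracking
    return False

# "June" is chosen first, the cut forbids retrying, and the match fails:
# the "hints" really do change what can match.
print(committed(['June', 'Jun'], 'ebug', 'Junebug'))   # False
```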
Copy-restore on parameters? (was Re: Autovivi)
On Wed, 14 Aug 2002, Luke Palmer wrote:

> Why? We could make arglists exactly equivalent to the way they're done in
> Perl 5, which is a good way.
>
>     sub foo($a, $b, *@c) {...}
>
> Would be exactly equivalent to Perl 5's
>
>     sub foo { my ($a, $b, @c) = @_; ... }

I've got another idea. How about using a copy-restore technique?

First, assume all arguments are pass-by-reference, whether constant or not.

Second, allow "is ro" as a modifier to force constant semantics -- in this case, don't make a copy, just use the reference and refuse to modify it.

Third, allow "is rw" as a modifier to enable copy-restore, and to make the reference modifiable.

Fourth, copy all named parameters into local variables from the passed reference (except for "is ro" parameters, which should be aliased) -- and then keep @_ with the actual references, as in Perl 5. (Unlike Perl 5, the "is ro" parameters would not be modifiable even via @_.)

Finally, for "is rw" parameters, copy the modified local variable back to the referenced parameter (the original one, even if @_ was shifted!), but ONLY on a normal subroutine exit.

This would allow pass-by-value semantics by default for convenience, easily forced read-only parameters, and pass-by-reference semantics when desired, but with a twist -- if the function aborts (throws an exception), no parameters have been changed, which offers hassle-free transaction-like atomicity. If a function really wants to modify its parameters despite throwing an exception, it could do so via @_ directly.

Or, we could also allow another modifier "is ref" to use simple (modifiable) pass-by-reference semantics, but default to copy-restore for "is rw"...

This seems like it could be the best of all worlds -- what do people think?

Deven
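To make the copy-restore idea concrete, here's a small Python sketch of the proposed "is rw" semantics. A one-element list stands in for a Perl reference, and the helper name is mine, purely illustrative:

```python
# Sketch of the proposed "is rw" copy-restore semantics. A one-element
# list stands in for a Perl reference; names here are illustrative only.
def copy_restore_call(func, ref):
    local = [ref[0]]        # copy the caller's value in
    result = func(local)    # the body mutates its local copy freely
    ref[0] = local[0]       # copy back ONLY on a normal exit
    return result

cell = [10]
copy_restore_call(lambda p: p.__setitem__(0, p[0] + 1), cell)
print(cell[0])              # 11 -- the change was copied back

def doubler_that_dies(p):
    p[0] *= 2               # modifies its copy...
    raise RuntimeError('abort')

cell2 = [10]
try:
    copy_restore_call(doubler_that_dies, cell2)
except RuntimeError:
    pass
print(cell2[0])             # 10 -- on exception, the caller's value is untouched
```

Because the write-back sits after the call and is simply skipped when an exception propagates, the transaction-like atomicity falls out for free.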
Re: Copy-restore on parameters? (was Re: Autovivi)
On Thu, 15 Aug 2002, Larry Wall wrote:

> On Thu, 15 Aug 2002, Deven T. Corzine wrote:
> : I've got another idea. How about using a copy-restore technique?
>
> I suspect that would make Perl 6's sub calls even slower than Perl 5's.

Yes and no.

For the normal case (pass-by-value semantics), it should be about the same speed as Perl 5, since it would be doing the same thing: making a copy of a value that was passed by reference.

For the "is ro" or "is ref" case, it should be FASTER, since it could avoid the copy by aliasing the reference.

Only the "is rw" case would be slower, since it would involve two copies rather than one. However, it would offer the benefit of rolling back any changes to the parameters in case of an abnormal exit from the function. This seems like a valuable feature that could often be useful -- and anyone who wants to avoid the speed penalty could stick with "is ref", assuming that option is also allowed...

Deven
Re: Autovivi
On Mon, 5 Aug 2002, Dan Sugalski wrote:

> At 1:30 PM +1000 8/6/02, Damian Conway wrote:
> > Luke Palmer asked:
> > > Does:
> > >
> > >     print %foo{bar}{baz};
> > >
> > > still create %foo{bar} as a hashref if it doesn't exist?
> >
> > It is my very strong hope that it will not.
>
> Unless Larry declares otherwise, it won't.

I brought up this question on 12/22/00, and it didn't get much response, except from Graham Barr, who pointed out that there might be some trouble with constructs like func($x{1}{2}{3}) that may or may not need to autovivify the argument, if the function assigns to $_[0]. (This seems a bit like the difficulty of implementing hypothetical values in regexes; perhaps a similar solution could apply to both?)

Subsequent to that mini-discussion, I had occasion to talk to Larry in person, and asked him about the autovivification behavior -- his response was that he considered Perl 5's behavior of autovivifying read-only values to be a misfeature, and that he hopes Perl 6 won't do so.

Now, since Larry, Dan and Damian all seem to be in agreement on this issue, I sincerely hope Perl 6 will indeed break from Perl 5's misbehavior here... However, will the func($x{1}{2}{3}) case cause an implementation problem?

Deven
Re: Autovivi
On Tue, 13 Aug 2002, Larry Wall wrote:

> On Tue, 13 Aug 2002, Deven T. Corzine wrote:
> : However, will the func($x{1}{2}{3}) case cause an implementation problem?
>
> This is why the new function type signatures will assume that parameters
> are constant. If you want a modifiable parameter, you have to say "is rw".
> Then it's considered an lvalue. In Perl 5, all function args had to be
> considered potential lvalues.

That will certainly help. What about the "is rw" case, though? Will we be able to avoid autovivifying the parameter if it's never actually assigned, or do we assume this edge case isn't worth the hassle, and that it will follow Perl 5's behavior for such variables?

> In essence, all Perl 5 functions have a signature of (*@_ is rw). Perhaps
> the translator can turn some of those into (*@_). What'd really be cool
> is if it could pick up an initial
>
>     my ($a, $b, $c) = @_;
>
> and turn it into an appropriate signature.

That would be pretty cool. It would also be nice to handle the common alternative form of using shift as well...

Of course, there are issues here if the code modifies those variables, since the issue of whether a variable is rw is really distinct from whether it represents a pass by value or reference. Slapping a "constant" on it is a bald-faced attempt to get the speed of pass-by-reference with the guarantees of pass-by-value.

The only accurate way to know if the code modifies the variables is to do some sort of dataflow analysis, and it can't be 100% accurate even then. (Suppose a shift may or may not happen, depending on the parameters, and then $_[2] is modified?) Of course, it should often be possible (at least in principle) to determine that it's impossible for a particular parameter to be modified, and play it safe by assuming "is rw" for undeterminable cases.

> Perhaps there should be a way to declare a parameter to be pass-by-value,
> producing a modifiable variable that does not affect the caller's value.
> But I'm not sure saving one assignment in the body is worth the extra
> mental baggage.

I'd argue that it's worthwhile -- not because of the cost of that single assignment, but because you'd otherwise have two variables to keep track of: the original parameter and the modifiable copy. This seems like greater mental baggage (to me) than remembering another modifier.

How about "is copied" as a pass-by-value modifier? (I was inclined to say "is a copy", but I didn't think a 3-word phrase would be desirable here...)

Deven
Re: Autovivi
On Tue, 13 Aug 2002, Nicholas Clark wrote:

> On Tue, Aug 13, 2002 at 03:06:40PM -0400, Deven T. Corzine wrote:
> > The only accurate way to know if the code modifies the variables is to
> > do some sort of dataflow analysis, and it can't be 100% accurate even
> > then. (Suppose a shift may or may not happen, depending on the
> > parameters, then $_[2] is modified?) Of course, it should often be
> > possible (at least in principle) to determine that it's impossible for a
> > particular parameter to be modified, and play it safe by assuming "is
> > rw" for undeterminable cases.
>
> Well, perl5 does already manage to avoid auto-vivifying hash keys when
> they are used as subroutine arguments. It uses magic, rather than
> dataflow analysis:

Yes, magic obviously can work, but it's a greater run-time penalty -- if you do some sort of dataflow analysis, you pay the penalty at compile time instead. I'm not sure which is better, ultimately. It probably varies.

Anyway, the Perl 5 magic you speak of presumably only works for a single level of hash referencing -- we were specifically discussing a multilevel reference, which immediately creates missing substructures in Perl 5, even when used in a read-only rvalue context in a simple Perl 5 expression...

Deven
Re: Autovivification behavior
On Sat, 23 Dec 2000, Graham Barr wrote:

> This has been discussed on p5p many many times. And many times I have
> agreed with what you wrote. However one thing you did not mention, but
> does need to be considered, is
>
>     func($x{1}{2}{3})
>
> at this point you do not know if this is a read or write access, as the
> sub could do $_[0] = 'fred'. If this can be handled in some way so that
> the autoviv does not happen until the assignment then OK, but otherwise I
> think we would have to stick with perl5 semantics.

This is a good point. Similar arguments apply to grep, map, foreach, etc.

Could we have some sort of "lazy evaluation" mode for lvalues, where a reference to an undefined substructure sets a flag and saves the expression for the lvalue, returning undef whenever evaluated in a read-only context and autovivifying only when necessary to write to the lvalue?

> Also we would probably need a "use autoviv" pragma so that perl5 scripts
> which are ported, and rely on the autovivify behavior of hashes, will
> still work as before.

This makes sense, if any programs rely on the current behavior. (Do you know of any such programs?)

Deven
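The lazy-lvalue idea can at least be prototyped. Here's an illustrative Python sketch (class names are my own): a read of a missing key returns a placeholder that remembers the path, deeper reads stay lazy, and only an actual assignment vivifies the intervening structure:

```python
# Sketch of the proposed lazy-lvalue autovivification: reading a missing
# key returns a placeholder recording the path; only an actual assignment
# creates the intervening structure. Class names are illustrative only.
class AutoDict(dict):
    def __getitem__(self, key):
        return super().__getitem__(key) if key in self else _Pending(self, key)

class _Pending:
    def __init__(self, parent, key):
        self._parent, self._key = parent, key
    def __bool__(self):                 # reads evaluate like "undef"
        return False
    def __getitem__(self, key):         # deeper reads stay lazy too
        return _Pending(self, key)
    def __setitem__(self, key, value):  # a write vivifies the saved path
        self._vivify()[key] = value
    def _vivify(self):
        parent = (self._parent._vivify()
                  if isinstance(self._parent, _Pending) else self._parent)
        return parent.setdefault(self._key, AutoDict())

x = AutoDict()
print(bool(x['a']['b']['c']))   # False -- reads act like undef
print(len(x))                   # 0     -- and nothing was created
x['a']['b']['c'] = 42
print(x['a']['b']['c'])         # 42    -- the write created the path
```

This is essentially what func($x{1}{2}{3}) needs: the placeholder could be handed to the sub, and assigning through $_[0] would be the write that finally vivifies.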
Autovivification behavior
Can the autovivification behavior be changed slightly for Perl 6?

Specifically, can we suppress autovivification in read-only circumstances, and evaluate the expression as "undef" immediately, instead of autovivifying empty data structures that didn't exist before?

The current behavior in Perl 5 is inconsistent. Attempting to reference a hash or array element doesn't normally cause that element to spring into existence, but attempting to reference substructures DOES suddenly cause that substructure to spring into existence. This behavior is desirable when writing to that substructure, but undesirable when reading from it, and inconsistent with the normal behavior of single-level Perl structures. Referencing a nonexistent entry in a hash or array doesn't create that entry UNLESS it's autovivified as a reference. "%x = (); $x{1};" does NOT have the same effect as "%x = (1 => undef);"; why does "%x = (); $x{1}{1};" have the same effect as "%x = (1 => {});"?

If the programmer has never stored any values into %x, it would be reasonable to expect it to remain empty; why violate this reasonable expectation just because of an attempted reference to a nonexistent structure?

Here is an example (in the Perl debugger) demonstrating this inconsistency:

  DB<1> x \%x               # %x starts empty
0  HASH(0x82136d0)
     empty hash
  DB<2> x $x{1}             # access a nonexistent entry
0  undef
  DB<3> x \%x               # %x remains empty
0  HASH(0x82136d0)
     empty hash
  DB<4> x $x{1}{2}{3}{4}    # access nonexistent substructures
0  undef
  DB<5> x \%x               # %x no longer empty!
0  HASH(0x82136d0)
   1 => HASH(0x829a33c)
      2 => HASH(0x829a378)
         3 => HASH(0x829a408)
              empty hash
  DB<6> x \@x               # @x also starts empty
0  ARRAY(0x829a420)
     empty array
  DB<7> x $x[3]             # access a nonexistent entry
0  undef
  DB<8> x \@x               # @x remains empty
0  ARRAY(0x829a420)
     empty array
  DB<9> x $x[3][4]          # access nonexistent substructure
0  undef
  DB<10> x \@x              # @x no longer empty!
0  ARRAY(0x829a420)
   0  undef
   1  undef
   2  undef
   3  ARRAY(0x829a554)
        empty array

Now, I understand why this happens, but is it really a good thing that it happens in the first place? We're not storing into the nonexistent substructures (in which case you DO want autovivification, of course), only attempting to access a possible value in a substructure that may not exist. This access attempt is creating the intervening substructures, despite appearing to be a "read-only" operation on the data structure it modifies. (This would be closely analogous to having Unix create intervening directories for a nonexistent pathname before reporting a "No such file or directory" error for the full pathname...)

Currently, it's sometimes necessary to do "defined" tests on intermediate levels to work around this behavior before testing inside substructures that may not exist. Couldn't we change this for Perl 6?

Deven
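Incidentally, the same contrast is easy to reproduce outside Perl. Python's ordinary dicts behave the way I'm asking for (a failed read creates nothing), while a recursive collections.defaultdict reproduces the Perl 5 behavior, where a mere read springs the intervening levels into existence:

```python
from collections import defaultdict

# An ordinary dict behaves the way I'm proposing: a failed multilevel
# read raises an error and creates NOTHING.
plain = {}
try:
    plain['a']['b']
except KeyError:
    pass
print(plain)        # {} -- still empty after the attempted read

# A recursive defaultdict reproduces Perl 5's behavior: the mere READ of
# a nonexistent substructure creates the intervening levels.
def tree():
    return defaultdict(tree)

auto = tree()
auto['a']['b']      # a read-only access...
print('a' in auto)  # True -- ...yet it created auto['a'], like Perl 5
```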
Re: Perl 5's non-greedy matching can be TOO greedy!
On Thu, 14 Dec 2000, Nathan Torkington wrote:

> Deven T. Corzine writes:
> > I haven't even SEEN an example where the current behavior is actually
> > preferable to my proposed behavior, have you? (And I'd expect at least a
> > FEW, though I suspect there are probably more counterexamples.)
>
> I think the biggest problem with your idea is that it requires the engine
> to keep looking even after it finds a match, to see if there's another
> shorter match. This would make *every* match much much slower,
> potentially heat-death-of-the-universe slower.
>
> I like the current semantics because it's very easy to visualize the
> engine acting on your instructions and stopping as soon as it finds a
> match. I am a programmer, and I prefer programs to descriptions.

Not at all. I don't want it to keep looking after it finds the first match. I want it to make sure that match isn't unnecessarily long, if non-greedy matching was in use.

Conceptually (I don't think this would be a good implementation), you find the first match as the current engine does, then search for the smallest possible match WITHIN that first match. Since it will already be as short as possible from the starting point, this amounts to advancing the starting point as far as possible without changing the ending point, as long as it still matches.

Deven
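The conceptual description above is easy to prototype on top of an existing engine. A Python sketch (the helper name is mine, and this is the naive "conceptual" algorithm, not a serious implementation):

```python
import re

def shortest_within_first(pattern, text):
    """Sketch of the proposal: take the engine's first match, then advance
    the starting point as far as possible without changing the end point."""
    m = re.search(pattern, text)
    if m is None:
        return None
    start, end = m.span()
    best = text[start:end]
    for s in range(start + 1, end):
        if re.fullmatch(pattern, text[s:end]):  # still matches, same end
            best = text[s:end]                  # keep the latest (furthest) start
    return best

# Perl 5 semantics: the LEFTMOST B wins, so B.*?D matches "BcBcD"...
print(re.search(r'B.*?D', 'xBcBcD').group())      # 'BcBcD'
# ...proposed semantics: the match shrinks to the shortest one within it.
print(shortest_within_first(r'B.*?D', 'xBcBcD'))  # 'BcD'
```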
Re: Perl 5's non-greedy matching can be TOO greedy!
On 14 Dec 2000, Randal L. Schwartz wrote: "Deven" == Deven T Corzine [EMAIL PROTECTED] writes: Deven I haven't even SEEN an example where the current behavior is Deven actually preferable to my proposed behavior, have you? (And Deven I'd expect at least a FEW, though I suspect there are probably Deven more counterexamples.) If I want the leftmost B that is followed by the shortest number of characters to get to a D, I write /B.*?D/. The "leftmost" part keeps getting left out of regex discussions, and maybe that's why you're thrown off. But it's a pretty consistent overriding factor. That's not an example, that's a hypothetical. I meant that I've never seen a concrete, realistic example where the current behavior is more beneficial to the programmer than my proposed behavior. (I imagine in most cases, it will be a moot point, since the match will usually be the same.) If you want something odd like "not necessarily the leftmost", then you'll need to speak more. But "leftmost" is fundamental to the design of regex. Don't mess with that. Or don't call it a regex any more. Strange argument. Greedy matching was once considered fundamental to the design of regex, and the "leftmost" behavior is 100% consistent with greedy matching. Yet Perl 5 added non-greedy modifiers, changing a fundamental aspect of every preceding regex system, and still called it a regex... Extending the design of the regex system to allow for non-greedy matching was an excellent extension, and made Perl 5's regular expressions far more powerful than anyone else's. But they're certainly different from before. I wish I had been involved in the discussions when the idea of non-greedy matching was proposed and the details ironed out. I don't think my ideas would have had the same kind of concerted opposition back then as now... Deven
Re: Perl 5's non-greedy matching can be TOO greedy!
On 15 Dec 2000, Randal L. Schwartz wrote: "Deven" == Deven T Corzine [EMAIL PROTECTED] writes: Deven As for special-case rules, I believe that my proposed modification would Deven REMOVE a special-case semantic rule, at the cost of added complexity at the Deven implementation level. (The cost decision of whether that added complexity Deven is worthwhile is a separate consideration.) No, it would break a much higher overriding rule of "left most match wins". That's at the top of the chart. You're *adding* an exception to that. Tell me how you can do that without breaking much existing code. Can you give a concrete, real-life example of code that my proposed change would actually break, not a contrived hypothetical case designed to break? Adding the requirement for "\@" in double-quoted strings broke a LOT of Perl 4 code. I doubt this would break much at all. Deven And I'd really appreciate it if everyone would refrain from Deven suggesting that I don't understand the behavior. I understand Deven it fine; I just don't agree with it. In the language of the Deven Supreme Court, "I respectfully dissent." Just because I don't Deven perfectly agree with the semantics that were chosen doesn't Deven mean I don't understand them. You don't understand the motivation, apparently. That's what I'm referencing. I don't understand the motivation behind taking leftmost matching to such extremes. I understand the importance of leftmost matching in general; this is a particular situation where I believe it's being taken a little too far for the good of the whole. However, some of the comments that have been made seem to imply that I'm simply ignorant about regular expressions and I need to learn them better. My questioning the behavior arises out of philosophy, not ignorance. Deven
Re: Perl 5's non-greedy matching can be TOO greedy!
On Thu, 14 Dec 2000, Jeff Pinyan wrote: You could use my sexeger technique to get this behavior (possibly): $string = "aaabbbcccdddeee"; # regex to be reversed: /b(.*?)d/ $revstr = reverse $string; ($match) = $revstr =~ /d(.*?)b/; No, that doesn't quite work. It works when you unroll the .*? loop: # regex to be reversed: /b([^bd]*)d/ ($match) = $revstr =~ /d([^bd]*)b/; But then why would you use sexeger at all, when you can unroll the loop in the forward direction? ($match) = $string =~ /b([^bd]*)d/; I'm not asking for help with workarounds. If I ever need to work around the current behavior, I won't have any trouble doing so. It's just that (on a philosophical note) I don't believe this is something that should require a workaround. Obviously, others disagree. Deven
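Both workarounds in this exchange can be checked outside Perl. This is just an illustration: Python's `re` applies the same leftmost/lazy rules here, so the sexeger trick (reverse string, reverse pattern, reverse result) with the unrolled `[^bd]*` loop, and the plain forward unrolled version, both produce the shortest match Deven wants.

```python
import re

string = "aaabbbcccdddeee"

# Sexeger: match the hand-reversed, unrolled pattern against the reversed
# string, then reverse the captured text back.
rev_match = re.search(r'd[^bd]*b', string[::-1]).group()
print(rev_match[::-1])                         # 'bcccd'

# The same unrolled loop works directly in the forward direction, which is
# Deven's point about not needing sexeger at all for this case.
print(re.search(r'b[^bd]*d', string).group())  # 'bcccd'
```

The unrolling only works because `[^bd]*` forbids re-crossing a `b`; the general `.*?` form reversed naively would still pick up the too-long match.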
Re: Perl 5's non-greedy matching can be TOO greedy!
On Thu, 14 Dec 2000, James Mastros wrote: On Thu, Dec 14, 2000 at 04:10:12PM -0500, Deven T. Corzine wrote: The crux of the problem is that non-greedy qualifiers don't affect the "earliest match" behavior, which makes the matches more greedy than they really ought to be. Right. We've got a terminology issue. There are two axes here: Not the only terminology issue, but I've addressed that in other messages. Greed: Greedy by default. Will try to match as many chars after its starting point as possible. Non-greedy will try to match as few chars after its starting point as possible, and is denoted by putting a '?' after a quantifier. Laziness: Eager by default. An eager assertion will try to match as early as possible. It isn't possible to use lazy assertions at present. (Not simply, anyway. It's probably possible to combine other (?) features to get the same effect.) You seem to want the choice between greedy, eager assertions, and non-greedy, lazy assertions. If you came up with a good way to specify along both axes, I think we'd have a winner. I want the maximum level of OVERALL consistency for regular expressions as a whole, rather than immutable adherence to the "leftmost trumps nongreedy" rule currently in place. Most of the time, I agree with the precedence of leftmost over nongreedy. The example I gave is a case where I believe the strict adherence to the leftmost rule actually introduces complexity and makes the regular expression system less self-consistent. I don't know if anyone else will ever believe me or not. Deven
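James's two axes can be demonstrated concretely. A sketch using Python's `re` (which shares Perl 5's rules here): both the greedy and non-greedy quantifiers are "eager" about the starting point, and the missing lazy-start combination can only be emulated by trying every start by hand.

```python
import re

s = "aaabbbcccdddeee"
pat = re.compile(r'b.*?d')

# Eager start (what Perl 5 and Python both do): commit to the leftmost
# starting point, then minimize the length from there.
print(pat.search(s).group())   # 'bbbcccd'

# A "lazy-start" variant, emulated by trying every starting point and
# keeping the shortest match overall -- the greedy/lazy combination James
# says isn't directly expressible today.
shortest = min((m.group() for m in (pat.match(s, i) for i in range(len(s))) if m),
               key=len)
print(shortest)                # 'bcccd'
```

The emulation rescans from every position, which hints at why Nathan expects a naive implementation of lazy starts to be slow.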
Re: Perl 5's non-greedy matching can be TOO greedy!
On Fri, 15 Dec 2000, Deven T. Corzine wrote: Not at all. I don't want it to keep looking after it finds the first match. I want it to make sure that match isn't unnecessarily long, if non-greedy matching was in use. Conceptually (I don't think this would be a good implementation), you find the first match as the current engine does, then search for the smallest possible match WITHIN that first match. Since it will already be as short as possible from the starting point, this amounts to advancing the starting point as far as possible without changing the ending point, as long as it still matches. Actually, I'm not sure -- it's conceivable that the ending point would ALSO move inward for a different starting point within the original match. But the ending point should NEVER be advanced further -- that's where the "leftmost over nongreedy" rule should apply instead... Deven
Re: Perl 5's non-greedy matching can be TOO greedy!
On 15 Dec 2000, Randal L. Schwartz wrote: "Deven" == Deven T Corzine [EMAIL PROTECTED] writes: Deven What surprised me was how vigorously people would defend the Deven status quo, and insist on the correctness of the current Deven behavior without thinking it through. No, I thought it through quite completely. As have others. You and others spent plenty of time thinking this through a long time ago. I wish I had been involved with that process, to raise these points then. Have you thought it through NOW, on a purely semantic level (in isolation from implementation issues and historical precedent), or are you rejecting the idea out-of-hand because you believe you've already given the matter enough consideration? Deven Given how invested people are in the exact current behavior, I Deven now believe it was a poor choice of words to describe it as a Deven "flaw", simply because it presumed an implied consensus on the Deven higher-level semantics that obviously wasn't there. Quite the opposite. You seem to be one of the very few who expects it to act other than as documented. Really? I haven't taken a survey, but I did ask one co-worker for his first impression of what the regexp (from my example) would match. Not being an experienced Perl programmer, but being familiar with regular expressions, he believed he understood the idea of non-greedy matching. His expectation? That it would match "bcccd", not "bbbcccd". Is it that few people expect this, or that they get smacked down so fast when they suggest it that they never dare raise the question again? Deven It seems to have been interpreted as a value judgement on my Deven part, which it wasn't. It merely occurred to me that Perl 6 Deven might provide an opportunity to eliminate a minor quirk in the Deven regular expression system. 
I didn't mean to imply that the Deven current behavior is BAD, simply that it's not quite right (at Deven least in my mind) -- since there's serious disagreement about Deven this, I'd like to make a shift in terminology and start Deven referring to this behavior as a "semantic anomaly" rather than Deven a "flaw" or a "bug", and hope that will be a more neutral term. It's not an anomaly at all. It comes out completely accurate with some very simple rules for how regex works. It's an anomaly in my mind. Whether it's a real anomaly or an imaginary one, I left that one up in the air for the moment. Personally, I believe the anomaly is real, but quite dependent on the perspective from which you view the situation. Yes, the existing rules are accurate, predictable and simple. Are they the BEST possible rules, or just what made the most sense at the time? Admit it... it bit you, and you are just refusing to believe that you don't have a complete and accurate model for how regex work. Please, admit this, and we can MOVE ON. Actually, it's never bit me. I only discovered it by experimenting with contrived examples to see exactly how the new (at that time) non-greedy matching feature worked. I thought the behavior was a little odd, but the explanation made enough sense that it didn't seem worth arguing about. Now that Perl 6 is on the horizon, it seems time to consider the idea. I have yet to run into a real-life situation where this anomaly really matters much. Which might be an argument not to worry about it. But the fact that I was already extremely familiar with greedy regexps, and (at first) surprised by this behavior, makes me wonder what's the best way. Deven Hopefully, we can have a rational discussion about whether this Deven semantic anomaly is real or imagined, what impact "fixing" it Deven would have on the implementation (if it's deemed real), and Deven whether it's worth "fixing". You can't fix what isn't broken. 
I put "fixing" in quotes because it depends on whether it's broken or not. I did the same thing that everyone else is doing, skipping the first part of the discussion on the assumption that the answer was OBVIOUS. Well, it clearly isn't as obvious as we all believe, since what seems obvious to me and what seems obvious to you are in conflict. Everyone keeps trying to frame it as my presumed "misunderstanding" of how it should work, when it's actually a question of perspective. Since nobody seems willing to contemplate this from the perspective that I'm presenting, it's difficult to conclude whether or not it's actually broken. Everyone simply assumes I'm wrong and doesn't care beyond that. Deven If the final decision is not to change the current behavior, Deven for whichever reason, I'd like to see this documented in an RFC Deven that says "here's what was requested and why it isn't going to Deven be done". I'll volunteer to help with that (even if I remain Deven in the minority), whether by summarizing or cutting an
Re: Perl 5's non-greedy matching can be TOO greedy!
[I delayed responding to this message because it was the longest.] On Thu, 14 Dec 2000, Tom Christiansen wrote: No question that's how it's been implemented. But WHY would anyone want such behavior? When is it beneficial? It is beneficial because this is how it's always been, because it is faster, because it is more expressive, because it is more powerful, because it is more intuitive, and because it is more perlian. Nice summary, but I'm not buying what you're selling in the elaboration. In elaboration: 0) All NFAs before POSIX acted this way. It is historically consistent and perfectly expected. First, all those older regular expression systems were inherently greedy matching algorithms. There is absolutely no conflict between earliest match and greedy matching. The conflict only arises when non-greedy matching is introduced into the equation. Second, by referring to NFAs, you're already dropping down to a lower level, and as I said in a prior message, you've lost the gestalt by then. Once you drop to that level (and stop referencing the semantics of the whole), my point is moot, because it no longer makes sense at that level. Finally, historical precedent is always a justification not to change anything. Of course, that argued against the wonderful regex extensions that Perl 5 added, since they went against the way "it's always been"... 1) It is obviously faster to come to an answer earlier on in the execution than it would be to come to an answer later. It's like an expression whose evaluation short-circuits. Also, when the matching semantics permit backtracking and backreferences, the combinatoric possibilities can easily explode into virtual unsolvability as the 2**N algorithm loses its race to the heat death of the universe. Yes, if Perl did overall-longest or overall-shortest, this would produce a more predictable time; however, as we see with DFAs and POSIX NFAs, this prediction plays out as guaranteed *WORST-CASE* time. 
It is not acceptable to make everyone pay the worst-case time. Never penalize the whole world for the needs or desires of the few. I don't think it should be implemented unless the cost is small or applied only to those situations where it's wanted despite the cost. However, I don't believe it would NECESSARILY be slower to execute -- yes, there's more complexity. The price for that complexity must be paid, but it may be payable in memory by making a larger (but equally fast) NFA for the regular expression. Or maybe it can't. I don't know, because I have not investigated it. It's premature to simply assume it MUST be slower. 2) Consider the simple case, /A|B/. In your overall longest/shortest, guaranteed worst-case time, both submatch A and submatch B must be calculated, and then the lengths of their matches both be compared. Perl, fortunately, does not do that. Rather, the first one in that sequence wins. That means that under the current scheme, the patterns /A|B/ and /B|A/ have different semantics. Under your worst-case scheme, they do not. Because /A|B/ and /B|A/ mean something different, more expressivity is provided. This is the same scenario, albeit expressed slightly differently, as your situation. The issues that manifest in both are equivalent. Then that is another semantic anomaly, because the alternation is supposed to mean that one or the other pattern matches. Logically, "or" should be commutative. If they're not quite, that's another disconnect between the high-level model and the implementation. Maybe /A|B/ and /B|A/ _do_ mean different things to the engine, but they do NOT mean different things in the high-level semantics -- except when you force the high-level semantics to change by fiat, to match the implementation. 3) This leads to increased power. It's like the difference between a short-circuiting "or" and one that blindly plods ahead trying to figure something out even when all is for naught. Compare A&&B with A&B, for example. 
If A is 0, then B need not be computed, yet in the second version, one runs subexpression B nevertheless. If according to the rules of one particular system, patX and patY mean different things, whereas in a second system, they are completely interchangeable, then the first system can express nuances that the second one cannot. When you have more nuances, more expressivity, then you have more power, because you can say things you could not otherwise say. Why do C and its derivatives such as Perl have short-circuiting Boolean operators? Because in older languages, such as Fortran and Pascal, where you did not have them, one quickly found that this was cumbersome and annoying. It's a question of speed vs. correctness. Correctness is important, but occasionally a little incorrectness is worthwhile for increased speed. Short-circuiting in C
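Tom's point that /A|B/ and /B|A/ carry different meanings under first-alternative-wins semantics is easy to verify. An illustration using Python's `re`, which uses the same ordered-alternation rule as Perl:

```python
import re

# Alternation is ordered, not a commutative "or": at a given starting
# point, the first alternative that matches wins.
print(re.match(r'a|ab', "ab").group())   # 'a'  -- first alternative wins
print(re.match(r'ab|a', "ab").group())   # 'ab' -- reordering changes the match
```

A POSIX leftmost-longest engine would return 'ab' for both patterns, which is exactly the loss of expressivity (and the worst-case cost) Tom describes for overall-longest/shortest schemes.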
Perl 5's non-greedy matching can be TOO greedy!
Non-greedy matching is a very valuable Perl 5 regular expression feature that simplifies many regular expressions. However, early on I discovered what seems to be a failure of the mechanism -- matches were MORE greedy than necessary. I'm not sure if other people noticed this, or just failed to care about it. Maybe they would disagree with my interpretation or want to maintain backward-compatibility now that it's implemented in Perl 5 this way. I always meant to raise it for discussion long ago, but I never found the time to participate in the p5p list... The crux of the problem is that non-greedy qualifiers don't affect the "earliest match" behavior, which makes the matches more greedy than they really ought to be. Here is a simple example: (tested with perl 5.005_03)

$_ = "aaabbbcccdddeee";
($greedy)     = /(b.*d)/;   # "bbbcccddd" (correct)
($non_greedy) = /(b.*?d)/;  # "bbbcccd" (should be "bcccd"!)

I'm sure this will complicate the NFA for the regexp, but it seems like this really ought to be fixed, at least in Perl 6. (There's a good case to be made for fixing it in Perl 5, but people have ignored/missed it for this long already...) Does anyone disagree with the premise, and believe that "bbbcccd" is the CORRECT match for the non-greedy regexp above? For what it's worth, here's a quote from a Perl 5.005_03 "perlre" manpage: By default, a quantified subpattern is "greedy", that is, it will match as many times as possible (given a particular starting location) while still allowing the rest of the pattern to match. If you want it to match the minimum number of times possible, follow the quantifier with a "?". Note that the meanings don't change, just the "greediness": I don't believe that ".*?" matching "bbccc" above qualifies as "to match the minimum number of times possible", when it is possible to match only the "ccc" and still match the full regexp. 
Since the documentation makes no mention of earliest-match in this paragraph, I can only assume this is unintended behavior, but I'm asking to check my assumptions. Any devil's advocates out there who want to argue for the current behavior? Deven P.S. Meta question: Is this the right way to propose this, or should I write it up as an RFC and submit it first? (I'm not clear if the RFC is intended to be submitted out of the blue, or after initial discussion...)
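The behavior the message describes can be reproduced outside Perl 5. This is an illustration using Python's `re`, which follows the same leftmost-then-lazy rules for this pattern:

```python
import re

s = "aaabbbcccdddeee"

# Greedy: leftmost start, then longest.
greedy = re.search(r'b.*d', s).group()
print(greedy)       # 'bbbcccddd'

# Non-greedy: the start is still leftmost; only the end shrinks, so the
# match is longer than the shortest one in the string ('bcccd').
non_greedy = re.search(r'b.*?d', s).group()
print(non_greedy)   # 'bbbcccd'
```

This shows the complaint is not Perl-specific: any engine with "earliest match wins, then minimize" semantics behaves the same way.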
Re: Perl 5's non-greedy matching can be TOO greedy!
On Thu, 14 Dec 2000, Tom Christiansen wrote: Does anyone disagree with the premise, and believe that "bbbcccd" is the CORRECT match for the non-greedy regexp above? Yes. The Camel's regex chapter reads: You might say that eagerness holds priority over greed (or thrift). No question that's how it's been implemented. But WHY would anyone want such behavior? When is it beneficial? For what it's worth, here's a quote from a Perl 5.005_03 "perlre" manpage: By default, a quantified subpattern is "greedy", that is, it will match as many times as possible (given a particular starting location) while still allowing the rest of the pattern to match. If you want it to match the minimum number of times possible, follow the quantifier with a "?". Note that the meanings don't change, just the "greediness": I don't believe that ".*?" matching "bbccc" above qualifies as "to match the minimum number of times possible", when it is possible to match only the "ccc" and still match the full regexp. Since the documentation makes no mention of earliest-match in this paragraph, I can only assume this is unintended behavior, but I'm asking to check my assumptions. Any devil's advocates out there who want to argue for the current behavior? The simple story is this: Rule 1: Given two matches at *different* starting points, the one that occurs earlier wins. *OTHERWISE* Rule 2: Given two matches at the *same* starting points, the one that is longer wins. Or, more lengthily: [...] "exasperate" =~ /e(.*?)e/ # $1 now "xasp" [...] I didn't need the long-winded explanation, and I don't need help with understanding how that regexp matches what it does. I understand it perfectly well already. I'm no neophyte with regular expressions, even if Perl 5 does offer some regexp features I've never bothered to exploit... My point is that the current behavior, while reasonable, isn't quite right. It's very close, but not quite. 
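Tom's two rules can be seen at work in the Camel's "exasperate" example. An illustration in Python's `re`, which applies the same rules:

```python
import re

# Rule 1: the leftmost starting point wins, so both quantifiers anchor at
# the first 'e'. Rule 2 (greediness) then only decides the length.
print(re.search(r'e(.*?)e', "exasperate").group(1))  # 'xasp', not the later 'rat'
print(re.search(r'e(.*)e',  "exasperate").group(1))  # 'xasperat'
```

Both sides of the thread agree on this example; the dispute is only about cases like /b.*?d/ where a later start would give a shorter match with the same ending point.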
I thought the point of Perl 6 was to improve the language through feedback and involvement from the Perl community? That's what I'm trying to do: offer feedback on what I consider to be a subtle (and yes, fairly minor) design flaw in Perl 5's regular expressions. I was hoping we could have a rational debate about it, rather than propagating the same design flaw into Perl 6 through inertia or for legacy reasons. (Is there some reason that the Perl 5 regular expressions should be exempt from refinement?) The example I gave has 9 possible matches: "bbbcccd", "bbbcccdd", "bbbcccddd", "bbcccd", "bbcccdd", "bbcccddd", "bcccd", "bcccdd", "bcccddd". _All_ of these possible matches are OVERLAPPING matches. Obviously greedy matching should (and does) match the longest match, "bbbcccddd". My contention is that non-greedy matching should match the shortest match, "bcccd". Now, the "exasperate" example quoted above is a good one. It's also an example of overlapping matches. And I AGREE that "xasp" is what $1 should match in the quoted example, not the shorter, later "rat" match. Basically, I agree that eagerness SHOULD take priority, but not at the expense of making a "non-greedy" match gratuitously more greedy than necessary. [...] It's only when multiple possible matches start at the same point that you use minimal or maximal matching to break the tie. If the starting points differ, there's no tie to break. This is the point I take issue with. In my original example, "bbbcccd" had an earlier starting point than "bcccd", and I believe that WAS a tie to break with non-greedy matching, to prefer "bcccd" over "bbbcccd". After considering the interesting example of matching "exasperate", I've got a more precise formulation to propose: "tie-breaking" should be done when potential matches have the same starting OR ending points. Since the "rat" match in "exasperate" would require pushing the ending point back further, I'm in agreement that "xasp" is the correct match. 
However, "bcccd" had the SAME ending point as "bbbcccd", so considering "bbbcccd" a better match is really VERY arbitrary, and I believe flawed. I might also note that this is probably a subtle effect of the fact that Perl and other regular expression systems are based on English, therefore making leftmost matching seem more "natural". This is a cultural bias, which would be different if English were a right-to-left language like Hebrew or Arabic. If it were, you can bet that we'd be arguing over whether "bcccd" or "bcccddd" was the proper match in my example! In fact, this brings to mind another possibility to consider for the Perl 6 regular expression engine. It would probably be helpful for international purposes to support rightmost matching (as a modifier), since it would probably be helpful
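The rightmost-matching idea can be sketched by the same reversal trick discussed elsewhere in the thread. This is a hypothetical emulation, not a proposed implementation, and it only works for patterns simple enough to reverse by hand:

```python
import re

s = "aaabbbcccdddeee"

def rightmost(pattern_reversed, s):
    """Emulate rightmost matching: run a hand-reversed pattern over the
    reversed string, then reverse the matched text back."""
    m = re.search(pattern_reversed, s[::-1])
    return m.group()[::-1] if m else None

# Leftmost non-greedy today:
print(re.search(r'b.*?d', s).group())   # 'bbbcccd'
# Rightmost non-greedy ('b.*?d' hand-reversed is 'd.*?b'):
print(rightmost(r'd.*?b', s))           # 'bcccddd'
```

The mirror-image anomaly appears immediately: the rightmost non-greedy match is 'bcccddd', still longer than the shortest match 'bcccd', which is the symmetry the message's cultural-bias argument points at.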
Re: Perl 5's non-greedy matching can be TOO greedy!
On Thu, 14 Dec 2000, Jeff Pinyan wrote: On Dec 14, Deven T. Corzine said: You're asking for something like /(?!b)(b.*?d)/ which is an "optimization" you'll have to incorporate on your own. Thanks for the example. Unfortunately, your attempted workaround doesn't even work for the example string; the "a" preceding the "bbbcccd" match isn't a "b", so the regexp engine is still perfectly happy with the same match. Even if it worked as you intended, it would have failed with something like "bbbabbb", since the ".*?" would happily match "bbabbb"... Sorry, I was thinking backwards (when I was supposed to think forwards). Using: /(b(?!b).*?d)/ is what I'd meant to say. True, this one works, and I should have thought of it. Unlike the wrong example of /(b[^b].*?d)/, using the zero-width lookahead assertion fixes the problem for this regexp, even for matches of "bd". (Again, not all regexps will be as simple.) I wouldn't call the current behavior a design flaw, though. I do. From an implementation standpoint, it may have been more convenient than getting the semantics EXACTLY right, and the differences are subtle enough that it's debatable how much effort is worth expending on fixing the problem. That's why I said it was "understandable", but it remains a flaw. I haven't even SEEN an example where the current behavior is actually preferable to my proposed behavior, have you? (And I'd expect at least a FEW, though I suspect there are probably more counterexamples.) This is certainly a minor issue, but I believe the design is flawed in the current Perl 5 implementation; why not fix the design properly for Perl 6? Deven
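Jeff's corrected lookahead workaround, and its pattern-specific limits, can both be demonstrated. An illustration with Python's `re` (same lookahead semantics as Perl 5 here); the second test string is my own variant of Deven's "bbbabbb" counterexample, with a trailing "d" assumed so the pattern can match:

```python
import re

s = "aaabbbcccdddeee"

# b(?!b) forces the match to start at the LAST b of a run, so the
# non-greedy match can no longer be dragged leftward across the b's.
print(re.search(r'b(?!b).*?d', s).group())          # 'bcccd'

# But the fix is pattern-specific: with a lone character splitting the
# b-run, .*? happily swallows it plus more b's, giving 'babbbd' rather
# than the shortest match 'bd'.
print(re.search(r'b(?!b).*?d', "bbbabbbd").group())  # 'babbbd'
```

This is the "not all regexps will be as simple" caveat in action: the lookahead encodes knowledge of this particular pattern, not a general shortest-match rule.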