Re: hotplug regexes, other misc regex questions
On Mon, Sep 16, 2002 at 10:32:17AM +0300, Markus Laire wrote: On 15 Sep 2002 at 22:41, Steve Fink wrote: Your code seems to backtrack to the beginning at every failure. First code only backtracks one char at time. Huh? What implementation is that? I think my naive implementation gives 111 112 11 121 122 12 1 211 212 21 221 222 22 2 Backtracking needs to be done on the state machine, not on the input string. For any pattern that can be reduced to a DFA, they will have equivalent output, but consider something like (unoptimized) aax =~ /(a|aa)x/ You try to match 'a' and get it. You try to match 'x' and fail. You are at a.ax in your input string. Backing up in the input string doesn't help. You need to backtrack to the next possibility of the previous match operation, in this case to the aa of the a|aa. My code is not backtracking to the beginning. It is backtracking to the previous possibility in the NFA. So /(a|a)*x/ matches (a|a) as many times as possible, then moves on to match the x. On failure, it undoes one match of (a|a), tries to match the (a|a) a different way, and if it succeeds matches as many times as possible again before trying the x. If (a|a) cannot match a different way, it tries to continue on to the x without matching that last (a|a) at all. Finally, if that fails, it undoes another match of the (a|a) and repeats. Programmatically, the implementation of R* looks something like: loop: match R or goto next goto loop back: if we have at least one R match left on the stack, goto R.back otherwise, fail (goto last backtrack point) next: ...(in this case, match x or goto back) where R.back is the backtrack point for a successful (a|a) match (it behaves like a continuation, so it is as if the match R call returns another result.) R is assumed to push its backtrack information on the stack if it succeeds, or to leave the stack untouched if it fails. R.back is assumed to clean up any previous backtrack state of R and change it to either a new backtrack state or clean it up, depending on whether it can succeed in another way.
Re: hotplug regexes, other misc regex questions
On Wed, Sep 18, 2002 at 05:01:35PM +0200, Damian Conway wrote: Steve Fink wrote: What possible outputs are legal for this: aaa =~ /( a { print 1 } | a { print 2 })* { print \n } x/ Unless Larry specifies a required semantics, there are potentially very many acceptable outputs from this, depending on implementation. Therefore, your implementation must print the superposition of all possible outputs. ;-) It does. In fact, I wrote the superposition of all possible sequences of outputs in my email. Don't tell me you actually _read_ my question before answering it? You, of all people, should know the dangers of observing these things! Just out of curiousity, which one did it collapse to? Superpositions don't seem to work that well as a building block for our irregular expressions. Making a* map to ANY(,a,aa,aaa,...) loses some rather important ordering information. Let's see... what if you makes an OANY (Ordered ANY) so that R* - OANY(..., aaa..., aa..., ...) R*? - OANY(,a,aa,aaa,,...) R|S - OANY(R,S) Then if you optimize by converting as many OANY's to ANY's as you can, would it then be correct to take maximal subtrees of ANY's and implement them with DFAs? Bleh. Why bother? The OANY - ANY decision is no easier than doing it directly on the parse tree, I think. Oops, gotta go. The nice men in the white coats are back.
Re: hotplug regexes, other misc regex questions
On Sun, 15 Sep 2002, Steve Fink wrote: : What should this do: : : my $x = the letter x; : print yes if $x =~ /the { $x .= ! } .* !/; Depends. I think it may be necessary for speed and safety reasons to set COW on the string we're matching, so that you're always matching against the original string. Perhaps we can make it work in the specific case of concatenation, since we want to it to work for extensible strings coming from a filehandle. But if a fast implementation needs to keep pointers into a string rather than offsets from the beginning, we're asking for core dumps if the string is modified out from under the pointers, or we have to adjust all known pointers any time the string may be modified. : Does this print yes? : : print yes if helo =~ /hel { .pos-- } lo/; Yes, that isn't modifying the string. : Would it be correct for this to print 0? Would it be correct for this : to print 2? : : my $n = 0; : aargh =~ /a* { $n++ } aargh/; : print $n; : : It might want to print zero because if regexes are not required to pay : attention to modifications in the input string, then it can optimize : away the a*. Implementation dependent, but there will probably be a :modifier to specifically turn off optimizations. : What possible outputs are legal for this: : : aaa =~ /( a { print 1 } | a { print 2 })* { print \n } x/ Lots. :-) Larry
Re: hotplug regexes, other misc regex questions
On Fri, 20 Sep 2002, Larry Wall wrote: But if a fast implementation needs to keep pointers into a string rather than offsets from the beginning, we're asking for core dumps if the string is modified out from under the pointers, or we have to adjust all known pointers any time the string may be modified. With the current Parrot GC, keeping pointers into the string while doing unrelated allocation will get you a core dump, since the string body might be copied. So unless the regex engine copies strings off into its own private non-collected storage, we're stuck with offsets anyways. /s
Re: hotplug regexes, other misc regex questions
On Fri, 20 Sep 2002, Sean O'Rourke wrote: : On Fri, 20 Sep 2002, Larry Wall wrote: : But if a fast implementation needs to keep pointers into a string : rather than offsets from the beginning, we're asking for core dumps if : the string is modified out from under the pointers, or we have to : adjust all known pointers any time the string may be modified. : : With the current Parrot GC, keeping pointers into the string while doing : unrelated allocation will get you a core dump, since the string body might : be copied. So unless the regex engine copies strings off into its own : private non-collected storage, we're stuck with offsets anyways. That's fine, if it's a constraint. I had thought perhaps COW would allow a locked-down copy to work with without forcing unnecessary copying, but I suppose it doesn't hook into GC at that level. I'd also hoped it would solve any $-style inefficiencies. But hey, that's not my job this time around... :-) Larry
Re: hotplug regexes, other misc regex questions
Josh Jore wrote: Would it be correct for this to print 0? Would it be correct for this to print 2? my $n = 0; aargh =~ /a* { $n++ } aargh/; print $n; Yes. ;-) Wouldn't that print 2 if $n is lexical Err. It *is* lexical in this example. and 0 if it's localized? No. Without the Cmy it would still print either 0 or 2, depending on the implementation/optimization. Or are lexicals localized now? They can be. But this example C$n isn't. (Just because it's used in a nested closure doesn't mean it's localized within the pattern). What possible outputs are legal for this: aaa =~ /( a { print 1 } | a { print 2 })* { print \n } x/ I take it that what I've learned from _Mastering_Regular_Expressions_ doesn't quite apply here? From that interpretation I'd think it'd print 111\n since the second part of the alternation wouldn't be tried. No. It would fail to match the final Cx in the pattern and start backtracking. Damian
Re: hotplug regexes, other misc regex questions
On Wed, 18 Sep 2002, Luke Palmer wrote: On Wed, 18 Sep 2002, Josh Jore wrote: On Wed, 18 Sep 2002, Damian Conway wrote: What possible outputs are legal for this: aaa =~ /( a { print 1 } | a { print 2 })* { print \n } x/ I take it that what I've learned from _Mastering_Regular_Expressions_ doesn't quite apply here? From that interpretation I'd think it'd print 111\n since the second part of the alternation wouldn't be tried. The first time through, yes. But then it tries to match the x, and says oops, that's not what I have and backtracks. That tries the second of the alternation, which doesn't work either. So it backtracks so the * is only getting two of the first one, et cetera... Or are you talking about something else from Mastering Regular Expressions? Like some kind of optimization that happens? I missed the trailing 'x/' since my perl5 eyes read that as '/x'. My bad. Joshua b. Jore -{ weird geeky madness }- http://www.greentechnologist.org
Re: hotplug regexes, other misc regex questions
Steve Fink wrote: What should this do: my $x = the letter x; print yes if $x =~ /the { $x .= ! } .* !/; Does this print yes? If it's allowed at all, I think the match should succeed. print yes if helo =~ /hel { .pos-- } lo/; This definitely has to work. But remember the call to Cpos is on the match object (i.e. $0), not the string. Would it be correct for this to print 0? Would it be correct for this to print 2? my $n = 0; aargh =~ /a* { $n++ } aargh/; print $n; Yes. ;-) What possible outputs are legal for this: aaa =~ /( a { print 1 } | a { print 2 })* { print \n } x/ Unless Larry specifies a required semantics, there are potentially very many acceptable outputs from this, depending on implementation. Therefore, your implementation must print the superposition of all possible outputs. ;-) Actually, I would expect that *any* pattern with closures in it should act as though it were trying each branch and loop in the normal sequence (even if it optimizes that sequence away (which probably means it can't do that optimization in the first place (which means it should act as though it were trying each branch and loop in the normal sequence %-))). Of course, LMMV. Damian
Re: hotplug regexes, other misc regex questions
On Wed, 18 Sep 2002, Damian Conway wrote: Would it be correct for this to print 0? Would it be correct for this to print 2? my $n = 0; aargh =~ /a* { $n++ } aargh/; print $n; Yes. ;-) Wouldn't that print 2 if $n is lexical and 0 if it's localized? Or are lexicals localized now? What possible outputs are legal for this: aaa =~ /( a { print 1 } | a { print 2 })* { print \n } x/ I take it that what I've learned from _Mastering_Regular_Expressions_ doesn't quite apply here? From that interpretation I'd think it'd print 111\n since the second part of the alternation wouldn't be tried. Joshua b. Jore http://www.greentechnologist.org
Re: hotplug regexes, other misc regex questions
On Wed, 18 Sep 2002, Josh Jore wrote: On Wed, 18 Sep 2002, Damian Conway wrote: Would it be correct for this to print 0? Would it be correct for this to print 2? my $n = 0; aargh =~ /a* { $n++ } aargh/; print $n; Yes. ;-) Wouldn't that print 2 if $n is lexical and 0 if it's localized? Or are lexicals localized now? Well, { $n++ } is within the lexical scope of $n, so it doesn't matter. What matters is whether $n++ was hypotheticalized like so: aargh =~ /a* { let $n++ } aargh/ # can it work that way? Then it would either print 1 or 0, because if it backtracked, the ++ would be undone. If the change is adopted that you can't optimize when there's a closure in the middle of the optimization, it would print 1. What possible outputs are legal for this: aaa =~ /( a { print 1 } | a { print 2 })* { print \n } x/ I take it that what I've learned from _Mastering_Regular_Expressions_ doesn't quite apply here? From that interpretation I'd think it'd print 111\n since the second part of the alternation wouldn't be tried. The first time through, yes. But then it tries to match the x, and says oops, that's not what I have and backtracks. That tries the second of the alternation, which doesn't work either. So it backtracks so the * is only getting two of the first one, et cetera... Or are you talking about something else from Mastering Regular Expressions? Like some kind of optimization that happens? Luke