Re: hotplug regexes, other misc regex questions

2002-09-21 Thread Steve Fink

On Mon, Sep 16, 2002 at 10:32:17AM +0300, Markus Laire wrote:
 On 15 Sep 2002 at 22:41, Steve Fink wrote:
 
 Your code seems to backtrack to the beginning at every failure. First 
 code only backtracks one char at time.
  Huh? What implementation is that? I think my naive implementation
  gives
  
  111
  112
  11
  121
  122
  12
  1
  211
  212
  21
  221
  222
  22
  2
  
 

Backtracking needs to be done on the state machine, not on the input
string. For any pattern that can be reduced to a DFA, they will have
equivalent output, but consider something like (unoptimized)

 aax =~ /(a|aa)x/

You try to match 'a' and get it. You try to match 'x' and fail. You
are at a.ax in your input string. Backing up in the input string
doesn't help. You need to backtrack to the next possibility of the
previous match operation, in this case to the aa of the a|aa.

My code is not backtracking to the beginning. It is backtracking to
the previous possibility in the NFA. So /(a|a)*x/ matches (a|a) as
many times as possible, then moves on to match the x. On failure, it
undoes one match of (a|a), tries to match the (a|a) a different way,
and if it succeeds matches as many times as possible again before
trying the x. If (a|a) cannot match a different way, it tries to
continue on to the x without matching that last (a|a) at all. Finally,
if that fails, it undoes another match of the (a|a) and repeats.

Programmatically, the implementation of R* looks something like:

  loop: match R or goto next
goto loop

  back: if we have at least one R match left on the stack, goto R.back
otherwise, fail (goto last backtrack point)

  next: ...(in this case, match x or goto back)

where R.back is the backtrack point for a successful (a|a) match (it
behaves like a continuation, so it is as if the match R call returns
another result.) R is assumed to push its backtrack information on the
stack if it succeeds, or to leave the stack untouched if it fails.
R.back is assumed to clean up any previous backtrack state of R and
change it to either a new backtrack state or clean it up, depending on
whether it can succeed in another way.



Re: hotplug regexes, other misc regex questions

2002-09-21 Thread Steve Fink

On Wed, Sep 18, 2002 at 05:01:35PM +0200, Damian Conway wrote:
 Steve Fink wrote:
 
 What possible outputs are legal for this:
 
   aaa =~ /( a { print 1 } | a { print 2 })* { print \n } x/
 
 Unless Larry specifies a required semantics, there are potentially very
 many acceptable outputs from this, depending on implementation.
 
 Therefore, your implementation must print the superposition of all possible
 outputs. ;-)

It does. In fact, I wrote the superposition of all possible sequences
of outputs in my email. Don't tell me you actually _read_ my question
before answering it? You, of all people, should know the dangers of
observing these things! Just out of curiousity, which one did it
collapse to?

Superpositions don't seem to work that well as a building block for
our irregular expressions. Making a* map to ANY(,a,aa,aaa,...) loses
some rather important ordering information. Let's see... what if you
makes an OANY (Ordered ANY) so that

 R* - OANY(..., aaa..., aa..., ...)
 R*? - OANY(,a,aa,aaa,,...)
 R|S - OANY(R,S)

Then if you optimize by converting as many OANY's to ANY's as you
can, would it then be correct to take maximal subtrees of ANY's and
implement them with DFAs?

Bleh. Why bother? The OANY - ANY decision is no easier than doing it
directly on the parse tree, I think.

Oops, gotta go. The nice men in the white coats are back.



Re: hotplug regexes, other misc regex questions

2002-09-20 Thread Larry Wall

On Sun, 15 Sep 2002, Steve Fink wrote:
: What should this do:
: 
:   my $x = the letter x;
:   print yes if $x =~ /the { $x .= ! } .* !/;

Depends.  I think it may be necessary for speed and safety reasons
to set COW on the string we're matching, so that you're always matching
against the original string.  Perhaps we can make it work in the specific
case of concatenation, since we want to it to work for extensible
strings coming from a filehandle.  But if a fast implementation needs to
keep pointers into a string rather than offsets from the beginning,
we're asking for core dumps if the string is modified out from under the
pointers, or we have to adjust all known pointers any time the string
may be modified.

: Does this print yes?
: 
:   print yes if helo =~ /hel { .pos-- } lo/;

Yes, that isn't modifying the string.

: Would it be correct for this to print 0? Would it be correct for this
: to print 2?
: 
:   my $n = 0;
:   aargh =~ /a* { $n++ } aargh/;
:   print $n;
: 
: It might want to print zero because if regexes are not required to pay
: attention to modifications in the input string, then it can optimize
: away the a*.

Implementation dependent, but there will probably be a :modifier to
specifically turn off optimizations.

: What possible outputs are legal for this:
: 
:   aaa =~ /( a { print 1 } | a { print 2 })* { print \n } x/

Lots.  :-)

Larry




Re: hotplug regexes, other misc regex questions

2002-09-20 Thread Sean O'Rourke

On Fri, 20 Sep 2002, Larry Wall wrote:
 But if a fast implementation needs to keep pointers into a string
 rather than offsets from the beginning, we're asking for core dumps if
 the string is modified out from under the pointers, or we have to
 adjust all known pointers any time the string may be modified.

With the current Parrot GC, keeping pointers into the string while doing
unrelated allocation will get you a core dump, since the string body might
be copied.  So unless the regex engine copies strings off into its own
private non-collected storage, we're stuck with offsets anyways.

/s




Re: hotplug regexes, other misc regex questions

2002-09-20 Thread Larry Wall

On Fri, 20 Sep 2002, Sean O'Rourke wrote:
: On Fri, 20 Sep 2002, Larry Wall wrote:
:  But if a fast implementation needs to keep pointers into a string
:  rather than offsets from the beginning, we're asking for core dumps if
:  the string is modified out from under the pointers, or we have to
:  adjust all known pointers any time the string may be modified.
: 
: With the current Parrot GC, keeping pointers into the string while doing
: unrelated allocation will get you a core dump, since the string body might
: be copied.  So unless the regex engine copies strings off into its own
: private non-collected storage, we're stuck with offsets anyways.

That's fine, if it's a constraint.  I had thought perhaps COW would
allow a locked-down copy to work with without forcing unnecessary
copying, but I suppose it doesn't hook into GC at that level.  I'd also
hoped it would solve any $-style inefficiencies.  But hey, that's not
my job this time around...  :-)

Larry




Re: hotplug regexes, other misc regex questions

2002-09-19 Thread Damian Conway

Josh Jore wrote:

Would it be correct for this to print 0? Would it be correct for this
to print 2?

  my $n = 0;
  aargh =~ /a* { $n++ } aargh/;
  print $n;

Yes. ;-)
 
 Wouldn't that print 2 if $n is lexical 

Err. It *is* lexical in this example.

 and 0 if it's localized? 

No. Without the Cmy it would still print either 0 or 2, depending
on the implementation/optimization.


  Or are lexicals localized now?

They can be. But this example C$n isn't.
(Just because it's used in a nested closure doesn't mean it's
localized within the pattern).


What possible outputs are legal for this:

  aaa =~ /( a { print 1 } | a { print 2 })* { print \n } x/

 
 I take it that what I've learned from _Mastering_Regular_Expressions_
 doesn't quite apply here?  From that interpretation I'd think it'd print
 111\n since the second part of the alternation wouldn't be tried.

No. It would fail to match the final Cx in the pattern and start
backtracking.

Damian




Re: hotplug regexes, other misc regex questions

2002-09-19 Thread Josh Jore

On Wed, 18 Sep 2002, Luke Palmer wrote:
 On Wed, 18 Sep 2002, Josh Jore wrote:
  On Wed, 18 Sep 2002, Damian Conway wrote:
What possible outputs are legal for this:
   
  aaa =~ /( a { print 1 } | a { print 2 })* { print \n } x/
 
  I take it that what I've learned from _Mastering_Regular_Expressions_
  doesn't quite apply here? From that interpretation I'd think it'd print
  111\n since the second part of the alternation wouldn't be tried.

 The first time through, yes.  But then it tries to match the x, and says
 oops, that's not what I have and backtracks.  That tries the second of
 the alternation, which doesn't work either.  So it backtracks so the * is
 only getting two of the first one, et cetera...

 Or are you talking about something else from Mastering Regular
 Expressions?  Like some kind of optimization that happens?

I missed the trailing 'x/' since my perl5 eyes read that as '/x'. My bad.

Joshua b. Jore -{ weird geeky madness }- http://www.greentechnologist.org





Re: hotplug regexes, other misc regex questions

2002-09-18 Thread Damian Conway

Steve Fink wrote:

 What should this do:
 
   my $x = the letter x;
   print yes if $x =~ /the { $x .= ! } .* !/;
 
 Does this print yes?

If it's allowed at all, I think the match should succeed.


   print yes if helo =~ /hel { .pos-- } lo/;

This definitely has to work. But remember the call to Cpos is on
the match object (i.e. $0), not the string.


 Would it be correct for this to print 0? Would it be correct for this
 to print 2?
 
   my $n = 0;
   aargh =~ /a* { $n++ } aargh/;
   print $n;

Yes. ;-)


 What possible outputs are legal for this:
 
   aaa =~ /( a { print 1 } | a { print 2 })* { print \n } x/

Unless Larry specifies a required semantics, there are potentially very
many acceptable outputs from this, depending on implementation.

Therefore, your implementation must print the superposition of all possible
outputs. ;-)

Actually, I would expect that *any* pattern with closures in it should act as
though it were trying each branch and loop in the normal sequence (even if
it optimizes that sequence away (which probably means it can't do that
optimization in the first place (which means it should act as though it were
trying each branch and loop in the normal sequence %-))).

Of course, LMMV.

Damian




Re: hotplug regexes, other misc regex questions

2002-09-18 Thread Josh Jore

On Wed, 18 Sep 2002, Damian Conway wrote:

  Would it be correct for this to print 0? Would it be correct for this
  to print 2?
 
my $n = 0;
aargh =~ /a* { $n++ } aargh/;
print $n;

 Yes. ;-)

Wouldn't that print 2 if $n is lexical and 0 if it's localized? Or are
lexicals localized now?

  What possible outputs are legal for this:
 
aaa =~ /( a { print 1 } | a { print 2 })* { print \n } x/

I take it that what I've learned from _Mastering_Regular_Expressions_
doesn't quite apply here? From that interpretation I'd think it'd print
111\n since the second part of the alternation wouldn't be tried.

Joshua b. Jore
http://www.greentechnologist.org




Re: hotplug regexes, other misc regex questions

2002-09-18 Thread Luke Palmer

On Wed, 18 Sep 2002, Josh Jore wrote:

 On Wed, 18 Sep 2002, Damian Conway wrote:
 
   Would it be correct for this to print 0? Would it be correct for this
   to print 2?
  
 my $n = 0;
 aargh =~ /a* { $n++ } aargh/;
 print $n;
 
  Yes. ;-)
 
 Wouldn't that print 2 if $n is lexical and 0 if it's localized? Or are
 lexicals localized now?

Well, { $n++ } is within the lexical scope of $n, so it doesn't matter.  
What matters is whether $n++ was hypotheticalized like so:

aargh =~ /a* { let $n++ } aargh/  # can it work that way?

Then it would either print 1 or 0, because if it backtracked, the ++ would 
be undone.  If the change is adopted that you can't optimize when there's 
a closure in the middle of the optimization, it would print 1.

   What possible outputs are legal for this:
  
 aaa =~ /( a { print 1 } | a { print 2 })* { print \n } x/
 
 I take it that what I've learned from _Mastering_Regular_Expressions_
 doesn't quite apply here? From that interpretation I'd think it'd print
 111\n since the second part of the alternation wouldn't be tried.

The first time through, yes.  But then it tries to match the x, and says 
oops, that's not what I have and backtracks.  That tries the second of 
the alternation, which doesn't work either.  So it backtracks so the * is 
only getting two of the first one, et cetera...

Or are you talking about something else from Mastering Regular 
Expressions?  Like some kind of optimization that happens?

Luke