Re: RFC 308 (v1) Ban Perl hooks into regexes

2000-09-25 Thread Mark-Jason Dominus


I think the proposal that Joe McMahon and I are finishing up now will
make these obsolete anyway.




Re: RFC 308 (v1) Ban Perl hooks into regexes

2000-09-25 Thread Mark-Jason Dominus


> On Mon, Sep 25, 2000 at 08:56:47PM +, Mark-Jason Dominus wrote:
> > I think the proposal that Joe McMahon and I are finishing up now will
> > make these obsolete anyway.
>
> Good! The less I have to maintain the better...

Sorry, I meant that it would make (??...) and (?{...}) obsolete, not
that it will make your RFC obsolete.  Our proposal is agnostic about
whether (??...) and (?{...}) should be eliminated.




Re: Perlstorm #0040

2000-09-23 Thread Mark-Jason Dominus


> I lie: the other reason qr{} currently doesn't behave like that is that
> when we interpolate a compiled regexp into a context that requires it be
> recompiled,

Interpolated qr() items shouldn't be recompiled anyway.  They should
be treated as subroutine calls.  Unfortunately, this requires a
reentrant regex engine, which Perl doesn't have.  But I think it's the
right way to go, and it would solve the backreference problem, as well
as many other related problems.




Re: RFC 166 (v2) Alternative lists and quoting of things

2000-09-15 Thread Mark-Jason Dominus


> (?Q$foo) Quotes the contents of the scalar $foo - equivalent to
> (??{ quotemeta $foo }).

How is this different from

\Q$foo\E

? 
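
To make the comparison concrete, here is a minimal sketch showing that \Q...\E already quotes an interpolated scalar the same way quotemeta does. The sample value and the two sub names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample value containing regex metacharacters.
my $foo = 'a.b*c';

# \Q...\E quotes the interpolated value in place.
sub matches_with_Q {
    my ($string) = @_;
    return $string =~ /^\Q$foo\E\z/ ? 1 : 0;
}

# The same thing done by hand with quotemeta.
my $quoted = quotemeta $foo;
sub matches_with_quotemeta {
    my ($string) = @_;
    return $string =~ /^$quoted\z/ ? 1 : 0;
}

for my $candidate ('a.b*c', 'aXbbbc') {
    print "$candidate: ", matches_with_Q($candidate), " ",
          matches_with_quotemeta($candidate), "\n";
}
```

Both subs agree on every input, which is the point of the question above.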



Re: RFC 72 (v1) The regexp engine should go backward as well as forward.

2000-09-11 Thread Mark-Jason Dominus


> Simply put, I want variable-length lookbehind.

Why didn't you simply propose that the (?<=...) operator be fixed to
support variable-length expressions?  Why so much additional machinery?




Re: $& and copying: rfc 158 (was Re: RFC 110 (v3) counting matches)

2000-09-11 Thread Mark-Jason Dominus


> > in any case, i think we have a fair agreement on rfc 158 and i will
> > freeze it if there is no further comments on it.
>
> I think you should remove the parts of your proposal about making $& be
> autolocalized.

If you're not planning to revise your RFC, let me know so that I can
ask the librarian to mark it as withdrawn.




Re: XML/HTML-specific ?< and ?> operators?

2000-09-11 Thread Mark-Jason Dominus


> : it looks worse and dumps core.
>
> That's because the first non-paren forces it to recurse into the
> second branch until you hit REG_INFTY or overflow the stack. Swap
> second and third branches and you have a better chance:

I think something else goes wrong there too.  


>   $re = qr{...}
> (I haven't checked that there aren't other problems with it, though.)

Try this:

"(x)(y)" =~ /^$re$/;

This should match, but it dumps core.  I don't think there is infinite
recursion, although I might be mistaken.

Anyway, Snobol has a nice heuristic to prevent infinite recursion in
cases like this, but I'm not sure it's applicable to the way the Perl
regex engine works.  I will think about it.




Re: XML/HTML-specific ?< and ?> operators?

2000-09-11 Thread Mark-Jason Dominus


> :Anyway, Snobol has a nice heuristic to prevent infinite recursion in
> :cases like this, but I'm not sure it's applicable to the way the Perl
> :regex engine works.  I will think about it.
>
> It is probably worth adding the heuristic above: anytime you recurse
> into the same re at the same position, there is an infinite loop.


That is basically it, except that in snobol it is inside out:  Each
recursively interpolated pattern is assumed to match a string of at
least length 1, and if the remaining part of the target string isn't
sufficiently long to match the rest of the pattern after recursion,
then the recursion is skipped.




Re: What's in a Regex (was RFC 145)

2000-09-07 Thread Mark-Jason Dominus


> > 2. Many people - including Larry - have voiced their desire
> >    to see =~ die a horrible death
>
> Please provide a look-up-able reference to Larry's saying that he
> wanted =~ to die a horrible death.

Larry said:

# Well, the fact is, I've been thinking about possible ways to get rid
# of =~ for some time now, so I certainly don't mind brainstorming in
# this direction.

That is in 
[EMAIL PROTECTED]

which is archived at 

http://www.mail-archive.com/perl6-language-regex@perl.org/msg3.html

I think Nathan was exaggerating here, but maybe he knows something I don't.




Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-06 Thread Mark-Jason Dominus


> ...My point is that I think we're approaching this
> the wrong way.  We're trying to apply more and more parser power into what
> classically has been the lexer / tokenizer, namely our beloved
> regular-expression engine.

I've been thinking the same thing.  It seems to me that the attempts to
shoehorn parsers into regex syntax have either been unsuccessful
(yielding an underpowered extension) or illegible or both.

An approach that appears to have been more successful is to find ways
to integrate regexes *into* parser code more effectively.  Damian
Conway's Parse::RecDescent module does this, and so does SNOBOL.

In SNOBOL, if you want to write a pattern that matches balanced
parentheses, it's easy and straightforward and legible:

parenstring = '(' *parenstring ')'  
| *parenstring *parenstring
| span('()')


(span('()') is like [^()]* in Perl.)

The solution in Parse::RecDescent is similar.

Compare this with the solutions that work now:

 # man page solution
 $re = qr{
           \(
           (?:
              (?> [^()]+ )    # Non-parens without backtracking
            |
              (??{ $re })     # Group with matching parens
           )*
           \)
         }x;

This is not exactly the same, but I tried a direct translation:

 $re = qr{ \( (??{$re}) \)
         | (??{$re}) (??{$re})
         | (?> [^()]+)
         }x;

and it looks worse and dumps core.  

This works:

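qr{
  ^
  (?{ local $d=0 })       # $d is the current nesting depth
  (?:
      \(
      (?{$d++})
   |
      \)
      (?{$d--})
      (?(?{$d<0})         # if the depth went negative...
        (?!)              # ...fail
      )
   |
      (?> [^()]* )
  )*
  (?(?{$d!=0})            # if parens are still open at the end...
    (?!)                  # ...fail
  )
  $
}x;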

but it's rather difficult to take seriously.
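
For comparison, here is a minimal sketch of the same depth-counting idea with the bookkeeping moved out of the regex and into ordinary Perl. The sub name is_balanced is made up for illustration, and this is only one way to do it, not a proposal.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the parentheses in $s are balanced.
sub is_balanced {
    my ($s) = @_;
    my $depth = 0;
    # The regex only tokenizes; Perl code keeps the count.
    while ($s =~ /([()])/g) {
        if ($1 eq '(') {
            $depth++;
        }
        else {
            # A close paren with nothing open can never be repaired.
            return 0 if --$depth < 0;
        }
    }
    return $depth == 0 ? 1 : 0;
}

for my $candidate ('(x)(y)', '(a(b)c)', '(()', ')(') {
    print "$candidate: ", is_balanced($candidate), "\n";
}
```

The regex does the scanning and the surrounding code does the counting, which is the division of labor argued for above.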

The solution proposed in the recent RFC 145:

/([^\m]*)(\m)(.*?)(\M)([^\m\M]*)/g

is not a lot better.  David Corbin's alternative looks about the same.

On a different topic from the same barrel, we just got a proposal that
([23,39]) should match only numbers between 23 and 39.  It seems to me
that rather than trying to shoehorn one special-purpose syntax after
another into the regex language, which is already overloaded, it
would be better to try to integrate regex matching better with Perl
itself.  Then you could use regular Perl code to control things like
numeric ranges.  

Note that at present, you can get the effect of [(23,39)] by writing
this:

(\d+)(?(?{$1 < 23 || $1 > 39})(?!))

which isn't pleasant to look at, but I think it points in the right
direction, because it is a lot more flexible than [(23,39)].  If you
need to fix it to match 23.2 but not 39.5, it is straightforward to do
that:  
  
(\d+(\.\d*)?)(?(?{$1 < 23 || $1 > 39})(?!))

The [(23,39)] notation, however, is doomed.  All you can do is
propose Yet Another Extension for Perl 7.

The big problem with 

(\d+)(?(?{$1 < 23 || $1 > 39})(?!))

is that it is hard to read and understand.
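
As a rough illustration of letting regular Perl code control the numeric range, here is a sketch in which the regex only finds the numbers and an ordinary comparison does the range test. The sub name numbers_in_range and the sample string are assumptions for the example. It is not a drop-in replacement for the embedded-code version, because it tests whole numbers only and never backtracks into part of one.

```perl
use strict;
use warnings;

# Hypothetical helper: return every number in $text that lies in [$lo, $hi].
sub numbers_in_range {
    my ($text, $lo, $hi) = @_;
    my @found;
    # The regex finds candidate numbers; plain Perl decides which to keep.
    while ($text =~ /(\d+(?:\.\d*)?)/g) {
        my $n = $1;
        push @found, $n if $n >= $lo && $n <= $hi;
    }
    return @found;
}

print join(',', numbers_in_range('17 23.2 39 39.5 40', 23, 39)), "\n";
# prints "23.2,39"
```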

The real problem here is that regexes are single strings.  When you
try to compress a programming language into a single string this way,
you end up with something that looks like Befunge or TECO.  We are
going in the same direction here.

Suppose there were an alternative syntax for regexes that did *not*
require that everything be compressed into a single string?  Rather
than trying to pack all of Perl into the regex syntax, bit by bit,
using ever longer and more bizarre punctuation sequences, I think a
better solution would be to try to expose the parts of the regex
engine that we are trying to control.

I have some ideas about how to do this, and I will try to write up an
RFC this week.



Re: RFC 110 (v3) counting matches

2000-08-31 Thread Mark-Jason Dominus


> (mystery: how
> can filling in $& be a lot slower than filling in $1?)

It isn't.  It's the same.  $1 might even be more expensive than $&.

It appears that many people don't understand the problem with $&.  I
will try to explain.

Maintaining the information required by $1 or $& slows down the regex
match, possibly by as much as forty to sixty percent, or more.  (How
much depends on details of the regex and the target string.)

For this reason, Perl has an optimization in it so that if you never
use $& anywhere in your program, Perl never maintains the information,
and every regex in your program runs faster.

But if you do use $& somewhere, Perl cannot apply the optimization,
and it must compute the $& information for every regex in the program.
Every regex becomes much slower.

In particular, if you load a module whose author happened to use $&,
all your regexes get slower, which might be an unpleasant surprise,
since you might not be aware of the cause.

A regex with backreferences is *also* slow.  But using backreferences
in one regex does not make all the *other* regexes slow.  If you have

/(...)/   # regex 1
/.../ # regex 2

Perl knows that it must compute the backreference information for
regex 1, and knows that it can skip computing the backreference
information for regex 2, because regex 2 contains no parentheses.

If you use a module that contains regexes that use backreferences,
those regexes run slowly, but there is no effect on *your* regexes.

The cost is just as high for backreferences as for $&, but the
backreference cost is paid only by regexes that actually need it.

The $& cost is paid by every regex in the entire program, whether they
used it or not.  This is because Perl has no way to tell which regexes
use $& and which do not.

One of Uri's suggestions in RFC 158 was to compute $& only for regexes
that have a /k modifier.  This would solve the $& problem because Perl
would compute $& only when asked to, and not for every other regex in
the rest of the program.
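
In the meantime, here is a minimal sketch of one way to get at the matched text without mentioning $& at all, using the offset arrays @- and @+ that arrived in 5.6. The sub name matched_text is made up for illustration, and I have not measured what this costs compared with $&.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text matched by $regex, or undef.
sub matched_text {
    my ($string, $regex) = @_;
    return undef unless $string =~ $regex;
    # $-[0] and $+[0] are the start and end offsets of the whole match.
    return substr($string, $-[0], $+[0] - $-[0]);
}

print matched_text('the quick brown fox', qr/qu\w+/), "\n";   # prints "quick"
```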




RFC 166 (disambiguator)

2000-08-29 Thread Mark-Jason Dominus


Richard Proctor suggests that (?) will match the empty string. 
Then it can be inserted into regexes to separate elements that need to
be separated.  For example, /$foo(?)bar/ interpolates the value of
$foo and then looks for that pattern followed by 'bar'.   You cannot
simply write /$foobar/ because then Perl tries to interpolate $foobar,
which is not what you wanted.

1. You can already write /${foo}bar/ to get what you wanted.  This
   solution already works inside of double-quoted strings.  (?) would
   not work inside of double-quoted strings.

2. You can already write /$foo(?:)bar/ to get what you wanted.  This
   is almost identical to what Richard proposed anyway.

It is really not clear to me that this problem needs to be solved any
better than it is already.
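
For the record, a minimal sketch of the two existing spellings side by side. The value of $foo is made up for the example.

```perl
use strict;
use warnings;

my $foo = 'foo';   # hypothetical pattern fragment

# Both forms interpolate $foo and then look for a literal 'bar'.
my $with_braces = qr/${foo}bar/;
my $with_group  = qr/$foo(?:)bar/;

for my $re ($with_braces, $with_group) {
    print 'foobar' =~ $re ? "match\n" : "no match\n";
}

# The braces form also works in an ordinary double-quoted string.
my $string = "${foo}bar";
print "$string\n";
```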

I suggest that this section be removed from the RFC.

Mark-Jason Dominus   [EMAIL PROTECTED]
I am boycotting Amazon. See http://www.plover.com/~mjd/amazon.html for details.




Re: RFC 110 (v3) counting matches

2000-08-29 Thread Mark-Jason Dominus


> On Mon, 28 Aug 2000, Mark-Jason Dominus wrote:
>
> > But there is no convenient way to run the loop once for each date and
> > split the dates into pieces:
> >
> > # WRONG
> > while (($mo, $dy, $yr) = ($string =~ /(\d\d)-(\d\d)-(\d\d)/g)) {
> >     ...
> > }
>
> What I use in a script of mine is:
>
> while ($string =~ /(\d\d)-(\d\d)-(\d\d)/g) {
>     ($mo, $dy, $yr) = ($1, $2, $3);
> }
>
> Although this, of course, also requires that you know the number of
> backreferences.

The real problem I was trying to discuss was not this particular
application.  I was trying to point out a larger problem, which is
that there are several regex features that are enabled or disabled
depending on what context the match is in, so that if you want one
scalar-context feature and one list-context feature at the same time,
there is no direct way to do it.

> Nicer would be to be able to assign from @matchdata or something
> like that :)

I agree.  There are many operations that would be simpler if there was
a magic array that contained ($1, $2, $3, ...).  If anyone wants to
write an RFC on this, I will help.
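
Until there is such an array, here is a rough sketch of how one can be faked today from the offset arrays @- and @+, so that the scalar-context /g loop and the list of groups are available together. The sub name each_match and the callback interface are assumptions for the example, not a proposed design.

```perl
use strict;
use warnings;

# Hypothetical helper: call $callback once per match with ($1, $2, ...).
sub each_match {
    my ($string, $regex, $callback) = @_;
    while ($string =~ /$regex/g) {
        # Rebuild the capture groups from their start and end offsets.
        # A group that did not take part in the match comes back as undef.
        my @groups = map {
            defined $-[$_] ? substr($string, $-[$_], $+[$_] - $-[$_]) : undef
        } 1 .. $#+;
        $callback->(@groups);
    }
}

my @dates;
each_match('08-28-00 and 08-29-00', qr/(\d\d)-(\d\d)-(\d\d)/,
           sub { push @dates, [@_] });
print join('/', @$_), "\n" for @dates;
```

The caller no longer needs to know how many backreferences the pattern has.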




Re: RFC 110 (v2) counting matches

2000-08-29 Thread Mark-Jason Dominus


> On Tue, 29 Aug 2000 08:47:25 -0400, Mark-Jason Dominus wrote:
>
> > m/.../Count,Insensitive   (instead of m/.../ti)
> >
> > That would escape the problem that we are running out of letters and
> > also the problem that the current letters are hard to remember.
>
> Yes, but wouldn't this give us backward compatibility problems? For
> example, code like
>
>   $result = m/(.)/Insensitive, ord $1;

No, because that is presently a syntax error.  The one you have to
watch out for is:

$result = m/(.)/s,Insensitive, ord $1;

> And, I don't really see the need for the comma.
>
> m/.../CountInsensitive   (instead of m/.../ti)

I guess, but to me CountInsensitive looks like one option, not two.




Overlapping RFCs 135 138 164

2000-08-29 Thread Mark-Jason Dominus


RFC135: Require explicit m on matches, even with ?? and // as delimiters.

C<?...?> and C</.../> are what makes Perl hard to tokenize.
Requiring them to be written C<m?...?> and C<m/.../> would
solve this.

(Nathan Torkington)

RFC138: Eliminate =~ operator.

Replace EXPR =~ m/.../ with m/.../ EXPR, and similarly for
s/// and tr///. Force an explicit dereference when using
qr/.../. Disallow the implicit treatment of a string as a
regular expression to match against.

(Steve Fink)

RFC164: Replace =~, !~, m//, and s/// with match() and subst()

Several people (including Larry) have expressed a desire to
get rid of C<=~> and C<!~>. This RFC proposes a way to replace
C<m//> and C<s///> with two new builtins, C<match()> and
C<subst()>.

(Nathan Wiger)


I would like to see these three RFCs merged into one if this is
appropriate.  I am calling on the three authors to discuss in private
email how this may be done.  I hope that the discussion will result in
the withdrawal of at least two of the three RFCs, and that this private
discussion produces a new RFC.  The new RFC should discuss the points
raised by all three existing RFCs, should investigate several
solutions in parallel, and should compare them with one another and
contrast the benefits and drawbacks of each one.









Re: RFC 158 (v1) Regular Expression Special Variables

2000-08-25 Thread Mark-Jason Dominus


> > Please correct me if I'm mistaken, but I believe that that's the way
> > they are implemented now.  A regex match populates the ->startp and
> > ->endp parts of the regex structure, and the elements of these items
> > are byte offsets into the original string.
>
> I haven't looked at it at all, and perhaps that's something Ilya
> did when creating @+ etc.  So you might be right.

As far as I know it's the same in 5.000.

I thought the problem with $& was that the regex engine has to adjust
the offsets in the startp/endp arrays every time it scans forward a
character or backtracks a character.

But maybe the effect of $& is greatly exaggerated or is a relic from
perl4?  Has anyone actually benchmarked this recently?