Re: RFC 308 (v1) Ban Perl hooks into regexes
I think the proposal that Joe McMahon and I are finishing up now will make these obsolete anyway.
Re: RFC 308 (v1) Ban Perl hooks into regexes
> On Mon, Sep 25, 2000 at 08:56:47PM +, Mark-Jason Dominus wrote:
> > I think the proposal that Joe McMahon and I are finishing up now will make these obsolete anyway.
>
> Good! The less I have to maintain the better...

Sorry, I meant that it would make (??...) and (?{...}) obsolete, not that it would make your RFC obsolete. Our proposal is agnostic about whether (??...) and (?{...}) should be eliminated.
Re: Perlstorm #0040
> I lie: the other reason qr{} currently doesn't behave like that is that when we interpolate a compiled regexp into a context that requires it be recompiled,

Interpolated qr() items shouldn't be recompiled anyway. They should be treated as subroutine calls. Unfortunately, this requires a reentrant regex engine, which Perl doesn't have. But I think it's the right way to go, and it would solve the backreference problem, as well as many other related problems.
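The backreference problem mentioned here can be sketched as follows. This is only an illustration with a made-up pattern ($double is not from the thread): because the interpolated qr() item is recompiled as part of the larger regex, its \1 is renumbered against the groups of the enclosing pattern.

```perl
use strict;
use warnings;

# Illustrative pattern: a doubled character, written with \1.
my $double = qr/(.)\1/;

# On its own, \1 refers to the (.) inside $double.
my $alone = "aa" =~ /^$double$/ ? 1 : 0;

# Interpolated after another group, the pattern is recompiled as one regex,
# so (.) becomes group 2 while \1 now refers to the outer (x).
my $doubled_after_x = "xaa" =~ /^(x)$double$/ ? 1 : 0;
my $x_a_x           = "xax" =~ /^(x)$double$/ ? 1 : 0;

print "alone=$alone xaa=$doubled_after_x xax=$x_a_x\n";
```

If interpolated items were called like subroutines instead of being recompiled, \1 would keep its meaning relative to $double.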
Re: RFC 166 (v2) Alternative lists and quoting of things
> (?Q$foo) Quotes the contents of the scalar $foo - equivalent to (??{ quotemeta $foo }).

How is this different from \Q$foo\E?
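For reference, a small sketch of what \Q...\E already does. The sample value of $foo is invented for the example.

```perl
use strict;
use warnings;

my $foo = 'a.b*c';    # sample value containing regex metacharacters

# \Q...\E applies quotemeta to the interpolated text.
my $via_escape    = "\Q$foo\E";
my $via_quotemeta = quotemeta $foo;

# Quoted, the pattern matches only the literal text.
my $literal_hit  = 'a.b*c' =~ /^\Q$foo\E$/ ? 1 : 0;
my $literal_miss = 'aXbbc' =~ /^\Q$foo\E$/ ? 1 : 0;

# Unquoted, '.' and '*' act as metacharacters.
my $unquoted_hit = 'aXbbc' =~ /^$foo$/ ? 1 : 0;

print "$via_escape\n";
```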
Re: RFC 72 (v1) The regexp engine should go backward as well as forward.
> Simply put, I want variable-length lookbehind.

Why didn't you simply propose that the (?<=...) operator be fixed to support variable-length expressions? Why so much additional machinery?
Re: $& and copying: rfc 158 (was Re: RFC 110 (v3) counting matches)
> in any case, i think we have a fair agreement on rfc 158 and i will freeze it if there is no further comments on it.

I think you should remove the parts of your proposal about making $& be autolocalized. If you're not planning to revise your RFC, let me know so that I can ask the librarian to mark it as withdrawn.
Re: XML/HTML-specific ?< and ?> operators?
> : it looks worse and dumps core.
>
> That's because the first non-paren forces it to recurse into the second branch until you hit REG_INFTY or overflow the stack. Swap second and third branches and you have a better chance:

I think something else goes wrong there too.

>     $re = qr{...}
>
> (I haven't checked that there aren't other problems with it, though.)

Try this:

    "(x)(y)" =~ /^$re$/;

This should match, but it dumps core. I don't think there is infinite recursion, although I might be mistaken.

Anyway, Snobol has a nice heuristic to prevent infinite recursion in cases like this, but I'm not sure it's applicable to the way the Perl regex engine works. I will think about it.
Re: XML/HTML-specific ?< and ?> operators?
> :Anyway, Snobol has a nice heuristic to prevent infinite recursion in
> :cases like this, but I'm not sure it's applicable to the way the Perl
> :regex engine works. I will think about it.
>
> It is probably worth adding the heuristic above: anytime you recurse into the same re at the same position, there is an infinite loop.

That is basically it, except that in snobol it is inside out: Each recursively interpolated pattern is assumed to match a string of at least length 1, and if the remaining part of the target string isn't sufficiently long to match the rest of the pattern after recursion, then the recursion is skipped.
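The heuristic described above can be sketched with a toy backtracking matcher. Everything here is an assumption made for illustration: the node types ('lit', 'run', 'seq', 'alt', 'rec') and the helper names are invented, and this is neither SNOBOL's implementation nor anything in the Perl regex engine. The point is only the check in the 'rec' branch, which skips a recursion when the rest of the target string is too short.

```perl
use strict;
use warnings;

# Hypothetical pattern nodes (an illustration only):
#   ['lit', $s]     the literal string $s
#   ['run', $set]   one or more characters that are not in $set
#   ['seq', @p]     each of @p in order
#   ['alt', @p]     the first of @p that leads to an overall match
#   ['rec', \$p]    a recursive reference to the pattern stored in $p

# Lower bound on the number of characters a pattern can match.
# A recursive reference is simply assumed to need at least one.
sub min_len {
    my ($type, @args) = @{ $_[0] };
    return length $args[0] if $type eq 'lit';
    return 1               if $type eq 'run' || $type eq 'rec';
    my @lens = map { min_len($_) } @args;
    if ($type eq 'seq') { my $n = 0; $n += $_ for @lens; return $n }
    return (sort { $a <=> $b } @lens)[0];    # 'alt'
}

# $need is the minimum length still required by whatever follows $pat;
# $k is a continuation called with the position reached after $pat.
sub match_at {
    my ($pat, $str, $pos, $need, $k) = @_;
    my ($type, @args) = @$pat;
    if ($type eq 'lit') {
        my $lit = $args[0];
        return substr($str, $pos, length $lit) eq $lit
            && $k->($pos + length $lit);
    }
    if ($type eq 'run') {
        my $end = $pos;
        $end++ while $end < length($str)
                  && index($args[0], substr($str, $end, 1)) < 0;
        for (my $e = $end; $e > $pos; $e--) {    # longest run first
            return 1 if $k->($e);
        }
        return 0;
    }
    if ($type eq 'alt') {
        for my $branch (@args) {
            return 1 if match_at($branch, $str, $pos, $need, $k);
        }
        return 0;
    }
    if ($type eq 'seq') {
        my ($first, @rest) = @args;
        return match_at($first, $str, $pos, $need, $k) unless @rest;
        my $tail = ['seq', @rest];
        return match_at($first, $str, $pos, $need + min_len($tail),
                        sub { match_at($tail, $str, $_[0], $need, $k) });
    }
    if ($type eq 'rec') {
        # The heuristic: skip the recursion if the remaining text cannot hold
        # one character for this pattern plus what the rest still needs.
        return 0 if length($str) - $pos < 1 + $need;
        return match_at(${ $args[0] }, $str, $pos, $need, $k);
    }
    die "unknown pattern type '$type'";
}

sub matches {
    my ($pat, $str) = @_;
    return match_at($pat, $str, 0, 0, sub { $_[0] == length $str }) ? 1 : 0;
}

# parenstring = '(' *parenstring ')' | *parenstring *parenstring | non-parens
my $paren;
$paren = ['alt',
    ['seq', ['lit', '('], ['rec', \$paren], ['lit', ')']],
    ['seq', ['rec', \$paren], ['rec', \$paren]],
    ['run', '()'],
];

print matches($paren, '(x)(y)') ? "match\n" : "no match\n";
```

Without the length check, the second alternative would recurse into itself at the same position indefinitely. With it, every nested recursion raises the amount of text still required by one, so the descent stops once that exceeds what is left. A side effect of the "at least length 1" assumption is that this toy pattern rejects "()".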
Re: What's in a Regex (was RFC 145)
> > 2. Many people - including Larry - have voiced their desire to see =~ die a horrible death
>
> Please provide a look-up-able reference to Larry's saying that he wanted =~ to die a horrible death.

Larry said:

# Well, the fact is, I've been thinking about possible ways to get rid
# of =~ for some time now, so I certainly don't mind brainstorming in
# this direction.

That is in [EMAIL PROTECTED] which is archived at http://www.mail-archive.com/perl6-language-regex@perl.org/msg3.html

I think Nathan was exaggerating here, but maybe he knows something I don't.
Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))
> ...My point is that I think we're approaching this the wrong way. We're trying to apply more and more parser power into what classically has been the lexer / tokenizer, namely our beloved regular-expression engine.

I've been thinking the same thing. It seems to me that the attempts to shoehorn parsers into regex syntax have either been unsuccessful (yielding an underpowered extension) or illegible or both.

An approach that appears to have been more successful is to find ways to integrate regexes *into* parser code more effectively. Damian Conway's Parse::RecDescent module does this, and so does SNOBOL.

In SNOBOL, if you want to write a pattern that matches balanced parentheses, it's easy and straightforward and legible:

        parenstring = '(' *parenstring ')'
                    | *parenstring *parenstring
                    | span('()')

(span('()') is like [^()]* in Perl.)

The solution in Parse::RecDescent is similar.

Compare this with the solutions that work now:

        # man page solution
        $re = qr{
                  \(
                  (?:
                     (?> [^()]+ )    # Non-parens without backtracking
                   |
                     (??{ $re })     # Group with matching parens
                  )*
                  \)
                }x;

This is not exactly the same, but I tried a direct translation:

        $re = qr{   \( (??{$re}) \)
                  | (??{$re}) (??{$re})
                  | (?> [^()]+)
                }x;

and it looks worse and dumps core.

This works:

        qr{
          ^
          (?{ local $d=0 })
          (?:
              \( (?{$d++})
           |  \) (?{$d--}) (?(?{$d<0}) (?!) )
           |  (?> [^()]* )
          )*
          (?(?{$d!=0}) (?!) )
          $
        }x;

but it's rather difficult to take seriously.

The solution proposed in the recent RFC 145:

        /([^\m]*)(\m)(.*?)(\M)([^\m\M]*)/g

is not a lot better. David Corbin's alternative looks about the same.

On a different topic from the same barrel, we just got a proposal that ([23,39]) should match only numbers between 23 and 39. It seems to me that rather than trying to shoehorn one special-purpose syntax after another into the regex language, which is already overloaded, it would be better to try to integrate regex matching better with Perl itself. Then you could use regular Perl code to control things like numeric ranges.

Note that at present, you can get the effect of [(23,39)] by writing this:

        (\d+)(?(?{$1 < 23 || $1 > 39})(?!))

which isn't pleasant to look at, but I think it points in the right direction, because it is a lot more flexible than [(23,39)]. If you need to fix it to match 23.2 but not 39.5, it is straightforward to do that:

        (\d+(\.\d*)?)(?(?{$1 < 23 || $1 > 39})(?!))

The [(23,39)] notation, however, is doomed. All you can do is propose Yet Another Extension for Perl 7.

The big problem with

        (\d+)(?(?{$1 < 23 || $1 > 39})(?!))

is that it is hard to read and understand.

The real problem here is that regexes are single strings. When you try to compress a programming language into a single string this way, you end up with something that looks like Befunge or TECO. We are going in the same direction here.

Suppose there were an alternative syntax for regexes that did *not* require that everything be compressed into a single string? Rather than trying to pack all of Perl into the regex syntax, bit by bit, using ever longer and more bizarre punctuation sequences, I think a better solution would be to try to expose the parts of the regex engine that we are trying to control.

I have some ideas about how to do this, and I will try to write up an RFC this week.
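As a rough, runnable sketch of the numeric-range idea: the first pattern below is the conditional form from the message with anchors added, and the second does the range check in ordinary Perl after a plain match. The names $in_range_re and in_range_plain are invented for this example, and the anchors are an assumption. Without them the first form can also match a shorter run of digits inside a longer number.

```perl
use strict;
use warnings;

# In-regex version: fail the match via (?!) when the number is out of range.
my $in_range_re = qr/^(\d+)(?(?{ $1 < 23 || $1 > 39 })(?!))$/;

# Plain-Perl version: match the digits, then test the range in normal code.
sub in_range_plain {
    my ($s) = @_;
    return ($s =~ /^(\d+)$/ && $1 >= 23 && $1 <= 39) ? 1 : 0;
}

for my $n (qw(22 23 30 39 40)) {
    my $by_regex = $n =~ $in_range_re ? 1 : 0;
    printf "%s regex=%d plain=%d\n", $n, $by_regex, in_range_plain($n);
}
```

The second form is the kind of thing the message argues for: the regex handles the lexical part, and regular Perl code handles the constraint.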
Re: RFC 110 (v3) counting matches
> (mystery: how can filling in $& be a lot slower than filling in $1?)

It isn't. It's the same. $1 might even be more expensive than $&.

It appears that many people don't understand the problem with $&. I will try to explain.

Maintaining the information required by $1 or $& slows down the regex match, possibly by as much as forty to sixty percent, or more. (How much depends on details of the regex and the target string.)

For this reason, Perl has an optimization in it so that if you never use $& anywhere in your program, Perl never maintains the information, and every regex in your program runs faster. But if you do use $& somewhere, Perl cannot apply the optimization, and it must compute the $& information for every regex in the program. Every regex becomes much slower.

In particular, if you load a module whose author happened to use $&, all your regexes get slower, which might be an unpleasant surprise, since you might not be aware of the cause.

A regex with backreferences is *also* slow. But using backreferences in one regex does not make all the *other* regexes slow. If you have

        /(...)/    # regex 1
        /.../      # regex 2

Perl knows that it must compute the backreference information for regex 1, and knows that it can skip computing the backreference information for regex 2, because regex 2 contains no parentheses. If you use a module that contains regexes that use backreferences, those regexes run slowly, but there is no effect on *your* regexes.

The cost is just as high for backreferences as for $&, but the backreference cost is paid only by regexes that actually need it. The $& cost is paid by every regex in the entire program, whether they used it or not. This is because Perl has no way to tell which regexes use $& and which do not.

One of Uri's suggestions in RFC 158 was to compute $& only for regexes that have a /k modifier. This would solve the $& problem because Perl would compute $& only when asked to, and not for every other regex in the rest of the program.
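One common way to keep the cost local, in the spirit of the comparison above, is to capture the whole match with an explicit pair of parentheses and read $1 where $& would otherwise be used. A minimal sketch follows; the helper name whole_match is made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text matched by $re, without touching $&.
# The extra parentheses make the whole match available as $1, so only this
# regex pays for the capture.
sub whole_match {
    my ($str, $re) = @_;
    return $str =~ /($re)/ ? $1 : undef;
}

print whole_match("order 12345 shipped", qr/\d+/), "\n";
```

Note that wrapping a pattern this way shifts the numbers of any groups it already contains by one.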
RFC 166 (disambiguator)
Richard Proctor suggests that (?) will match the empty string. Then it can be inserted into regexes to separate elements that need to be separated. For example, /$foo(?)bar/ interpolates the value of $foo and then looks for that pattern followed by 'bar'. You cannot simply write /$foobar/ because then Perl tries to interpolate $foobar, which is not what you wanted.

1. You can already write /${foo}bar/ to get what you wanted. This solution already works inside of double-quoted strings. (?) would not work inside of double-quoted strings.

2. You can already write /$foo(?:)bar/ to get what you wanted. This is almost identical to what Richard proposed anyway.

It is really not clear to me that this problem needs to be solved any better than it is already. I suggest that this section be removed from the RFC.

Mark-Jason Dominus [EMAIL PROTECTED]
I am boycotting Amazon. See http://www.plover.com/~mjd/amazon.html for details.
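Both existing spellings can be checked with a short script. The sample strings are invented for the example.

```perl
use strict;
use warnings;

my $foo = "foo";

# 1. Braces delimit the variable name, in regexes and in double-quoted strings.
my $braces_match = "a foobar" =~ /${foo}bar/ ? 1 : 0;
my $braces_str   = "${foo}bar";

# 2. An empty non-capturing group also ends the variable name, in a regex only.
my $group_match = "a foobar" =~ /$foo(?:)bar/ ? 1 : 0;

print "$braces_match $group_match $braces_str\n";
```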
Re: RFC 110 (v3) counting matches
> On Mon, 28 Aug 2000, Mark-Jason Dominus wrote:
> > But there is no convenient way to run the loop once for each date and split the dates into pieces:
> >
> >         # WRONG
> >         while (($mo, $dy, $yr) = ($string =~ /(\d\d)-(\d\d)-(\d\d)/g)) {
> >           ...
> >         }
>
> What I use in a script of mine is:
>
>         while ($string =~ /(\d\d)-(\d\d)-(\d\d)/g) {
>           ($mo, $dy, $yr) = ($1, $2, $3);
>         }
>
> Although this, of course, also requires that you know the number of backreferences.

The real problem I was trying to discuss was not this particular application. I was trying to point out a larger problem, which is that there are several regex features that are enabled or disabled depending on what context the match is in, so that if you want one scalar-context feature and one list-context feature at the same time, there is no direct way to do it.

> Nicer would be to be able to assign from @matchdata or something like that :)

I agree. There are many operations that would be simpler if there was a magic array that contained ($1, $2, $3, ...). If anyone wants to write an RFC on this, I will help.
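To make the context issue concrete, here is a small sketch with invented sample data. In scalar context, /g steps through the matches one at a time. In list context, it returns every capture from every match as one flat list, which then has to be regrouped by hand.

```perl
use strict;
use warnings;

my $string = "08-28-00 and 09-25-00";    # sample data for the example

# Scalar-context /g: one iteration per date, pieces taken from $1, $2, $3.
my @dates;
while ($string =~ /(\d\d)-(\d\d)-(\d\d)/g) {
    push @dates, [$1, $2, $3];
}

# List-context /g: all captures at once, flattened; regroup three at a time.
my @flat = $string =~ /(\d\d)-(\d\d)-(\d\d)/g;
my @grouped;
while (my ($mo, $dy, $yr) = splice @flat, 0, 3) {
    push @grouped, [$mo, $dy, $yr];
}

print join(" ", map { join "/", @$_ } @dates), "\n";
```

Both loops need to know that the pattern has three groups, which is the limitation noted in the quoted reply.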
Re: RFC 110 (v2) counting matches
> On Tue, 29 Aug 2000 08:47:25 -0400, Mark-Jason Dominus wrote:
> >         m/.../Count,Insensitive (instead of m/.../ti)
> >
> > That would escape the problem that we are running out of letters and also the problem that the current letters are hard to remember.
>
> Yes, but wouldn't this give us backward compatibility problems? For example, code like
>
>         $result = m/(.)/Insensitive, ord $1;

No, because that is presently a syntax error. The one you have to watch out for is:

        $result = m/(.)/s,Insensitive, ord $1;

> And, I don't really see the need for the comma.
>
>         m/.../CountInsensitive (instead of m/.../ti)

I guess, but to me CountInsensitive looks like one option, not two.
Overlapping RFCs 135 138 164
RFC135: Require explicit m on matches, even with ?? and // as delimiters.

    C<?...?> and C</.../> are what makes Perl hard to tokenize. Requiring them to be written C<m?...?> and C<m/.../> would solve this. (Nathan Torkington)

RFC138: Eliminate =~ operator.

    Replace EXPR =~ m/.../ with m/.../ EXPR, and similarly for s/// and tr///. Force an explicit dereference when using qr/.../. Disallow the implicit treatment of a string as a regular expression to match against. (Steve Fink)

RFC164: Replace =~, !~, m//, and s/// with match() and subst()

    Several people (including Larry) have expressed a desire to get rid of C<=~> and C<!~>. This RFC proposes a way to replace C<m//> and C<s///> with two new builtins, C<match()> and C<subst()>. (Nathan Wiger)

I would like to see these three RFCs merged into one if this is appropriate. I am calling on the three authors to discuss in private email how this may be done. I hope that the discussion will result in the withdrawal of at least two of the three RFCs, and that this private discussion produces a new RFC.

The new RFC should discuss the points raised by all three existing RFCs, should investigate several solutions in parallel, and should compare them with one another and contrast the benefits and drawbacks of each one.

Mark-Jason Dominus [EMAIL PROTECTED]
I am boycotting Amazon. See http://www.plover.com/~mjd/amazon.html for details.
Re: RFC 158 (v1) Regular Expression Special Variables
Please correct me if I'm mistaken, but I believe that that's the way they are implemented now. A regex match populates the ->startp and ->endp parts of the regex structure, and the elements of these items are byte offsets into the original string.

I haven't looked at it at all, and perhaps that's something Ilya did when creating @+ etc. So you might be right.

As far as I know it's the same in 5.000.

I thought the problem with $& was that the regex engine has to adjust the offsets in the startp/endp arrays every time it scans forward a character or backtracks a character.

But maybe the effect of $& is greatly exaggerated or is a relic from perl4? Has anyone actually benchmarked this recently?
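At the Perl level, the offsets under discussion are visible through the @- and @+ arrays (the "@+ etc." mentioned above). The sketch below uses them to recover the matched text with substr. The helper name match_span and the sample string are invented, and this says nothing about how the engine maintains the arrays internally.

```perl
use strict;
use warnings;

# Hypothetical helper: return (start, end, text) for the whole match,
# computed from the offset arrays rather than from $&.
sub match_span {
    my ($str, $re) = @_;
    return unless $str =~ $re;
    return ($-[0], $+[0], substr($str, $-[0], $+[0] - $-[0]));
}

my ($start, $end, $text) = match_span("hello world", qr/w(or)ld/);
print "$start $end $text\n";
```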