Re: hyperoperators (was: Apocalypse)
Alberto Simoes wrote:
:2) using ^ for mapping operators.. this only works with two lists.
:The problem here is that we have a pair of lists, and want a
:list of pairs. There can be other situations where we have
:three lists, instead of a list of triplets... I thought it was
:better to have a 'evidence' or 'factorize' for lists in a way
:((a,b,c),(1,2,3)) will become ((a,1),(b,2),(c,3)) and
:((1,2,3),(4,5,6),(7,8,9)) will become ((1,4,7),(2,5,8),(3,6,9)).
:This way, the ^ operator could be replaced with a simple map...
:More generic, less operators confusion... better? maybe...

I'm hoping we'll get the facility to add user-defined hyperoperators,
in which case it will be easy to add other list-manipulation strategies
in a generic manner. With a bit of luck, the commonest such
hyperoperators will get to be in a standard class that everyone uses,
rather than everyone going off to invent their own symbols.

Hugo
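The transposition Alberto describes (a pair of lists becoming a list of pairs) can be sketched in plain Perl 5 as below. This is only an illustration: the name transpose is made up for the example and is not an existing or proposed builtin, and truncating to the shortest input list is an assumption of the sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a list of rows into a list of columns.
# Stops at the length of the shortest input list (an assumption of this sketch).
sub transpose {
    my @lists = @_;
    return () unless @lists;
    my $len = @{ $lists[0] };
    for my $l (@lists) {
        $len = @$l if @$l < $len;
    }
    # For each index, collect that element from every input list.
    return map { my $i = $_; [ map { $_->[$i] } @lists ] } 0 .. $len - 1;
}

my @pairs = transpose([qw(a b c)], [1, 2, 3]);
print join(" ", map { "(" . join(",", @$_) . ")" } @pairs), "\n";
```

Once the lists are transposed, an ordinary map over the resulting tuples does the per-element work, which is the "simple map" Alberto has in mind.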
Re: redraft (v2) for RFC 348 Regex assertions in plain Perl code
In [EMAIL PROTECTED], Bart Lateur writes:
:Likely the most justifiable to want to be able to execute Perl code in a reason
:This makes the implementation very tricky. I
:wouldn't be surprised if precisely this feature is the main reason why
:the current implementation is so notoriously unstable.

I'm not aware of any instability caused by this. The instability is
caused by various other factors, discussed at length on p5p.

:The fact that the embedded code is called 3 times, not more, surely
:surprised me. It probably will surprise many people. Apparently, it is
:only executed once for every lowercase letter, not just for any
:character.
:
:This unpredictability is yet another reason to discourage incrementally
:modifying global data structures.

I think this is precisely why the non-assertion form encourages use
of local() - in general, the local() constructs will have executed
a predictable number of times _that have not been unwound_ by the
time a successful match is achieved. I don't think this observation
(of mine) is particularly relevant to the proposal, however.

:=head2 /(?(condition)yes-pattern|no-pattern)/

The simplest form of this is (?(1)yes|no). This is rather harder to
emulate with other mechanisms without running to eval. OTTOMH it is
equivalent to (??{ defined($1) ? 'yes' : 'no' }).

Hugo
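For readers who have not met the (?(1)yes|no) form, here is a small Perl 5 sketch of it. The sub name balanced_word is invented for the example; the pattern itself uses only the conditional construct discussed above, with an empty no-pattern.

```perl
use strict;
use warnings;

# Accept a word that is either bare or fully parenthesised.
#   (\()?     optionally captures an opening paren as group 1
#   (?(1)\))  demands a closing paren only if group 1 took part in the match
sub balanced_word {
    my ($s) = @_;
    return $s =~ /\A(\()?\w+(?(1)\))\z/ ? 1 : 0;
}

print "$_: ", balanced_word($_), "\n" for "foo", "(foo)", "(foo", "foo)";
```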
More on RFC 93 (was Re: RFC 316 (v1) ...)
In [EMAIL PROTECTED], Bart Lateur writes:
:Yes, but RFC 93 has some other disadvantages.

In respect of the number of calls, there seems nothing in RFC 93 to
stop us permitting the callback to return more or fewer than the
requested number of characters. So a filehandle, for example, could
choose to return some multiple of 4K blocks for every request. A
socket connection that applies a line-based protocol would probably
read a line at a time, while another socket might return just those
characters available to read without blocking.

:Furthermore, where is the resulting buffer stored? People usually still
:want a copy of their data, to do yet other things with. Here, the data
:has disappeared into thin air. The only way to get it, is putting
:capturing parens in the regex.

It seems to me that $` and $& are the right solutions here. I assume
that perl6 will not allow this to cause an overreaching performance
problem. In this context we have the additional advantage that the
only copy of the accumulated string is owned by the regexp engine,
so no additional copy need be made to protect it.

:Compared to that, RFC 93 feels like a straightjacket. To me.

Strangely it feels uncommonly liberating to me.

:You may have to completely rewrite your script. So much for code reuse.

I don't believe that it need be so painful to take advantage of it
in existing code. We can ease that by providing a selection of
helpful ready-rolled routines for common tasks.

Hugo
Re: RFC 112 (v3) Assignment within a regex
In [EMAIL PROTECTED], "Richard Proctor" writes:
:In general all assignments should wait to the very end, and then assign
:them all. [...] If the expression finally fails the localised values
:would unroll.

Ah, I hadn't anticipated that - I had assumed you would get whatever
was the last value set. Please can you make sure this is clearly
explained in the next version of the RFC?

Hugo
Re: RFC 348 (v1) Regex assertions in plain Perl code
In [EMAIL PROTECTED], Perl6 RFC Librarian writes:
:=item assertion in Perl5
:
: (?(?{not COND})(?!))
: (?(?{not do { COND }})(?!))

Or (?(?{COND})|(?!)). Migration could consider replacing detectable
equivalents of such constructs with the favoured new construct.

:"local" inside embedded code will no longer be supported, nor will
:conditional regexes. The Perl5 - Perl6 translator should warn if it
:ever encounters one of these.

I'm not convinced that removing either of these is necessary to
the main thrust of the proposal. They may both still be useful in
their own right, and you seem to offer little evidence against them
other than that you don't like them.

I do like the idea of making (?{...}) an assertion, all the more
because we have a simple migration path that avoids unnecessarily
breaking existing scripts: wrap $code as '$^R = do { $code }; 1'.

If you want to remove support for 'local' in embedded code, it is
worth a full proposal in its own right that will explain what will
happen if people try to do that. (I think it will make perl
unnecessarily more complex to detect and disable it in this case.)

Similarly if you want to remove support for (?(...)) completely,
you need to address the utility and options for migration for all
the available uses of it, not just the one addressed by the new
handling of (?{...}).

Hugo
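The Perl 5 spelling of an assertion mentioned above, (?(?{COND})|(?!)), can be tried out with a sketch like the following. The sub name is_byte and the 0..255 condition are chosen only for illustration (they echo the example used elsewhere in this thread), and the sketch assumes a perl in which code blocks are usable as conditions.

```perl
use strict;
use warnings;

# Perl 5 emulation of a code assertion: (?(?{ COND })|(?!))
# If COND is true the empty "yes" branch matches; otherwise (?!) always
# fails and forces backtracking.
sub is_byte {
    my ($s) = @_;
    return $s =~ /\A(\d+)(?(?{ $1 < 256 })|(?!))\z/ ? 1 : 0;
}

print "$_ => ", is_byte($_), "\n" for qw(0 255 256 abc);
```

Under the proposal the conditional wrapper would disappear and the code block alone would act as the assertion; the wrapper is what a translator would need to recognise.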
Re: RFC 308 (v1) Ban Perl hooks into regexes
In [EMAIL PROTECTED], Tom Christiansen writes:
:I consider recursive regexps very useful:
:
: $a = qr{ (?> [^()]+ ) | \( (??{ $a }) \) };
:
:Yes, they're "useful", but darned tricky sometimes, and in
:ways other than simple regex-related stuff. For example,
:consider what happens if you do
:
:my $regex = qr{ (?> [^()]+ ) | \( (??{ $regex }) \) };
:
:That doesn't work due to differing scopings on either side
:of the assignment.

Yes, this is a problem. But it bites people in other situations
as well:
  my $fib = sub { $_[0] < 2 ? 1 : &$fib($_[0] - 1) };

I haven't kept up with the non-regexp RFCs, but I hope someone has
suggested an alternative scoping that would permit these cases to
refer to the just-introduced variable. Perhaps we should special-case
qr{} and sub{} - I can't offhand think of another area that suffers
from this, and I don't think these two areas would suffer from an
inability to refer to the same-name variable in an outlying scope.

A useful alternative might be a different special case. Plucking
random grammar, perhaps:
  my $regex = qr{ (?> [^()]+ ) | \( ^^ \) }x;

Certainly I think a simple self-reference is likely to be a common
enough use that it would help to avoid the full deferred eval
infrastructure, even when it works properly.

:And clearly a non-regex approach could be more legible for
:recursive parsing.

Like any aspect of programming, if you use it regularly it will
become easier to read. And comments are a wonderful thing.

Hugo
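The usual Perl 5 workaround for the scoping problem discussed above is to declare the lexical first and assign to it in a separate statement. The sketch below shows that for both cases; the balanced-parentheses pattern is a commonly used variant rather than the exact pattern quoted in the thread, the sub name is_balanced is invented for the example, and the closure is a full Fibonacci written out for illustration.

```perl
use strict;
use warnings;

# Declare first, assign second, so that $re inside (??{ ... }) refers to
# the lexical being defined rather than to an outer or global $re.
my $re;
$re = qr{ \( (?: (?> [^()]+ ) | (??{ $re }) )* \) }x;

sub is_balanced {
    my ($s) = @_;
    return $s =~ /\A$re\z/ ? 1 : 0;
}

# The same two-step trick for a recursive anonymous sub.
my $fib;
$fib = sub { $_[0] < 2 ? 1 : $fib->($_[0] - 1) + $fib->($_[0] - 2) };

print is_balanced("(a(b)c)"), " ", is_balanced("(a(b)c"), " ", $fib->(5), "\n";
```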
Re: RFC 331 (v1) Consolidate the $1 and C<\1> notations
:=item *
:/(foo)_$1_bar/
:
:=item *
:/(foo)_C<\1>_bar/

Please don't do this: write C</(foo)_\1_bar/> or /(foo)_\1_bar/, but
don't insert C<> in the middle: that makes it much more difficult to
read.

:mean different things: the second will match 'foo_foo_bar', while the
:first will match 'foo[SOMETHING]bar' where [SOMETHING] is whatever was

should be: foo_[SOMETHING]_bar

:captured in the B<previous> match...which could be a long, long way away,
:possibly even in some module that you didn't even realize you were
:including (because it was included by a module that was included by a
:module that was included by a...).

This seems a bit unfair. It is just another variable. Any variable
you include in a pattern, you are assumed to know that it contains
the intended value - there is nothing special about $1 in this regard.

:The key fact here is that, in the first section of a s/// you are supposed
:to use C<\1>, but in the second portion you are supposed to use $1. If
:you understand the whole logical structure behind it and understand how an
:s/// works (i.e., the right hand side of an s/// is a double-quoted
:string, not a regex), you will understand the distinction. For newbies,
:however, it is apt to be quite confusing.

I think the whole idea that the LHS of s/// is a pattern, but the
RHS is a string (modulo /e, of course) is apt to be confusing when
you first encounter it. You won't be able to make sense of any but
the simplest use of s/// until you understand it, I think, and the
documentation expresses it quite clearly.

:=item *
:${P1} means what $1 currently means (first match in last regex)

Do you understand that this is the same variable as $P1? Traditionally,
perl very rarely coopts variable names that start with alphanumerics,
and (off the top of my head) all the ones it does so coopt are letters
only (ARGV, AUTOLOAD, STDOUT etc). I think we need better reasons to
extend that to all $P1-style variables.

If you are suggesting that they should have a special meaning only
in regexps, and only if braced, then I'd find it even more confusing.
The use of braces is usually the easiest (and only?) way to split
out a variable from following alphanumerics:
  /foo${P1}bar/

:These changes eliminate a potential source of confusion, retain all
:functionality, provide an easy migration path for P526, and the last
:notation (${P1}) serves as a clear indicator that you are talking about
:something from outside the current regex.

What is the migration path for existing uses of $P1-style variables?

:=item *
:s/(bar)(bell)/${P1}$2/ # changes "barbell" to "foobell"

Note that in the current regexp engine, ${P1} has disappeared by
the time matching starts. Can you explain why we need to change this?

Note also that if you are sticking with ${P1} either we need to
rename all existing user variables of this form, or we can no longer
use the existing 'interpolate this string' (or eval, double-eval etc)
routines, and have to roll our own for this (these) as well.

:=head1 IMPLEMENTATION
:
:This may require significant changes to the regex engine, which is a topic
:on which I am not qualified to speak. Could someone with more
:knowledge/experience please chime in?

Currently the regexp compiler is handed a string in which $variables
have already interpolated. We'd need to avoid that and get either
the raw data for the string or some list that has undergone a
minimum of preparation. It is possible we need that anyway - it is
a prerequisite for some of the other proposed enhancements (such as
the meta-referred-to RFC 112) and would certainly make the regexp
engine more flexible - but it is certainly substantial work. I don't
know what gotchas may arise.

In general it seems a shame to recreate large parts of the existing
string parsing/interpolation code, but it may not be possible to
avoid it.

Changing the lifetime of backreferences feels likely to be difficult,
but it isn't clear to me what you are trying to achieve here. I think
you at least need to add an example of how it would act under s///g
and s///ge.

:=head1 REFERENCES
:
:RFC 112: Assignment within a regex
:
:RFC 276: Localising Paren Counts in qr()s.

I didn't see a mention of these in the body of the proposal.

To me, the prime issue is with \1. The backslash is heavily overloaded
in perl, and that makes it difficult to suggest a consistent and
legible extension that would allow us to refer back to either
variables (RFC 112) or hash keys (RFC 150). I don't think switching
to $1 is any help for those, though.

Hugo
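To make the \1 versus $1 distinction in s/// concrete, here is a short Perl 5 sketch. The sub name dedupe_words is invented for the example; the behaviour shown is the existing Perl 5 one, not anything proposed in the RFC.

```perl
use strict;
use warnings;

# Collapse an immediately repeated word into a single occurrence.
sub dedupe_words {
    my ($text) = @_;
    # \1 on the left is a backreference, resolved while the pattern matches;
    # $1 on the right is interpolated into the replacement string after
    # the match has succeeded.
    $text =~ s/\b(\w+) \1\b/$1/g;
    return $text;
}

print dedupe_words("the the cat sat sat"), "\n";
```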
Re: RFC 332 (v1) Regex: Make /$/ equivalent to /\z/ under the '/s' modifier
In [EMAIL PROTECTED], Bart Lateur writes:
:I'll try to find that "thread" back.

This was my message:
http://www.mail-archive.com/perl6-language-regex%40perl.org/msg00354.html

:I don't think changing /s is the right solution. I think this will
:incline people to try and fix their problems by adding /s, without
:realising that this changes the definition of every . in their
:regexp as well.
:
:Perhaps. I do think that, in general, textual data falls into one of
:three categories:
:
: * text with possibly embedded newlines
: * text with no embedded newlines
: * text with an irrelevant newline at the very end.
:
:The '/s' option is for the 1st case. No '/s' for the 3rd. As for #2: you
:don't care.

I'd distinguish the first case further into 'the newlines are
significant' or not - /s is often desired for the first case, and
/m often for the second. And then I'd be tempted to repeat the whole
list, replacing 'newline' with 'record separator'.

I have to say I'm quite prejudiced against /s - I consider myself
reasonably knowledgeable about regexps, but on average about once
a month I find myself unsure enough about which is /m and which
is /s that I need to check the top of perlre to be sure. I think
we've appreciated for some time that it was a mistake to name them
as if they were opposites, but if anything I'd like to reduce the
need for them rather than to increase it.

Hugo
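Since /s and /m are so easily confused, a minimal Perl 5 reminder of what each one changes may help. The variable names are arbitrary; the behaviour is that of current Perl 5.

```perl
use strict;
use warnings;

my $text = "a\nb";

# /s changes what "." matches: with it, "." also matches a newline.
my $dot_plain = $text =~ /a.b/  ? 1 : 0;
my $dot_s     = $text =~ /a.b/s ? 1 : 0;

# /m changes where ^ and $ match: with it, they also match at line
# boundaries inside the string.
my $caret_plain = $text =~ /^b/  ? 1 : 0;
my $caret_m     = $text =~ /^b/m ? 1 : 0;

print "dot: $dot_plain $dot_s; caret: $caret_plain $caret_m\n";
```

The two flags are independent, which is why adding /s to "fix" an anchor problem silently changes every . in the pattern as well.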
Re: RFC 308 (v1) Ban Perl hooks into regexes
In 005501c027eb$43bafe60$[EMAIL PROTECTED], "Michael Maraist" writes:
:As you said, we shouldn't encourage full-fledged execution (since core dumps
:are common).

Let's not redefine the language just because there are bugs to fix.
Surely it is better to concentrate first on fixing the bugs so that
we can then more fairly judge whether the feature is useful enough
to justify its existence.

:One restriction might be to disallow various op-codes within the reg-ex
:assertion. Namely user-function calls, reg-ex's, and most OS or IO
:operations.

That seems quite unreasonable. Why do you _want_ to restrict someone
from calling isKeyword($1) within the regexp, which will then read
the keyword patterns from a file and check $1 against those patterns
using regexps? It seems like an entirely reasonable and useful thing
to do.

Hugo
Re: RFC 308 (v1) Ban Perl hooks into regexes
In [EMAIL PROTECTED], Bart Lateur writes:
:On 25 Sep 2000 20:14:52 -, Perl6 RFC Librarian wrote:
:
:Remove C<?{ code }>, C<??{ code }> and friends.
:
:I'm putting the finishing touches on an RFC to drop (?{...}) and replace
:it with something far more localized, hence cleaner: assertions, also in
:Perl code. That way,
:
: /(?<!\d)(\d+)(?{$1 < 256})/
:
:would only match integers between 0 and 255.

I'd like to suggest an alternative semantic for this: rename
(??{ code }) to (?{ code }), and use the newly freed (??{ code })
for the assertions.

(I was about to write an RFC for just that, so I'm glad I can save
a bit of time. :)

Hugo
Re: RFC 308 (v1) Ban Perl hooks into regexes
In [EMAIL PROTECTED], Perl6 RFC Librarian writes:
:It would be preferable to keep the regular expression engine as
:self-contained as possible, if nothing else to enable it to be used
:either outside Perl or inside standalone translated Perl programs
:without a Perl runtime.
:
:To do this, we'll have to remove the bits of the engine that call
:Perl code. In short: C<?{ code }> and C<??{ code }> must die.

I would have thought it more reasonable, if you wish to create
standalone translated Perl programs without a Perl runtime, to fail
with a helpful error if you encounter a construct that won't permit
it. You'll need to remove chunks of eval() and do() as well,
otherwise, and probably more besides.

In the context of a more shareable regexp engine, I would like to
see (? and (?? stay, but they need to be implemented more cleanly.
You could handle them quite nicely, I think, with just three
well-defined external hooks: one to find the matching brace at the
end of the code, one to parse the code, and one to run the code.
Anyone wishing to re-use the regexp library could then choose either
to keep the default drop-in replacements for those hooks (that die)
or provide their own equivalents to the perl usage.

I consider recursive regexps very useful:
  $a = qr{ (?> [^()]+ ) | \( (??{ $a }) \) };

.. and I class re-eval in general in the arena of 'making hard
things possible'. But whether or not they stay, it would probably
also be useful to have a more direct way of expressing simple
recursive regexps such as the above without resorting to a costly
eval. When I've tried to come up with an appropriate restriction,
however, I find it very difficult to pick a dividing line.

Hugo
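For comparison, Perl 5.10 and later provide a direct recursion construct, (?1), that re-enters a numbered capture group without any deferred eval. The sketch below uses it for balanced parentheses; it postdates this thread and is shown only to illustrate the kind of "more direct way" being asked for. The sub name is_balanced_r is invented for the example.

```perl
use strict;
use warnings;

# Requires perl 5.10 or later.
# (?1) re-enters capture group 1 at that point in the match, so the
# pattern can nest to any depth with no embedded Perl code.
sub is_balanced_r {
    my ($s) = @_;
    return $s =~ /\A ( \( (?: (?> [^()]+ ) | (?1) )* \) ) \z/x ? 1 : 0;
}

print is_balanced_r("(a(b)c)"), " ", is_balanced_r("(a(b)c"), "\n";
```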
Re: RFC 308 (v1) Ban Perl hooks into regexes
In [EMAIL PROTECTED], Perl6 RFC Librarian writes:
:=head1 ABSTRACT
:
:Remove C<?{ code }>, C<??{ code }> and friends.

Whoops, I missed this bit - what 'friends' do you mean?

Hugo
Re: Perlstorm #0040
In [EMAIL PROTECTED], Richard Proctor writes:
:
:TomCs perl storm has:
:
: Figure out way to do
:
: /$e1 $e2/
:
: safely, where $e1 might have '(foo) \1' in it.
: and $e2 might have '(bar) \1' in it. Those won't work.
:
:If e1 and e2 are qr// type things the answer might be to localise
:the backref numbers in each qr// expression.
:
:If they are not qr//s it might still be possible to achieve if the expansion
:of variables in regexes is done by the regex compiler it could recognise
:this context and localise the backrefs.
:
:Any code like this is going to have real problem with $1 etc if used later,
:use of assignment in a regex and named backrefs (RFC 112) would make this
:a lot safer.

I think it is reasonable to ask whether the current handling of qr{}
subpatterns is correct:

  perl -wle '$a=qr/(a)\1/; $b=qr/(b).*\1/; /$a($b)/g and print join ":", $1, pos for "aabbac"'
  a:5

I'm tempted to suggest it isn't; that the paren count should be local
to each qr{}, so that the above prints 'bb:4'. I think that most people
currently construct their qr{} patterns as if they are going to be
handled in isolation, without regard to the context in which they are
embedded - why else do they override the embedder's flags if not to
achieve that?

The problem then becomes: do we provide a mechanism to access the
nested backreferences outside of the qr{} in which they were referenced,
and if so what syntax do we offer to achieve that? I don't have an answer
to the latter, which tempts me to answer 'no' to the former for all the
wrong reasons. I suspect (and suggest) that complication is the only
reason we don't currently have the behaviour I suggest the rest of the
semantics warrant - that backreferences are localised within a qr().

I lie: the other reason qr{} currently doesn't behave like that is that
when we interpolate a compiled regexp into a context that requires it be
recompiled, we currently ignore the compiled form and act only on the
original string. Perhaps this is also an insufficiently intelligent thing
to do.

Hugo
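Perl 5.10 and later offer relative backreferences, \g{-1}, which are one way to write qr{} fragments that keep working when embedded in a larger pattern. The sketch below reworks the one-liner above with them; it is not the localised paren count suggested in the message (the captures are still numbered globally, so the combined $b match is $2 rather than $1), and the names $aa, $bb and try_match are invented for the example.

```perl
use strict;
use warnings;

# Requires perl 5.10 or later.
# \g{-1} means "the capture group most recently opened before this point",
# so each fragment refers to its own group wherever it is embedded.
my $aa = qr/(a)\g{-1}/;
my $bb = qr/(b).*\g{-1}/;

sub try_match {
    my ($s) = @_;
    return unless $s =~ /$aa($bb)/g;
    # $2 is the whole of the embedded $bb match; pos() is where it ended.
    return ($2, pos($s));
}

my @r = try_match("aabbac");
print join(":", @r), "\n";
```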
Re: \z vs \Z vs $
In 12839.969393548@chthon, Tom Christiansen writes:
:What can be done to make $ work "better", so we don't have to
:make people use /foo\z/ to mean /foo$/? They'll keep writing
:the $ for things that probably oughtn't abide optional newlines.
:
:Remember that /$/ really means /(?=\n?\z)/. And likewise with \Z.

It might be reasonable to redefine $ to mean the same as \z whenever
the /s flag is supplied. Another possibility would be to have a
scoped 'use re qw/simple_anchor/' pragma to achieve the same. And
another would be simply to switch the meaning of $ and \z.

None of these feel particularly satisfactory, however, and I think
any change to the current semantics would be difficult for existing
perl programmers.

Perhaps '$$' to mean 'match at end of string (without /m) or at end
of any line (with /m)'? The p52p6 translator can easily replace
references to $$ with ${$}. I can't think of a usefully different
meaning for ^^, but as currently defined it will already do the
right thing.

I don't know what proposals have come out of the other wgs, but
if we know when a variable has been read from a line-oriented input
medium, then we could turn on the special meaning of $ only in such
cases and define it as $$ above in all other cases. I think this
would be more confusing, though.

We could also consider changing the base definition to (?=($/)?\z),
particularly if $/ is to be seen as a regexp.

I think I like $$ the best.

Hugo
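The difference between the three end anchors in current Perl 5 (without /m) can be checked with a sketch like this one. The sub name end_anchors is invented for the example.

```perl
use strict;
use warnings;

# Report which of the three end anchors accept "foo" at the end of $s.
#   $  and \Z  match at the very end, or just before a final newline
#   \z         matches only at the very end
sub end_anchors {
    my ($s) = @_;
    return join ",",
        ($s =~ /foo$/  ? '$'  : ()),
        ($s =~ /foo\Z/ ? '\Z' : ()),
        ($s =~ /foo\z/ ? '\z' : ());
}

for my $s ("foo", "foo\n", "foo\n\n") {
    (my $shown = $s) =~ s/\n/\\n/g;
    print "'$shown': ", end_anchors($s), "\n";
}
```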
perl6-language-regex summary for 20000920
perl6-language-regex

Summary report 20000920

Mark-Jason Dominus has relinquished the wg chair due to the pressure
of other commitments; I'll be taking over the chair for the short time
remaining. Thanks to Mark-Jason for all his hard work.

I'll be contacting the authors of all outstanding RFCs shortly to
encourage them to work towards freezing them as soon as practical.

Hugo

RFC 72: The regexp engine should go backward as well as forward. (Peter Heslin)

Peter says (edited):
:If the regexp code is unlikely to be rewritten from the ground up, then
:there may be little chance of this feature being implemented. I'll make
:a pitch for it anyway at the end of my talk at YAPC::Europe, and then
:I'll freeze the RFC.

RFC 93: Regex: Support for incremental pattern matching (Damian Conway)

Now frozen at v3 with no changes; I don't think there was a v2.

RFC 110: counting matches (Richard Proctor)

Richard added my suggestions about the interaction between /t, /g and
\G, and froze the RFC soon after.

RFC 112: Assignment within a regex (Richard Proctor)

No discussion.

RFC 138: Eliminate =~ operator. (Steve Fink)

Withdrawn.

RFC 144: Behavior of empty regex should be simple (Mark Dominus)

Frozen.

RFC 145: Brace-matching for Perl Regular Expressions (Eric Roode)

No discussion directly about this RFC. The discussion of
XML/HTML-specific extensions continued for a short while, but has not
resulted in an RFC. The closest we have to an emerging consensus
appears to be that it is very difficult to pin down a precise problem
to solve - the areas in which we want to match pairs of delimiters
(such as numeric expressions, C code, perl code, HTML and XML) each
seem to require a variety of special cases, each different from the
other.

RFC 150: Extend regex syntax to provide for return of a hash of matched subpatterns (Kevin Walker)

One suggestion from me of (?\%key) for backreferencing, but no
substantive discussion.

RFC 158: Regular Expression Special Variables (Uri Guttman)

No discussion.

RFC 164: Replace =~, !~, m//, s///, and tr// with match(), subst(), and trade() (Nathan Wiger)

This RFC has now been frozen; the frozen version included some
rewording and a couple of additional explanatory notes, as well as
introducing a typo ('$gotis') in an example.

RFC 165: Allow variables in tr/// (Richard Proctor)

Surprisingly, no discussion.

RFC 166: Alternative lists and quoting of things (Richard Proctor)

New version, with a new name (was 'Additions to regexs'). This RFC
is not currently available from the archive due to a misfiling, but
you'll find it here:
http://www.mail-archive.com/perl6-language-regex@perl.org/msg00350.html

This removes two of the three original suggestions, and expands on
the remaining one. Mark-Jason pointed out that the (new) extension
to (?\Q$foo) is not needed.

RFC 170: Generalize =~ to a special-purpose assignment operator (Nathan Wiger)

Now frozen, with some modifications.

RFC 197: Numeric Value Ranges In Regular Expressions (David Nichol)

No discussion.

RFC 198: Boolean Regexes (Richard Proctor)

No discussion.

New RFCs

Of the other discussions that may still spawn a new RFC, most have
been mentioned previously. One new one: Tom Christiansen has asked
'[w]hat can be done to make $ work "better", so we don't have to
make people use /foo\z/ to mean /foo$/'.
Re: RFC 72 (v3) Variable-length lookbehind: the regexp engine should also go backward.
mike mulligan writes:
:From: Hugo [EMAIL PROTECTED]
:Sent: Tuesday, September 12, 2000 2:54 PM
:
: 3. The regexp is matched left to right: first the lookbehind, then 'X',
: then '[yz]'.
:
:Thanks for the insight - I was stuck in my bad assumption that the optimized
:behavior was the only behavior.
:
:What I am not sure of is whether the "optimization" is ever dangerous. In
:other words, is there ever a difference in end-result between, doing at each
:point: 1. test look-behind and then test the remainder of the regex, vs 2.
:test the remainder of the regex, and then test the look-behind?

Sometimes it may not be possible at all:
  "axbcxd" =~ /(?<= a(.)b ) c\1d/x;

:I am without a motivating example, but can anyone see utility in a
:non-greedy look-behind that operates in sense "2" above? Syntax:
:(?<=pat)? (?<!pat)? Currently, a question-mark like this on a
:look-behind makes it optional, defeating the assertion's purpose. If anyone
:has a good example, I'll take on writing a RFC.

Currently, a question mark like this on a lookbehind is apparently
ignored:
  crypt% ./perl -wle '/(?<=test)?/'
  Quantifier unexpected on zero-length expression before << HERE mark in regex m/(?<=test)? << HERE / at -e line 1.
  Use of uninitialized value in pattern match (m//) at -e line 1.
  crypt%

.. but I don't know why, since it could arguably be useful:
  / (?<= (\+|-) )? \d+ /x;
  print defined($1) ? "sign: '$1'\n" : "no sign\n";

Note that you can rewrite /(?<=[aeiou])X[yz]/ as /X[yz](?<=[aeiou]..)/
if you really want ...

Hugo
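Both points made above can be checked in current Perl 5 with a sketch along these lines. The sub names are invented for the example; the patterns are the ones from the message.

```perl
use strict;
use warnings;

# A capture made inside the lookbehind is used by the rest of the pattern,
# so here the lookbehind has to be evaluated before what follows it.
sub capture_in_lookbehind {
    my ($s) = @_;
    return $s =~ /(?<= a(.)b ) c\1d/x ? 1 : 0;
}

# Without such a dependency the lookbehind can be moved to the end, as
# long as it is widened to step back over what has just been matched.
# Both subs return the offset at which the match starts, or -1.
sub start_first { my ($s) = @_; return $s =~ /(?<=[aeiou])X[yz]/   ? $-[0] : -1 }
sub start_last  { my ($s) = @_; return $s =~ /X[yz](?<=[aeiou]..)/ ? $-[0] : -1 }

print capture_in_lookbehind("axbcxd"), " ",
      start_first("aXuhXvoXyz"), " ", start_last("aXuhXvoXyz"), "\n";
```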
negative variable-length lookbehind example
In RFC 72, Peter Heslin gives this example:
:Imagine a very long input string containing data such as this:
:
:... GCAAGAATTGAACTGTAG ...
:
:If you want to match text that matches /GA+C/, but not when it
:follows /G+A+T+/, you cannot at present do so easily.

I haven't tried to work it out exactly, but I think you can achieve
this (and fairly efficiently) with something like:
  /
    (?:
      ^ |                       # else we won't match at start
      (?: (?> G+ A+ T+) | (.) )*
      (?(1) | . )
    )
    G A+ C
  /x

This requires that the regexp engine reliably leaves $1 unset if
we took the G+A+T+ branch last time through the (...)*, which has
been an area of many bugs and no little discussion in perl5; I'm
not sure of the status of that in latest perls.

It isn't particularly relevant to this proposal since there are
other combinations that can't be resolved in this way; I thought
it might be of interest nonetheless.

Hugo
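A less clever way to get the effect Peter describes, offered only as a sketch, is to find each /GA+C/ match and then test the text before it with a separate end-anchored pattern. This swaps the single-regexp approach above for a loop, and it rescans the prefix for every candidate, so it makes no claim to efficiency on very long inputs. The sub name is invented for the example.

```perl
use strict;
use warnings;

# Return the start offsets of every /GA+C/ match that is not immediately
# preceded by text matching /G+A+T+/.
sub gac_not_after_gat {
    my ($s) = @_;
    my @starts;
    while ($s =~ /GA+C/g) {
        my $start = $-[0];   # save before the next match overwrites @-
        # Emulate the negative variable-length lookbehind on the prefix.
        next if substr($s, 0, $start) =~ /G+A+T+\z/;
        push @starts, $start;
    }
    return @starts;
}

print join(",", gac_not_after_gat("CCGAACGAATTGAAC")), "\n";
```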
Re: RFC 72 (v3) Variable-length lookbehind: the regexp engine should also go backward.
In 085601c01cc8$2c94f390$[EMAIL PROTECTED], "mike mulligan" writes:
:From: Hugo [EMAIL PROTECTED]
:Sent: Monday, September 11, 2000 11:59 PM
:
: mike mulligan replied to Peter Heslin:
: ... it is greedy in the sense of the forward matching "*" or "+"
:constructs.
:
: [snip]
:
: This is nothing to do with greediness and everything to do with
: left-to-rightness. The regexp engine does not look for x* except
: in those positions where the lookbehind has already matched.
:
:I was trying to understand at what point the lookbehind was attempted, and
:confused myself and posted a bad example. My apologies to everyone. Let's
:see if I can make sense of it on a second try.
:
:My question is: if I have the regex /(?<=[aeiou])X[yz]+/ then does Perl: 1.
:scan first for 'X', test the lookbehind, and then test the '[yz]', or 2.
:scan for 'X[yz]' and then test the lookbehind?

3. The regexp is matched left to right: first the lookbehind, then 'X',
then '[yz]'.

:I am expecting these two alternatives to give the same result, but certain
:test strings might run slower or faster depending on the approach.
:
:Running perl -Dr shows that alternative 1 is used:

Running perl -Dr shows that alternative 3 is used. However the -Dr
data is confused by the optimiser, which happens to have chosen the
fixed string 'X' as something worth searching for first. So the
optimiser permits the main matching engine to look only at those
positions where there is an 'X' immediately following.

I've annotated the -Dr output below to try and clarify. Note that
if you replace 'X' with '(x|X)', no optimisations take place (other
than a 'minimum length' check) and -Dr will give a much clearer
picture of the flow; again, if you replace 'X[yz]' with '(x|X)y'
the optimiser will now pick 'y' as the most significant thing worth
searching for.

Hope this helps,

Hugo
---
:qq(aXuhXvoXyz) =~ /(?<=[aeiou])X[yz]/
:
:Guessing start of match, REx `(?<=[aeiou])X[yz]' against `aXuhXvoXyz'...
The optimiser is entered.
:Found anchored substr `X' at offset 1...
This is what the optimiser is looking for.
:Guessed: match at offset 1
This is what the optimiser found.
:Matching REx `(?<=[aeiou])X[yz]' against `XuhXvoXyz'
The real matcher is entered.
: Setting an EVAL scope, savestack=3
: 1 a XuhXvoXyz | 1: IFMATCH[-1]
: 0 aXuhXvoXyz | 3:ANYOF[aeiou]
Checking lookbehind ...
: 1 a XuhXvoXyz | 12:SUCCEED
Ok.
: could match...
: 1 a XuhXvoXyz | 14: EXACT X
Checking 'X' ...
: 2 aX uhXvoXyz | 16: ANYOF[yz]
Checking '[yz]' ...
:failed...
Failed: try the next position permitted by the optimiser.
: Setting an EVAL scope, savestack=3
: 4 aXuh XvoXyz | 1: IFMATCH[-1]
: 3 aXu hXvoXyz | 3:ANYOF[aeiou]
Checking lookbehind ...
: failed...
Failed.
:failed...
Try the next position permitted by the optimiser.
: Setting an EVAL scope, savestack=3
: 7 aXuhXvo Xyz | 1: IFMATCH[-1]
: 6 aXuhXv oXyz | 3:ANYOF[aeiou]
Checking lookbehind ...
: 7 aXuhXvo Xyz | 12:SUCCEED
Ok.
: could match...
: 7 aXuhXvo Xyz | 14: EXACT X
Checking 'X' ...
: 8 aXuhXvoX yz | 16: ANYOF[yz]
Checking '[yz]' ...
: 9 aXuhXvoXy z | 25: END
:Match successful!
Match successful.
Re: RFC 158 (v1) Regular Expression Special Variables
Mark-Jason Dominus writes:
: There's also long been talk/thought about making $& and $1
: and friends magic aliases into the original string, which would
: save that cost.
:
:Please correct me if I'm mistaken, but I believe that that's the way
:they are implemented now. A regex match populates the ->startp and
:->endp parts of the regex structure, and the elements of these items
:are byte offsets into the original string.

I went on a briefish trawl for this the other day, and as far as I
can tell what happens is this:
- during matching, the startp/endp pairs are populated with offsets
  into the target string
- immediately after matching, the target string is copied if needed,
  and the PL_curpm object is updated to refer to the copy
- the copy is needed if any of the special variables can be referred
  to: $`, $&, $', $1, $2, ...

The result of that is that if there are backreferences in the regexp,
the copy is always needed; if not, the copy is needed only if $& or
her kin have been seen. So regexps with backrefs should suffer no
slowdown from use of $& in the same program, but regexps without
backrefs will get a (potentially) unnecessary copy.

The other problem with this, of course, is that the compiler may not
yet have seen the $& we intend to use:
  crypt% perl -wle '$_="foo"; /.*/; $_="bar"; print eval q{$&}'
  bar
  crypt%

.. and I think coredumps may be possible from this. (Hmm, perlbug
upcoming.)

Hugo
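The offsets described above are visible from Perl code through the @- and @+ arrays (available since perl 5.6), which makes it possible to recover the matched text from the original string without touching $& at all. A minimal sketch, with an invented sub name:

```perl
use strict;
use warnings;

# Rebuild the whole match and the first capture from the original string
# using only the offset arrays.
sub match_by_offsets {
    my ($str, $re) = @_;
    return unless $str =~ $re;
    # $-[0] and $+[0] are the start and end offsets of the whole match;
    # $-[1] and $+[1] are those of the first capture group, if it matched.
    my $whole = substr($str, $-[0], $+[0] - $-[0]);
    my $first = defined $-[1] ? substr($str, $-[1], $+[1] - $-[1]) : undef;
    return ($whole, $first);
}

my ($whole, $first) = match_by_offsets("hello world", qr/o w(or)/);
print "$whole / $first\n";
```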
all regexp RFCs
Hi guys,

I'm sorry that time has not permitted me to join and take an active
part in the perl6-language-regex list; however, I have grabbed an
opportunity to look through the RFCs generated to date, and thought
I should throw some comments at you.

Apologies in advance for so rudely dumping this lot and _still_ not
joining the list; sorry also if I duplicate stuff that's already been
said. Feel free to ignore all or any of this. You'll need to cc me
if you want me to see replies, and in that case you might want to do
what I didn't, and tailor the subject to be more specific.

I've tried in particular to add a note about implementation issues
in each case.

Enjoy,

Hugo
---

RFC 72: Variable-length lookbehind: the regexp engine should also go backward.
===

This is an interesting idea. However, it is not obvious to me that
there is any practical difference between the existing:
  /(?<= a+ ) b/x
.. and the proposed:
  /b (?`= a+ )/x
.. which implies that implementing one would be as difficult as the
other. And if that is the case, fixing (?<=...) to support variable
length would be preferable, since it is more general. (Consider
/\d+ (?<! 00) \. \d+/x, for example: AFAICS the proposed (?`=...)
does not allow the lookbehind to be anchored anywhere other than
the start of the match.)

While it would be great to have a working variable-length lookbehind,
it is not obvious how you would implement it: the internal structure
of a compiled regexp, as currently implemented, does not (I believe)
hold enough information to allow you to walk it backwards. It might
still be possible, though, with a fair amount of effort; you would,
for example, have to rewrite (?<= ([abc]) ([def]) g \2 \1 ) into
(?<= \1 \2 g ([def]) ([abc]) ), or maybe swap the \1 and \2.

RFC 93: Regex: Support for incremental pattern matching
===

I love this to bits. You might consider changing the arguments to
the fetcher($n;$s), such that if $n is positive it requests the next
$n characters, else it is a final call returning the -$n bytes of $s
to the stream. Not sure if this is any better than the current
proposal, but it might be easier to understand if the first argument
always represented a number of bytes.

I do not think implementation should be too difficult, though I
assume all optimisation should be turned off for such matches. It
might also be desirable to have a new regexp flag 'no optimisation
desired' to avoid the compile-time work done for optimisation's
sake, for optimisation's sake. IYSWIM.

RFC 110: counting matches
===

I like this too. I'd suggest /t should mean a) return a scalar of
the number of matches and b) don't set any special variables. Then
/t without /g would return 0 or 1, but be faster since no extra
information need be captured (except internally for (.)\1 type
matching - compile time checks could determine if these are needed,
though (?{..}) and (??{..}) patterns would require disabling of that
optimisation). /tg would give a scalar count of the total number of
matches. \G would retain its meaning.

Any which way, implementation should be fairly straightforward,
though ensuring that optimisations occurred precisely when they are
safe would probably involve a few bug-chasing cycles.

RFC 112: Assignment within a regex
===

This is cool, and has been requested several times in the past.
There is an outstanding issue of how variable references should be
scoped when encountered within regexps, however. Consider:
  {
    local $a = 1;
    my $re = qr{ (?$a = .) }x;
    {
      my $a = 2;
      "3" =~ $re;
      print $a;
    }
    print $a;
  }

This is a problem that needs to be solved in any case, for proper
understanding of how (?{..}) and (??{..}) should be interpreted,
and I assume this proposed feature should be handled the same way.

Implementation should not be particularly difficult once that knotty
issue is resolved.

RFC 144: Behavior of empty regex should be simple
===

Absolutely.

<snip>

RFC 145: Brace-matching for Perl Regular Expressions
===

This is an interesting idea. I'm not sure how useful it would
actually be: as far as I can see it would not match the block on
code such as:
  use matchpairs '{' => '}';
  <<EOF =~ /\m.*\M/;
  { my $brace = '{'; ... }
  EOF
.. and most of the pair-matching patterns I've tried to write in
the past have needed to cope with embedded oddities such as
quoted-strings, comments etc. It might be useful to add some more
complex examples to show how you'd deal with such things.

Another type of example that would be useful is HTML parsing:
  <table border=1>
  <tr>stuff...</tr>
  <tr>stuff...
  </TABLE>
.. since it also isn't clear to me whether you'd be able to extract
the table contents, or the rows, using the mechanisms of this
proposal.

RFC 150: Extend regex syntax to provide for return of a hash of matched subpatterns
===

This is cool - I don't think I've seen this suggested before.

Implementation might be a bit more work: the back