More on RFC 93 (was Re: RFC 316 (v1) ...)
In [EMAIL PROTECTED], Bart Lateur writes:
:Yes, but RFC 93 has some other disadvantages.

In respect of the number of calls, there seems nothing in RFC 93 to stop
us permitting the callback to return more or fewer than the requested
number of characters. So a filehandle, for example, could choose to
return some multiple of 4K blocks for every request. A socket connection
that applies a line-based protocol would probably read a line at a time,
while another socket might return just those characters available to
read without blocking.

:Furthermore, where is the resulting buffer stored? People usually still
:want a copy of their data, to do yet other things with. Here, the data
:has disappeared into thin air. The only way to get it, is putting
:capturing parens in the regex.

It seems to me that $` and $& are the right solutions here. I assume
that perl6 will not allow this to cause an overreaching performance
problem. In this context we have the additional advantage that the only
copy of the accumulated string is owned by the regexp engine, so no
additional copy need be made to protect it.

:Compared to that, RFC 93 feels like a straightjacket. To me.

Strangely it feels uncommonly liberating to me.

:You may have to completely rewrite your script. So much for code reuse.

I don't believe that it need be so painful to take advantage of it in
existing code. We can ease that by providing a selection of helpful
ready-rolled routines for common tasks.

Hugo
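To make the chunked-callback idea concrete, here is a minimal Perl 5 sketch. It is not RFC 93 itself: `match_from_callback` is a hypothetical helper that keeps its own buffer and simply retries the match after every chunk, whereas the RFC would have the regex engine own the accumulated string. The callback is free to return chunks of any size, and the prematch and match text stay available afterwards.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not part of RFC 93): pull chunks from
# a callback until the pattern matches or the callback runs dry.
sub match_from_callback {
    my ($re, $next_chunk) = @_;
    my $buffer = '';
    for (;;) {
        if ($buffer =~ $re) {
            # @- and @+ give the match offsets without the cost of $` and $&.
            return {
                prematch => substr($buffer, 0, $-[0]),
                match    => substr($buffer, $-[0], $+[0] - $-[0]),
            };
        }
        my $chunk = $next_chunk->();
        return unless defined $chunk && length $chunk;   # input exhausted
        $buffer .= $chunk;
    }
}

# The callback decides how much to hand over on each call.
my @chunks = ('xxab', 'cd', 'yy');
my $result = match_from_callback(qr/abcd|bc/, sub { shift @chunks });
print "prematch='$result->{prematch}' match='$result->{match}'\n";
```

Retrying from scratch is wasteful, and it accepts the first match it sees even when more input could have produced a longer one. Those are the gaps an engine-level implementation would have to close.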
Re: RFC 348 (v1) Regex assertions in plain Perl code
On Sat, 30 Sep 2000 00:57:47 +0100, Hugo wrote:

> :"local" inside embedded code will no longer be supported, nor will
> :conditional regexes. The Perl5 -> Perl6 translator should warn if it
> :ever encounters one of these.
>
> I'm not convinced that removing either of these is necessary to the
> main thrust of the proposal. They may both still be useful in their
> own right, and you seem to offer little evidence against them other
> than that you don't like them.

"local" promotes the idea of semi-permanently changing global data. That
is a very bad coding practice, and it shouldn't be encouraged. The fact
that it's pretty hard to predict precisely when embedded code will be
called (see the example in the RFC) conflicts with this, too. It most
definitely doesn't fit into the spirit of assertions.

There's an RFC requesting that *all* of these advanced features should
go. There's no justification there, either. I'm limiting myself here to
mentioning the features I do not consider essential for assertions to be
useful. It doesn't need local. Is that good enough for you? You may keep
it if you wish, but it is not essential.

And I do think that the semantics of "local" don't fit well into the
rest of Perl. Clearly, in

    (?{local $c = $c+1 })

the scope of $c should be limited to this embedded code block!?!

> I do like the idea of making (?{...}) an assertion, all the more
> because we have a simple migration path that avoids unnecessarily
> breaking existing scripts: wrap $code as '$^R = do { $code }; 1'.

Good. :-)

> If you want to remove support for 'local' in embedded code, it is
> worth a full proposal in its own right that will explain what will
> happen if people try to do that. (I think it will make perl
> unnecessarily more complex to detect and disable it in this case.)

Quite the contrary, I think. My guess is that this support for local
*complicates* implementation, and probably by a substantial amount.

> Similarly if you want to remove support for (?(...)) completely, you
> need to address the utility and options for migration for all the
> available uses of it, not just the one addressed by the new handling
> of (?{...}).

You're talking about conditional regexes? I am curious to see just *one*
good reason to keep them in. I've not yet seen anything using a regex
that makes use of it (apart from Perl5's embedded code assertions) that
can't be done without it. Anybody is free to prove me wrong.

-- 
	Bart.
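For readers who have not met "local" inside embedded code, the following Perl 5 sketch shows the behaviour under discussion. It is modelled on the counting example in the perlre documentation. The variable names are illustrative only, and both patterns are plain Perl 5 rather than anything proposed in the RFC.

```perl
use strict;
use warnings;

our ($cnt, $res_local, $res_plain);

# With local: each increment is undone when the engine backtracks past it,
# so $cnt reflects only the iterations that survive in the final match.
'aaaa' =~ /^ (?{ $cnt = 0 })
             (?: a (?{ local $cnt = $cnt + 1 }) )*
             aa
             (?{ $res_local = $cnt })
           $/x;

# Without local: increments made on abandoned paths are never undone.
'aaaa' =~ /^ (?{ $cnt = 0 })
             (?: a (?{ $cnt++ }) )*
             aa
             (?{ $res_plain = $cnt })
           $/x;

print "with local: $res_local, without local: $res_plain\n";
```

Both sides of the argument show up here. The localised counter tracks backtracking, which is what makes "local" useful in embedded code, but it does so through a scope that outlives the code block in which it is written.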
Re: RFC 316 (v1) Regex modifier for support of chunk processing and prefix matching
On Tue, 26 Sep 2000 11:55:32 +1100 (EST), Damian Conway wrote:

> Wouldn't this interact rather badly with the /gc option (which also
> leaves C<pos> set on failure)?

Yes. The easy way out is to disallow combining /gc with /z. But, since
this is typically one of the applications it is aimed at, I should find
a solution. A different interface is one option.

> This question arose because I was trying to work out how one would
> write a lexer with the new /z option, and it made my head ache ;-)

Heheh. Your turn. ;-)

> I'm not sure I see that this:
>
>     ...
>
> is less intimidating or closer to the "ordinary program flow" than:
>
>     \*FH =~ /(abcd|bc)/g;
>
> (as proposed in RFC 93).

Was that what was proposed? I think not. It was:

    sub { ... } =~ /(abcd|bc)/g;

But I kinda like that syntax. But, in practice, it looks too much like
black magic:

 * Where is the string stored? It looks like it disappears into thin
   air.

 * What about pushback? Your proposal depends on it, but standard
   filehandles don't support it, IMO. Does this require a TIEHANDLE
   implementation?

 * Your regex shouldn't consume any more characters from the filehandle
   than it matches? Where are the remaining characters pushed back
   into?

> > After every single keystroke, you can test what he just entered
> > against a regex matching the valid format for a number, so that
> > C<1234E> can be recognized as a prefix for the regex
> >
> >     /^\d+\.?\d*(?:E[+-]?\d+)$/
>
> Isn't this just:
>
>     \*STDIN =~ /^\d+\.?\d*(?:E[+-]?\d+)$/ or die "Not a number";
>
> ???

No. First of all, you can't override the behaviour of STDIN. That reads
a whole line, then checks it, and then your script dies if it's not
right. I want a test on every single keystroke, to see if it's in sync
with the regex, and if it's not, reject it, i.e. no insertion in the
input buffer, and no echo on screen.

Besides, you can't be sure your data comes from a filehandle (or
compatible handle). Not in a GUI.

-- 
	Bart.
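The per-keystroke check can be approximated in today's Perl 5 only by writing a second, "prefix" pattern by hand. The sketch below does that for the number regex quoted above. The `$prefix` pattern and the `accept_key` helper are assumptions made for illustration; deriving the prefix pattern automatically is what the proposed /z modifier is meant to do.

```perl
use strict;
use warnings;

# The full pattern quoted in the thread.
my $full = qr/^\d+\.?\d*(?:E[+-]?\d+)$/;

# Hand-derived companion (an assumption, not from the RFC): it accepts
# every string that can still be extended into a match of $full.
my $prefix = qr/^(?:\d+\.?\d*(?:E[+-]?\d*)?)?\z/;

# Hypothetical keystroke handler: keep the key only if the buffer stays
# a valid prefix; otherwise return the buffer unchanged (no echo).
sub accept_key {
    my ($buffer, $key) = @_;
    my $candidate = $buffer . $key;
    return $candidate =~ $prefix ? $candidate : $buffer;
}

my $buffer = '';
$buffer = accept_key($buffer, $_) for split //, '12x.5E+-3';
print "buffer: $buffer\n";
print "complete number\n" if $buffer =~ $full;
```

Keeping two patterns in step by hand is error-prone even for a regex this small, which is the practical argument for letting the engine report "failed, but only because the input ran out".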
Re: RFC 72 (v4) Variable-length lookbehind.
On 30 Sep 2000 19:50:27 -, Perl6 RFC Librarian wrote:

> In Perl6, lookbehind in regular expressions should be extended to
> permit not only fixed-length, but also variable-length lookbehind.

I see no mention of negative lookbehind. As I wrote before, in:

    /(?<!ab*c)x/

the lookbehind should fail if *any* string matching the lookbehind
pattern can be found ending at that position. It should not succeed
merely because some string that doesn't match can be found there! In the
latter case, negative lookbehind would be useless.

-- 
	Bart.
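The intended semantics can be emulated in Perl 5 without variable-length lookbehind by testing the text to the left of each candidate position against the lookbehind pattern anchored at its end. This is only a sketch of the semantics argued for above, and `x_not_preceded` is a hypothetical name.

```perl
use strict;
use warnings;

# Return the offsets of every "x" that is NOT preceded by a string
# matching ab*c. An "x" is rejected as soon as *any* such string ends
# immediately before it.
sub x_not_preceded {
    my ($s) = @_;
    my @hits;
    while ($s =~ /x/g) {
        my $start = $-[0];    # offset of this "x"
        push @hits, $start unless substr($s, 0, $start) =~ /ab*c\z/;
    }
    return @hits;
}

print join(',', x_not_preceded('acx abx')), "\n";
```

Under the rejected reading ("succeed if some non-matching string precedes x"), every "x" would qualify, since the empty string never matches ab*c.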
Re: RFC 331 (v1) Consolidate the $1 and C<\1> notations
On 28 Sep 2000 20:57:39 -, Perl6 RFC Librarian wrote:

> Currently, C<\1> and $1 have only slightly different meanings within a
> regex. Let's consolidate them together, eliminate the differences, and
> settle on $1 as the standard.

I wrote this before, but apparently you didn't hear it. Let me repeat:

$foo on the LHS allows metacharacter matching: for example, if $foo
contains "a.*b", it can match "a foo b". But \1 only allows literal
strings. If $1 captured "a.*b", then \1 will only match the literal
string "a.*b", as if the regex contained "a\.\*b".

I don't see how you can possibly consider this a "tiny difference".

-- 
	Bart.
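The difference is easy to check in Perl 5. The following short sketch uses made-up variable names to compare an interpolated variable, the same variable quoted with \Q...\E, and a backreference.

```perl
use strict;
use warnings;

my $pat = 'a.*b';

# An interpolated variable is compiled as a pattern: .* is a wildcard.
my $interpolated = 'a foo b' =~ /\A$pat\z/ ? 1 : 0;

# \Q...\E makes the interpolated text literal, like a backreference.
my $quoted = 'a foo b' =~ /\A\Q$pat\E\z/ ? 1 : 0;

# \1 matches only the literal text that group 1 captured.
my $backref_same  = 'a.*b a.*b'   =~ /\A(\S+) \1\z/ ? 1 : 0;
my $backref_other = 'a.*b a foo b' =~ /\A(\S+) \1\z/ ? 1 : 0;

print "interpolated=$interpolated quoted=$quoted ",
      "backref_same=$backref_same backref_other=$backref_other\n";
```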
Re: RFC 331 (v1) Consolidate the $1 and C<\1> notations
On Sat, 30 Sep 2000, Bart Lateur wrote:

> I wrote this before, but apparently you didn't hear it. Let me repeat:

You're right, I missed your email when I was incorporating things into
the new version. Apologies.

> $foo on the LHS allows metacharacter matching, for example "a.*b" can
> match "a foo b". But \1 only allows literal strings. If $1 captured

I don't believe it matters...my version of $1 works exactly like the
current \1 and my $/[1] works exactly like the current $1.

Dave
RFC 150
=head1 TITLE

Extend regex syntax to provide for return of a hash of matched subpatterns

=head1 VERSION

    Maintainer: Kevin Walker [EMAIL PROTECTED]
    Date: 23 Aug 2000
    Mailing List: [EMAIL PROTECTED]
    Number: 150
    Version: 2
    Status: Frozen

=head1 ABSTRACT

Currently regexes return matched subpatterns as a list. This is
inconvenient in at least two situations: (1) long, complicated regexes,
where counting parentheses can be difficult and error-prone; and, more
importantly, (2) matching against a list of regexes, when the
corresponding fields of the various regexes do not occur in the same
order.

=head1 DESCRIPTION

I suggest that (?% field_name : pattern) spit out 'field_name', in
addition to the matched pattern, when matching in a list context:

    $text = "abajace -- mailbox full";
    %hash = $text =~ /^ (?% username : \S+) \s*--\s* (?% reason : .*)$/xsi;

would result in %hash = (username => 'abajace', reason => 'mailbox full').

Suggestions for better syntax are hereby solicited.
(?% field_name -> pattern) and (?% field_name => pattern) come
immediately to mind.

Why This Would be Useful:

Often one wants to match a string against a list of patterns which
extract similar information from the string, but the fields occur in
varying orders. Also, some optional fields might get extracted by some
patterns and not by others.

Continuing with the (over-simplified) example of analyzing e-mail bounce
messages:

    my @regexps = (
        # 'abajace -- mailbox full' or 'abajace -- user unknown'
        q/^ \s* (?% username : \S+) \s*--\s* (?% reason : .*)$/,
        # 'Unknown local part: flycrake'
        q/^ \s* (?% reason : Unknown\ local\ part): \s* (?% username : \S+)/,
        # 'New address for abajace is [EMAIL PROTECTED]'
        q/(?% reason : new\ address\ for) \s+ (?% username : \S+) \s+ is \s+
          (?% new_address : \S+\@\S+)/,
    );

    while (my $bounce_text = get_next_message()) {
        my %field = ();
        for my $regexp (@regexps) {
            if (%field = $bounce_text =~ /$regexp/xsi) {
                print "username: $field{username}, reason: $field{reason}\n";
                if ($field{new_address}) {
                    change_address($field{username}, $field{new_address});
                }
                last;
            }
        }
    }

Backrefs

It would also be useful to have named backrefs. I propose that
(\%field_name) match a previous named bracket. As before, I'm not
attached to the proposed syntax.

=head1 IMPLEMENTATION

I confess that I'm not an expert in regex internals. Nevertheless, I'll
go out on a limb and assert that this will be relatively easy to
implement, with relatively few entangling side-issues.

=head1 REFERENCES

See also RFC 112.
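For comparison, the bounce-message example can be written with the named captures that later appeared in Perl 5.10, `(?<name>...)` together with the `%+` hash. This is a swapped-in technique, not the `(?% ...)` syntax proposed in this RFC. The `parse_bounce` helper and the `abajace@example.com` address are placeholders invented for the sketch.

```perl
use strict;
use warnings;
use 5.010;

my @regexps = (
    # 'abajace -- mailbox full' or 'abajace -- user unknown'
    qr/^ \s* (?<username> \S+) \s*--\s* (?<reason> .*) $/xsi,
    # 'Unknown local part: flycrake'
    qr/^ \s* (?<reason> Unknown\ local\ part): \s* (?<username> \S+)/xsi,
    # 'New address for abajace is <address>'
    qr/(?<reason> new\ address\ for) \s+ (?<username> \S+) \s+ is \s+
       (?<new_address> \S+\@\S+)/xsi,
);

# Hypothetical helper: return a hashref of named fields from the first
# pattern that matches, or nothing if none does.
sub parse_bounce {
    my ($text) = @_;
    for my $re (@regexps) {
        if ($text =~ $re) {
            my %field = %+;    # copy the named captures before they change
            return \%field;
        }
    }
    return;
}

for my $msg ('abajace -- mailbox full',
             'Unknown local part: flycrake',
             'New address for abajace is abajace@example.com') {
    my $f = parse_bounce($msg) or next;
    say "username: $f->{username}, reason: $f->{reason}";
}
```

Named backreferences exist in the same Perl versions as `\k<name>`, which covers the ground of the `(\%field_name)` suggestion.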
Regex Extension RFC
=head1 TITLE

Allow multiply matched groups in regexes to return a listref of all matches

=head1 VERSION

    Maintainer: Kevin Walker [EMAIL PROTECTED]
    Date: 30 Sep 2000
    Version: 1
    Mailing List: [EMAIL PROTECTED]
    Status: Frozen

=head1 DESCRIPTION

Since the October 1 RFC deadline is nigh, this will be pretty informal.

Suppose you want to parse text which looks like:

    name: John Abajace
    children: Tom, Dick, Harry
    favorite colors: red, green, blue

    name: I. J. Reilly
    children: Jane, Gertrude
    favorite colors: black, white

    ...

Currently, this takes two passes:

    while ($text =~ /name:\s*(.*?)\n\s*
                     children:\s*(.*?)\n\s*
                     favorite\ colors:\s*(.*?)\n/sigx) {
        # now second pass for $2 ( = "Tom, Dick, Harry") and $3, yielding
        # list of children and favorite colors
    }

If we introduce a new construction, (?@ ... ), which means "spit out a
list ref of all matches, not just the last match", then this could be
done in one pass:

    while ($text =~ /name:\s*(.*?)\n\s*
                     children:\s*(?:(?@\S+)[, ]*)*\n\s*
                     favorite\ colors:\s*(?:(?@\S+)[, ]*)*\n/sigx) {
        # now we have:
        #   $1 = "John Abajace";
        #   $2 = ["Tom", "Dick", "Harry"]
        #   $3 = ["red", "green", "blue"]
    }

Although the above example is contrived, I have very often felt the need
for this feature in real-world projects.

=head1 IMPLEMENTATION

Unknown.

=head1 REFERENCES

None.
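For reference, the two-pass approach described in the DESCRIPTION can be spelled out in Perl 5 roughly as follows. It is only a sketch of the status quo: the record layout and the `@records` structure are assumptions, and the second pass is an ordinary `split` on commas.

```perl
use strict;
use warnings;

my $text = <<'END';
name: John Abajace
children: Tom, Dick, Harry
favorite colors: red, green, blue

name: I. J. Reilly
children: Jane, Gertrude
favorite colors: black, white
END

my @records;
while ($text =~ /name:\s*(.*?)\n\s*
                 children:\s*(.*?)\n\s*
                 favorite\ colors:\s*(.*?)\n/sigx) {
    # Copy the captures first so later regex work cannot disturb them.
    my ($name, $children, $colors) = ($1, $2, $3);

    # Second pass: break the comma-separated fields into lists.
    push @records, {
        name     => $name,
        children => [ split /\s*,\s*/, $children ],
        colors   => [ split /\s*,\s*/, $colors ],
    };
}

for my $r (@records) {
    print "$r->{name}: ", scalar @{ $r->{children} }, " children, ",
          scalar @{ $r->{colors} }, " colors\n";
}
```

The proposed (?@ ... ) construct would deliver the same array references directly in $2 and $3, removing the separate `split` step.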