More on RFC 93 (was Re: RFC 316 (v1) ...)

2000-09-30 Thread Hugo

In [EMAIL PROTECTED], Bart Lateur writes:
:Yes, but RFC 93 has some other disadvantages.

In respect of the number of calls, there seems nothing in RFC 93
to stop us permitting the callback to return more or fewer than the
requested number of characters. So a filehandle, for example, could
choose to return some multiple of 4K blocks for every request. A
socket connection that applies a line-based protocol would probably
read a line at a time, while another socket might return just those
characters available to read without blocking.
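
A minimal sketch of that idea in present-day Perl 5, not RFC 93 itself. The name match_stream and its arguments are hypothetical, and the loop simply re-runs the regex over a growing buffer, so unlike an engine-driven callback it cannot tell whether more input would have produced a longer match. The point is only that the callback is free to ignore the requested size:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of RFC 93): keep asking $more for input
# until $re matches the accumulated buffer or the source runs dry.
sub match_stream {
    my ($more, $re, $want) = @_;
    my $buf = '';
    while (1) {
        if ($buf =~ $re) {
            # Return the matched text and the whole buffer read so far.
            return (substr($buf, $-[0], $+[0] - $-[0]), $buf);
        }
        my $chunk = $more->($want);
        return unless defined $chunk && length $chunk;   # end of input
        $buf .= $chunk;
    }
}

# A source that ignores the requested size and hands back whatever it has.
my @chunks = ('ab', 'cdx', 'yz');
my ($match, $seen) = match_stream(sub { shift @chunks }, qr/bcd/, 1);
print "matched '$match' after reading '$seen'\n";
```

Note that the accumulated buffer stays with the caller here; in the RFC 93 model it would be owned by the regex engine.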

:Furthermore, where is the resulting buffer stored? People usually still
:want a copy of their data, to do yet other things with. Here, the data
:has disappeared into thin air. The only way to get it, is putting
:capturing parens in the regex.

It seems to me that $` and $& are the right solutions here. I assume
that perl6 will not allow this to cause an overreaching performance
problem. In this context we have the additional advantage that the
only copy of the accumulated string is owned by the regexp engine,
so no additional copy need be made to protect it.

:Compared to that, RFC 93 feels like a straightjacket. To me.

Strangely it feels uncommonly liberating to me.

:You may have to completely rewrite your script. So much for code reuse.

I don't believe that it need be so painful to take advantage of it
in existing code. We can ease that by providing a selection of
helpful ready-rolled routines for common tasks.

Hugo



Re: RFC 348 (v1) Regex assertions in plain Perl code

2000-09-30 Thread Bart Lateur

On Sat, 30 Sep 2000 00:57:47 +0100, Hugo wrote:

:"local" inside embedded code will no longer be supported, nor will
:conditional regexes. The Perl5 -> Perl6 translator should warn if it
:ever encounters one of these.

I'm not convinced that removing either of these is necessary to the
main thrust of the proposal. They may both still be useful in their
own right, and you seem to offer little evidence against them other
than that you don't like them.

"local" promotes the idea of semi-permanent changes to global data. That
is a very bad coding practice; it shouldn't be encouraged. The fact that
it's pretty hard to predict precisely when embedded code will be called
(see the example in the RFC) conflicts with this, too.

It most definitely doesn't fit into the spirit of assertions.

There's an RFC requesting that *all* of these advanced features should
go. There's no justification there, either. I'm limiting myself here to
mentioning the features I do not consider essential for assertions to be
useful. It doesn't need local. Is that good enough for you? You may keep
it if you wish, but it is not essential.

And I do think that the semantics of "local" don't fit well into the
rest of Perl. Clearly, in

(?{local $c = $c+1 })

the scope of $c should be limited to this embedded code block!?!

I do like the idea of making (?{...}) an assertion, all the more
because we have a simple migration path that avoids unnecessarily
breaking existing scripts: wrap $code as '$^R = do { $code }; 1'.

Good. :-)

If you want to remove support for 'local' in embedded code, it is
worth a full proposal in its own right that will explain what will
happen if people try to do that. (I think it will make perl
unnecessarily more complex to detect and disable it in this case.)

Quite the contrary, I think. My guess is that this support for local
*complicates* implementation, and probably by a substantial amount.

Similarly if you want to remove support for (?(...)) completely,
you need to address the utility and options for migration for all
the available uses of it, not just the one addressed by the new
handling of (?{...}).

You're talking about conditional regexes? I am curious to see just *one*
good reason to keep them in. I've not yet seen anything using a regex
that makes use of it (apart from Perl5's embedded code assertions),
that can't be done without it. Anybody is free to prove me wrong. 
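
For reference, the Perl 5 idiom alluded to in the parenthesis, where a conditional regex turns embedded code into an assertion, looks roughly like this. It is a sketch of existing Perl 5 behaviour; the 0..255 range check is an illustrative choice and does not come from the thread:

```perl
use strict;
use warnings;

# Perl 5 idiom: (?(?{ CODE })(?!)) fails the match when CODE is true,
# so the code block acts as an assertion on what was just captured.
my $byte = qr/^(\d+)(?(?{ $1 > 255 })(?!))$/;

for my $s (qw(0 255 256 1000)) {
    printf "%-4s %s\n", $s, ($s =~ $byte ? 'accepted' : 'rejected');
}
```

If (?{...}) were itself an assertion, as RFC 348 proposes, the conditional wrapper around the code block would no longer be needed for this use.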

-- 
Bart.



Re: RFC 316 (v1) Regex modifier for support of chunk processing and prefix matching

2000-09-30 Thread Bart Lateur

On Tue, 26 Sep 2000 11:55:32 +1100 (EST), Damian Conway wrote:

Wouldn't this interact rather badly with the /gc option (which also leaves
C<pos> set on failure)?

Yes.

The easy way out is to disallow combining /gc with /z. But, since this is
typically one of the applications it is aimed at, I should find a
solution. A different interface is one option.

This question arose because I was trying to work out how one would write a
lexer with the new /z option, and it made my head ache ;-)

Heheh. Your turn.   ;-)


I'm not sure I see that this:
...
is less intimidating or closer to the "ordinary program flow"  than:

   \*FH =~ /(abcd|bc)/g;

(as proposed in RFC 93).

Was that what was proposed? I think not. It was:

sub { ... } =~ /(abcd|bc)/g;


But I kinda like that syntax. But, in practice, it looks too much like
black magic:

 * Where is the string stored? It looks like it disappears into thin air.
 
 * What about pushback? Your proposal depends on it, but standard
filehandles don't support it, IMO. Does this require a TIEHANDLE
implementation?

 * Your regex shouldn't consume any more characters from the filehandle
than it matches? Where are the remaining characters pushed back into?

After every single keystroke, you can test what he just
entered against a regex matching the valid format for a number, so that
C<1234E> can be recognized as a prefix for the regex

/^\d+\.?\d*(?:E[+-]?\d+)$/

Isn't this just:

   \*STDIN =~ /^\d+\.?\d*(?:E[+-]?\d+)$/
   or die "Not a number";

???

No. First of all, you can't override the behaviour of STDIN. That reads
a whole line, then checks it, and then your script dies if it's not
right.

I want a test on every single keystroke, see if it's in sync with the
regex, and if it's not, reject it, i.e. no insertion in the input
buffer, and no echo on screen. Besides, you can't be sure your data
comes from a filehandle (or compatible handle). Not in a GUI.
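
A rough sketch of the per-keystroke check in current Perl 5, which has no /z modifier. The prefix pattern below is derived by hand from the number regex as quoted above, and accept_key is a hypothetical helper; the proposal is precisely that the engine should do this derivation for you:

```perl
use strict;
use warnings;

# Hand-derived "prefix" pattern for /^\d+\.?\d*(?:E[+-]?\d+)$/ : it accepts
# every string that could still grow into a full match.
my $prefix_ok = qr/^(?:\d+\.?\d*(?:E[+-]?\d*)?)?\z/;

# Hypothetical helper: return the new buffer, or the old one if the
# keystroke would take the input out of sync with the regex.
sub accept_key {
    my ($buffer, $key) = @_;
    my $candidate = $buffer . $key;
    return $candidate =~ $prefix_ok ? $candidate : $buffer;
}

my $buffer = '';
$buffer = accept_key($buffer, $_) for qw(1 2 x E + - 3);
print "$buffer\n";    # the "x" and the "-" were rejected
```

Nothing here depends on a filehandle: the keys could just as well come from a GUI event loop.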

-- 
Bart.



Re: RFC 72 (v4) Variable-length lookbehind.

2000-09-30 Thread Bart Lateur

On 30 Sep 2000 19:50:27 -, Perl6 RFC Librarian wrote:

In Perl6, lookbehind in regular expressions should be extended to permit
not only fixed-length, but also variable-length lookbehind.

I see no mention of negative lookbehind.

As I wrote before, in:

/(?<!ab*c)x/

The lookbehind should fail if *any* lookbehind string can be found
matching, and not succeed if there's a string to be found that doesn't
match! In the latter case, negative lookbehind would be useless.
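
The semantics described above can be emulated in current Perl 5 by testing the text in front of each candidate position. This is only a sketch of the intended meaning of (?<!ab*c)x, with a hypothetical helper name, not a proposal for how the engine should implement it:

```perl
use strict;
use warnings;

# Hypothetical helper: offsets of every "x" NOT immediately preceded by
# something matching ab*c.
sub x_not_after_abc {
    my ($s) = @_;
    my @offsets;
    while ($s =~ /x/g) {
        my $start = $-[0];
        # Reject if ANY string ending right before the "x" matches ab*c.
        push @offsets, $start unless substr($s, 0, $start) =~ /ab*c\z/;
    }
    return @offsets;
}

print join(',', x_not_after_abc('abbcx acx bcx x')), "\n";
```

The first two "x" are rejected because "abbc" and "ac" precede them; the last two are kept.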

-- 
Bart.



Re: RFC 331 (v1) Consolidate the $1 and C<\1> notations

2000-09-30 Thread Bart Lateur

On 28 Sep 2000 20:57:39 -, Perl6 RFC Librarian wrote:

Currently, C<\1> and $1 have only slightly different meanings within a
regex.  Let's consolidate them together, eliminate the differences, and
settle on $1 as the standard.

I wrote this before, but apparently you didn't hear it. Let me repeat:
$foo on the LHS allows metacharacter matching, for example "a.*b" can
match "a foo b". But \1 only allows literal strings. If $1 captured
"a.*b", then \1 will only match the literal string "a.*b", as if the
regex contained "a\.\*b".
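
The difference is easy to demonstrate in Perl 5. A small sketch, with test strings chosen purely for illustration:

```perl
use strict;
use warnings;

my $pat = 'a.*b';

# An interpolated variable is compiled as a pattern: metacharacters are live.
my $interpolated = 'a foo b' =~ /^$pat\z/;

# \Q...\E quotes the interpolation, so only the literal text "a.*b" matches.
my $quoted = 'a foo b' =~ /^\Q$pat\E\z/;

# \1 matches only the literal text captured by group 1.
my $backref_same  = 'a.*b a.*b'   =~ /^(\S+) \1\z/;
my $backref_other = 'a.*b a foo b' =~ /^(\S+) \1\z/;

printf "interpolated: %s\n", $interpolated  ? 'match' : 'no match';
printf "quoted:       %s\n", $quoted        ? 'match' : 'no match';
printf "backref same: %s\n", $backref_same  ? 'match' : 'no match';
printf "backref diff: %s\n", $backref_other ? 'match' : 'no match';
```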

I don't see how you can possibly consider this a "tiny difference".

-- 
Bart.



Re: RFC 331 (v1) Consolidate the $1 and C<\1> notations

2000-09-30 Thread Dave Storrs



On Sat, 30 Sep 2000, Bart Lateur wrote:

 I wrote this before, but apparently you didn't hear it. Let me repeat:

You're right, I missed your email when I was incorporating things
into the new version.  Apologies.


 $foo on the LHS allows metacharacter matching, for example "a.*b" can
 match "a foo b". But \1 only allows literal strings. If $1 captured

I don't believe it matters...my version of $1 works exactly like
the current \1 and my $/[1] works exactly like the current $1.  

Dave




RFC 150

2000-09-30 Thread Kevin Walker

=head1 TITLE

Extend regex syntax to provide for return of a hash of matched subpatterns

=head1 VERSION

   Maintainer: Kevin Walker [EMAIL PROTECTED]
   Date: 23 Aug 2000
   Mailing List: [EMAIL PROTECTED]
   Number: 150
   Version: 2
   Status: Frozen

=head1 ABSTRACT

Currently regexes return matched subpatterns as a list.  This is
inconvenient in at least two situations: (1) long, complicated regexes,
where counting parentheses can be difficult and error-prone; and, more
importantly, (2) matching against a list of regexes, when the corresponding
fields of the various regexes do not occur in the same order.


=head1 DESCRIPTION

I suggest that (?% field_name : pattern) spit out 'field_name', in addition
to the matched pattern, when matching in a list context:

   $text = "abajace -- mailbox full";
   %hash = $text =~ /^ (?% username : \S+) \s*--\s* (?% reason : .*)$/xsi;

would result in %hash = (username => 'abajace', reason => 'mailbox full').
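
For comparison, a sketch of the same extraction using the named captures that Perl 5.10 and later provide, (?<name>pattern) together with the %+ hash. This is not the syntax proposed in this RFC, only the closest runnable analogue:

```perl
use strict;
use warnings;

my $text = "abajace -- mailbox full";
my %hash;
if ($text =~ /^ (?<username> \S+) \s*--\s* (?<reason> .*)$/xsi) {
    %hash = %+;    # copy the named captures out of the special %+ hash
}
print "$_ = '$hash{$_}'\n" for sort keys %hash;
```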

Suggestions for better syntax are hereby solicited.  (?% field_name -
pattern) and (?% field_name = pattern) come immediately to mind.


Why This Would be Useful:

Often one wants to match a string against a list of patterns which extract
similar information from the string, but the fields occur in varying orders.
Also, some optional fields might get extracted by some patterns and not by
others.  Continuing with the (over-simplified) example of analyzing e-mail
bounce messages:

   my @regexps = (

       # 'abajace -- mailbox full' or 'abajace -- user unknown'
       q/^ \s* (?% username  : \S+) \s*--\s* (?% reason : .*)$/,

       # 'Unknown local part: flycrake'
       q/^ \s* (?% reason : Unknown\ local\ part): \s* (?% username  : \S+)/,

       # 'New address for abajace is [EMAIL PROTECTED]'
       q/(?% reason : new\ address\ for) \s+ (?% username  : \S+) \s+ is \s+
         (?% new_address : \S+\@\S+)/,

   );

   while (my $bounce_text = get_next_message()) {
       my %field = ();
       for my $regexp (@regexps) {
           # Try each pattern in turn; stop at the first one that matches.
           if ( %field = $bounce_text =~ /$regexp/xsi ) {
               print "username: $field{username}, reason: $field{reason}\n";
               if ($field{new_address}) {
                   change_address($field{username}, $field{new_address});
               }
               last;
           }
       }
   }


Backrefs

It would also be useful to have named backrefs.  I propose that (\%field_name)
match a previous named bracket.  As before, I'm not attached to
the proposed syntax.
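
As a point of comparison only, Perl 5.10 and later spell a named backreference as \k<name>. A minimal sketch, with an illustrative pattern that is not taken from this RFC:

```perl
use strict;
use warnings;

# \k<word> refers back to whatever the group named "word" captured.
my $doubled = qr/\b (?<word> \w+) \s+ \k<word> \b/xi;

my $found;
$found = $+{word} if "this is is a test" =~ $doubled;
print "repeated word: $found\n" if defined $found;
```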


=head1 IMPLEMENTATION

I confess that I'm not an expert in regex internals.  Nevertheless, I'll go
out on a limb and assert that this will be relatively easy to implement,
with relatively few entangling side-issues.


=head1 REFERENCES

See also RFC 112.



Regex Extension RFC

2000-09-30 Thread Kevin Walker

=head1 TITLE

Allow multiply matched groups in regexes to return a listref of all matches

=head1 VERSION

   Maintainer: Kevin Walker [EMAIL PROTECTED]
   Date: 30 Sep 2000
   Version: 1
   Mailing List: [EMAIL PROTECTED]
   Status: Frozen


=head1 DESCRIPTION

Since the October 1 RFC deadline is nigh, this will be pretty informal.

Suppose you want to parse text which looks like:

 name: John Abajace
 children: Tom, Dick, Harry
 favorite colors: red, green, blue

 name: I. J. Reilly
 children: Jane, Gertrude
 favorite colors: black, white
 
 ...

Currently, this takes two passes:

 while ($text =~ /name:\s*(.*?)\n\s*
children:\s*(.*?)\n\s*
favorite\ colors:\s*(.*?)\n/sigx) {
 # now second pass for $2 ( = "Tom, Dick, Harry") and $3, yielding
 # list of children and favorite colors
 }

If we introduce a new construction, (?@ ... ), which means "spit out a
list ref of all matches, not just the last match", then this could be
done in one pass:

 while ($text =~ /name:\s*(.*?)\n\s*
children:\s*(?:(?@\S+)[, ]*)*\n\s*
favorite\ colors:\s*(?:(?@\S+)[, ]*)*\n/sigx) {
 # now we have:
 #  $1 = "John Abajace";
 #  $2 = ["Tom", "Dick", "Harry"]
 #  $3 = ["red", "green", "blue"]
 }
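
For reference, a runnable sketch of the two-pass approach in current Perl 5, building the same structure that (?@ ... ) would deliver directly. The second pass uses split, and the @records layout is an assumption made for illustration:

```perl
use strict;
use warnings;

my $text = <<'END';
name: John Abajace
children: Tom, Dick, Harry
favorite colors: red, green, blue

name: I. J. Reilly
children: Jane, Gertrude
favorite colors: black, white
END

my @records;
while ($text =~ /name:\s*(.*?)\n\s*
                 children:\s*(.*?)\n\s*
                 favorite\ colors:\s*(.*?)\n/sigx) {
    # First pass done: copy the captures before running another regex.
    my ($name, $children, $colors) = ($1, $2, $3);
    # Second pass: break the comma-separated fields into lists.
    push @records, {
        name     => $name,
        children => [ split /\s*,\s*/, $children ],
        colors   => [ split /\s*,\s*/, $colors ],
    };
}

printf "%s: %d children, %d colors\n",
    $_->{name}, scalar @{ $_->{children} }, scalar @{ $_->{colors} }
    for @records;
```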

Although the above example is contrived, I have very often felt the need
for this feature in real-world projects.

=head1 IMPLEMENTATION

Unknown.

=head1 REFERENCES

None.