Re: RFC 331 (v1) Consolidate the $1 and C<\1> notations

2000-09-28 Thread Nathan Wiger

> =item *
> C<\1> goes away as a special form
> 
> =item *
> $1 means what C<\1> currently means (first match in this regex)
> 
> =item *
> ${1} is the same as $1 (first match in this regex)
> 
> =item *
> ${P1} means what $1 currently means (first match in last regex)

Here's the big problem with this, and I think others have said it
similarly: If we need the functionality of both \1 and $1, then there is
no reason to redo the syntax. Period.

If \1 is unneeded, then let's ditch it and just use $1 everywhere.
However, this is not the case, as Randal, Bart, and others have shown.

If we need \1, then we should leave it as-is. There's no reason to force
literally millions of people to relearn this. Renaming something just to
rename it does not add value.

-Nate



Re: RFC 332 (v1) Regex: Make /$/ equivalent to /\z/ under the '/s' modifier

2000-09-28 Thread Nathan Wiger

> Is $$ the only alternative, or did I miss more? I don't think I've even
> seen this $$ mentioned before?

$$ is not a suitable alternative. It already means the current process
ID. It really cannot be messed with. And ${$} is identical to $$ by
definition.
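A minimal check of that last point, assuming a stock Perl 5 interpreter (the variable names below are illustrative, not from the thread):

```perl
use strict;
use warnings;

# $$ is the current process ID; ${$} is only the braced spelling
# of the same variable, so both must hold the same number.
my $pid_plain  = $$;
my $pid_braced = ${$};
my $same       = ($pid_plain == $pid_braced) ? 1 : 0;

print "pid=$pid_plain same=$same\n";
```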

> >I still like the idea of $$, as I described it in the original thread.
> >I've seen no comments for or against at this time.

See above.

> I can't see how yet another alternative, /$$/, is any better than what
> we have now: /\z/.

I agree. If it's more alternatives we're after, just have the person
write a custom regex. The idea is to make Perl do the right thing,
whatever that may be.

The big problem with changing $, as you note, is for people that need to
catch multiple instances in a string:

   $string = "Hello\nGoodbye\nHello\nHello\n";
   $string =~ s/Hello$/Goodbye/gm;

Without $, you can work around this like so:

   $string =~ s/Hello\n/Goodbye\n/gm;
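A small sketch comparing the two forms under current Perl 5 semantics. The second input, which lacks a final newline, is an added assumption to show where the workaround stops being equivalent:

```perl
use strict;
use warnings;

my $original = "Hello\nGoodbye\nHello\nHello\n";

# Form 1: $ under /m matches before every newline and at end of string.
(my $with_dollar = $original) =~ s/Hello$/Goodbye/gm;

# Form 2: spell the newline out instead of using $.
(my $with_newline = $original) =~ s/Hello\n/Goodbye\n/g;

my $identical = ($with_dollar eq $with_newline) ? 1 : 0;

# Hypothetical input without a trailing newline: the last "Hello"
# is only caught by the $ form.
my $no_trailing = "Hello\nHello";
(my $tail_dollar  = $no_trailing) =~ s/Hello$/Goodbye/gm;
(my $tail_newline = $no_trailing) =~ s/Hello\n/Goodbye\n/g;

print "identical on original: $identical\n";
print "dollar form:  ", join('|', split /\n/, $tail_dollar), "\n";
print "newline form: ", join('|', split /\n/, $tail_newline), "\n";
```

So the workaround is only equivalent when every line, including the last, ends in a newline.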

My suggestion would be:

   1. Make $ exactly always match just before the last \n, as the
  RFC suggests.

   2. Introduce some new \X switch that does what $ does
  currently if it's deemed necessary.

We're back to new alternatives again, but the one thing this buys you is
a $ that works consistently. I don't think many people need $'s current
functionality, and those that do can have a new \X.

-Nate



Re: RFC 332 (v1) Regex: Make /$/ equivalent to /\z/ under the '/s' modifier

2000-09-28 Thread Hugo

In <[EMAIL PROTECTED]>, Bart Lateur writes:
:I'll try to find that "thread" back.

This was my message:

  http://www.mail-archive.com/perl6-language-regex%40perl.org/msg00354.html

:>I don't think changing /s is the right solution. I think this will
:>incline people to try and fix their problems by adding /s, without
:>realising that this changes the definition of every . in their
:>regexp as well.
:
:Perhaps. I do think that, in general, textual data falls into one of
:three categories:
:
: * text with possibly embedded newlines
: * text with no embedded newlines
: * text with an irrelevant newline at the very end.
:
:The '/s' option is for the 1st case. No '/s' for the 3rd. As for #2: you
:don't care.

I'd distinguish the first case further into 'the newlines are
significant' or not - /s is often desired for the first case,
and /m often for the second. And then I'd be tempted to repeat
the whole list, replacing 'newline' with 'record separator'.

I have to say I'm quite prejudiced against /s - I consider myself
reasonably knowledgeable about regexps, but on average about once
a month I find myself unsure enough about which is /m and which
is /s that I need to check the top of perlre to be sure. I think
we've appreciated for some time that it was a mistake to name them
as if they were opposites, but if anything I'd like to reduce the
need for them rather than to increase it.
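For reference, a short sketch of what each modifier changes in Perl 5 (the sample string and variable names are made up for illustration): /s only affects what . matches, and /m only affects where ^ and $ match.

```perl
use strict;
use warnings;

my $text = "one\ntwo\n";

# /s: lets . match a newline. It does not touch ^ or $.
my $dot_plain = ($text =~ /one.two/)  ? 1 : 0;   # . refuses "\n"
my $dot_s     = ($text =~ /one.two/s) ? 1 : 0;   # . accepts "\n"

# /m: lets $ match before every newline, not just the final one.
my @ends_plain = $text =~ /(\w+)$/g;
my @ends_m     = $text =~ /(\w+)$/mg;

print "dot_plain=$dot_plain dot_s=$dot_s\n";
print "ends_plain=@ends_plain\n";
print "ends_m=@ends_m\n";
```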

Hugo



Re: RFC 166 (v3) Alternative lists and quoting of things

2000-09-28 Thread Hugo

In <[EMAIL PROTECTED]>, Perl6 RFC Librarian writes:
:The basic idea is to expand an array as a list of alternatives.  There
:are two possible syntaxes, (?@foo) and just plain @foo.  @foo might just have
:existing uses (just), therefore I prefer the (?@foo) syntax.

That needn't be a problem, that's why all RFCs have a migration section. :)

:(?@foo) is just syntactic sugar for (?:(??{ join('|',@foo) })) A bracketed
:list of alternatives.

Is this not constructed at regexp compile time? If so, it is more
like @{[ join('|',@foo) ]}.
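A sketch of that compile-time construction in today's Perl 5, with and without quoting. The array contents and the anchors are assumptions chosen only to make the difference visible:

```perl
use strict;
use warnings;

my @foo = ('a.b', 'c+d');

# Alternation built when the pattern is compiled; metacharacters
# in the elements stay live.
my $raw = qr/^(?:@{[ join '|', @foo ]})\z/;

# Same idea, but each element is quoted first, roughly what the
# proposed (?Q@foo) is described as doing.
my $alt    = join '|', map { quotemeta } @foo;
my $quoted = qr/^(?:$alt)\z/;

my $raw_axb    = ('axb' =~ $raw)    ? 1 : 0;   # "." matches "x"
my $quoted_axb = ('axb' =~ $quoted) ? 1 : 0;   # "." must be literal
my $quoted_lit = ('c+d' =~ $quoted) ? 1 : 0;   # "+" is literal here
my $raw_lit    = ('c+d' =~ $raw)    ? 1 : 0;   # "+" is a quantifier here

print "raw_axb=$raw_axb quoted_axb=$quoted_axb ",
      "quoted_lit=$quoted_lit raw_lit=$raw_lit\n";
```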

:Suggested syntax:
:
:(?Q$foo) Quotes the contents of the scalar $foo - equivalent to
:(??{ quotemeta $foo }).
:
:(?Q@foo) Quotes each item in a list (as above) this is equivalent to
:(?:(??{ join ('|', map quotemeta, @foo)})).
:
:In this syntax the Q is used as it represents a more intelligent \Quot\E.

I think it has been stated before that (?Q is reserved along with
other letters for possible regexp flags.

Hugo



Re: RFC 112 (v3) Asignment within a regex

2000-09-28 Thread Hugo

In <[EMAIL PROTECTED]>, Perl6 RFC Librarian writes:
:=head1 TITLE
:
:Asignment within a regex

This document could do with running through a spellchecker.

:Potentially the $foo could be any scalar LHS, as in (?$foo{$bar}= ... )!,
:likewise the '=' could be any asignment operator.

It isn't clear what the significance of the '!' is in that example.
It also isn't clear what parts of the expression are interpolated at
compile time; what should the following leave in %foo?

  %foo = ();
  $bar = "one";
  "twothree" =~ / (?$bar=two) (?$foo{$bar}=three) /x;
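The two possible readings can be spelled out in existing Perl 5, without the proposed (?$var=...) syntax. This is only a sketch of the ambiguity; the hash names %compile_time and %run_time are invented for the illustration:

```perl
use strict;
use warnings;

my $bar = "one";
my (%compile_time, %run_time);

# Reading 1: $foo{$bar} is resolved when the regex is compiled,
# so the key is whatever $bar held beforehand.
my $key_at_compile = $bar;

if ("twothree" =~ / (two) (three) /x) {
    my ($first, $second) = ($1, $2);

    $compile_time{$key_at_compile} = $second;

    # Reading 2: the first assignment takes effect before the
    # second one is evaluated, so the key is the fresh capture.
    $run_time{$first} = $second;
}

print "compile-time reading: ", join(',', %compile_time), "\n";
print "run-time reading:     ", join(',', %run_time), "\n";
```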

:=head2 Scoping
:
:The question of scoping for these assignments has been raised, but I don't
:currently have a feel for the "best" way to handle this.  Input welcome.

I think it should be defined to act the same as in (??{...}), whenever
we get around to defining that.

:=head1 IMPLENTATION
:
:Currently all $scalars in regexes are expanded before the main regex compiler
:gets to analyse the syntax.  This problem also affects several other RFCs
:(166 for example).  The expansion of variables in regexes needs for these
:(and other RFCs) to be driven from within the regex compiler so that the
:regex can expand as and where appropriate.  Changing this should not affect
:any existing behaviour.

That may not be necessary for this case; it may be enough just to tweak
the parser slightly, to detect '(?$' (and maybe '(?\$'). Don't forget
that the parser already successfully skips past '$' when we need it to.

Hugo



Re: RFC 332 (v1) Regex: Make /$/ equivalent to /\z/ under the '/s' modifier

2000-09-28 Thread Bart Lateur

On Thu, 28 Sep 2000 23:54:20 +0100, Hugo wrote:

>We thought of a few other possibilities too. I think it is a shame you
>did not mention them, and explain why your proposal is better.

Let me think on it.

Is $$ the only alternative, or did I miss more? I don't think I've even
seen this $$ mentioned before?

>I still like the idea of $$, as I described it in the original thread.
>I've seen no comments for or against at this time. 

I'll try to find that "thread" back.

>>Perhaps '$$' to mean match at end of string (without /m) or at end
>>of any line (with /m)? The p52p6 translator can easily replace
>>references to $$ with ${$}.

I can't see how yet another alternative, /$$/, is any better than what
we have now: /\z/.

>:=head2 '/ms': combined '/m' and '/s'
>:
>:'/ms' still works as before. Internally, '/m' has taken over the job  of
>:matching before a newline at the end of the string, simply because /$/m
>:can match before I<any> newline.
>
>Eh? Surely /$/ms would now only match _after_ the newline, or at end of
>string, whereas before it would match before _or_ after any newline, or
>at end of string?

Oh damned, you're probably right. This makes me wonder if this is doing
the right thing...

>This seems like a really bad idea. I think you have to assume people
>are feeding you the code they want to run. At worst you should
>generate a warning, but I think it is evil not to migrate things
>properly.

Well... there's a simple solution: replace /$/ with /\Z/. That one would
remain the same. Wouldn't it? I'll surely add that.

>I don't think changing /s is the right solution. I think this will
>incline people to try and fix their problems by adding /s, without
>realising that this changes the definition of every . in their
>regexp as well.

Perhaps. I do think that, in general, textual data falls into one of
three categories:

 * text with possibly embedded newlines
 * text with no embedded newlines
 * text with an irrelevant newline at the very end.

The '/s' option is for the 1st case. No '/s' for the 3rd. As for #2: you
don't care.

-- 
Bart.



Re: RFC 276 (v1) Localising Paren Counts in qr()s.

2000-09-28 Thread Hugo

In <[EMAIL PROTECTED]>, Perl6 RFC Librarian writes:
:MJD:
:Interpolated qr() items shouldn't be recompiled anyway.  They should
:be treated as subroutine calls.  Unfortunately, this requires a
:reentrant regex engine, which Perl doesn't have.  But I think it's the
:right way to go, and it would solve the backreference problem, as well
:as many other related problems.
[...]
:=head1 IMPLENTATION
:
:The Regex engine must be made re-entrant.
:
:The expansion of variables in regexes must be driven by the regex compiler
:(Same problem as for RFCs 112, 166 ...)

None of these are necessarily true - we could change the overloading
of the Regexp object instead. Currently we have:

  my $re = qr{pattern};
  print "$re";

.. giving 'pattern' by overloading stringification. If we overload it
instead to give '(??{ $re })' (or a moral equivalent) we have a nasty
hack, it is true, but it could allow us to defer the much trickier
proper solution. Of course it breaks every other use of the string
value, and I'm not sure how big a problem that might be.

Hugo



Re: RFC 316 (v1) Regex modifier for support of chunk processing and prefix matching

2000-09-28 Thread Hugo

In <[EMAIL PROTECTED]>, Perl6 RFC Librarian writes:
:In addition, pos() is set to the offset of the start of the recognized 
:match prefix. In case of a plain successful match, or of a normal 
:not-found termination, pos is undef() on exit.

That's not entirely true - it depends on the flags. It is always
true after a failed match though, which I think is enough for your
intended behaviour.

:This serves both as a flag, as pos will only be defined if the search 
:has been aborted for this reason, and it allows more optimized
:searching, 
:because after you have appended the next chunk to the current one, the 
:next try will simply start again at the position where the pattern may 
:first match, skipping any earlier matches.

Is that intended to be a feature of /z alone, or only in the presence
of /g? Perhaps you could add an extra example or two showing how you
might use /gcz or /z alone.
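For comparison, a sketch of how pos() behaves today with /g and /gc. The proposed /z modifier does not exist in Perl 5, so it is not shown; the string and names are illustrative:

```perl
use strict;
use warnings;

my $s = "aXbX";

# Successful scalar /g match: pos() is left just past the match.
my $hit       = $s =~ /X/g;
my $after_hit = pos($s);

# Failed /g match: pos() is reset to undef.
my $miss           = $s =~ /Z/g;
my $reset_on_miss  = defined(pos($s)) ? 0 : 1;

# With /c added, a failed match leaves pos() where it was.
pos($s) = 0;
$hit  = $s =~ /X/g;
$miss = $s =~ /Z/gc;
my $kept_on_miss = pos($s);

print "after_hit=$after_hit reset_on_miss=$reset_on_miss ",
      "kept_on_miss=$kept_on_miss\n";
```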

:I originally had thought of providing a separate, dedicated regex 
:modifier, just for the match prefix, but I don't think too many people 
:need this that desperately. You can easily build a working application 
:with just the '/z' modifier. If you can't, you're in over your head, 
:anyway.  ;-)

I don't understand this paragraph.

Hugo



Re: RFC 274 (v1) Generalised Additions to Regexs

2000-09-28 Thread Hugo

In <[EMAIL PROTECTED]>, "Richard Proctor" writes:
:> I'd be more inclined to have callbacks registered for a word: that
:> way we can complain earlier when two modules try to register the
:> same word. Then at regexp-compile time we parse out the word
:> following the (+ and immediately know who to pass it to (or fail).
:
:This is equally possible, my thoughts were to leave the syntax
:completely open so that anything could be added - words, chinese,
:$$$.  And leave it to the enhancements to recognise it or not.  I
:could add this as an alternative for V2.

Well, there are limits to what we can handle - earlier, the parser
will have had to be able to determine where the end of the regexp
is. Even specifying a word at the beginning doesn't help: we need
to know whether the rest should look like a regexp, or code, or
whatever else. The regexp compiler doesn't get a look in until
after that has been done.

Which suggests that maybe each callback - whether or not we link
them to words - should specify what it will match, which suggests
it should be linked with a regexp. And that brings us round to
this message from Larry:

  http://www.mail-archive.com/perl6-language%40perl.org/msg02955.html

.. which made me go all quivery when I read it. :}

:> :5) if an enhancement recognises the content it could do either of:
:> :
:> :a) return replacement expanded regex using existing capabilities
:> :perl will then pass this back through the regex compiler.
:>
:> Can we/should we detect (+...) loops? Or are you suggesting that the
:> returned string should not permit (+...) expansion?
:
:Should we detect? Probably not.  Should we allow? Definitely yes.  The
:only grounds for detection are to report infinite recursion.

Ok.

:> :  The referenced code needs to have enough access to the regex
:> :internals to be able to see the current sub-expression, request
:> :more characters, access to relevant flags and visibility of
:> :greediness.
:>
:> I don't see that this is a good idea; it makes more sense to me that
:> the coderef is treated exactly as if it had been compiled from (?{...}).
:
:Let's look at these one at a time:
:
:Access to subexpressions - ok this can be done.
:
:Visibility of flags - Not currently possible. The code might
:like to know that /i is in effect, it might want to know that /s is
:in effect, it probably does not need to know about /o.  This is equally
:true of the enhancement regex handler that looks at the (+foo) in the
:first place.  I think that these could be of use to (?{...}) code.
:
:Greediness - maybe not necessary, but I think better visibility of
:internals might be beneficial.

Hm, I do appreciate the problem - I wasn't too happy when I realised
that embedded qr{} expressions are protected from the flags of their
outer regexp, cos I wanted to specify /i on the outside and have it
trickle in to the rest. It feels like it's going to get real messy,
though, and totally screw the optimiser.
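The flag isolation is easy to see in a short sketch (Perl 5 behaviour; the strings and names are illustrative):

```perl
use strict;
use warnings;

my $inner   = qr/abc/;    # compiled without /i
my $inner_i = qr/abc/i;   # carries its own /i

# The outer /i applies to the literal x and y, but not to the
# embedded qr// object, which keeps the flags it was compiled with.
my $outer_only = ("XABCY" =~ /x${inner}y/i)   ? 1 : 0;
my $lower_mid  = ("XabcY" =~ /x${inner}y/i)   ? 1 : 0;
my $inner_flag = ("XABCY" =~ /x${inner_i}y/i) ? 1 : 0;

print "outer_only=$outer_only lower_mid=$lower_mid ",
      "inner_flag=$inner_flag\n";
```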

:
:>
:> :Following on, if (?{...}) etc code is evaluated
:> :in forward match, it would be a good idea to likewise support some
:> :code block that is ignored on a forward match but is executed when the
:> :code is unwound due to backtracking.
:>
:> The support in (?{...}) for localisation is (as I understand it) the
:> intended mechanism for permitting such effects. Can you describe some
:> specific problems you are trying to solve here?
:
:Is localisation enough?

Enough to achieve everything you might want to? Yes: you can always
have a (?{ local $a = new Object }) with a DESTROY method. It may not
necessarily be the cleanest possible way to write everything, though.

Hugo



Re: RFC 332 (v1) Regex: Make /$/ equivalent to /\z/ under the '/s' modifier

2000-09-28 Thread Hugo

In <[EMAIL PROTECTED]>, Perl6 RFC Librarian writes:
:Originally, we had thought of adding Yet Another Regex Modifier; but to
:be honest, having 2 modifiers just for the newline is already confusing
:enough, for too many people. A third is definitely out.

We thought of a few other possibilities too. I think it is a shame you
did not mention them, and explain why your proposal is better.

I still like the idea of $$, as I described it in the original thread.
I've seen no comments for or against at this time. To recap:

>Perhaps '$$' to mean match at end of string (without /m) or at end
>of any line (with /m)? The p52p6 translator can easily replace
>references to $$ with ${$}.

:=head2 The $* variable
:
:'/s' and '/m' also have a lesser known side effect: they both override
:the setting of the $* special variable, which controls multiline related
:behaviour in regexes.
:
:Use of this special variable has already been deprecated at least since
:Perl5 first came out, more than 5 years ago. It is a very good candidate
:to be removed from Perl6 altogether, which would result in fewer
:gotchas in the language. That is a Good Thing.

Has there not been an RFC to remove this yet? If not I'll write one.
(Or if someone else has more spare time on their hands and wants to
do it, please let me know.)

:=head2 Getting the old behaviour back
:
:You can't. Question is: do you really want to?
:
:=over 2
:
:=item *
:
:If you know your data can contain newlines, and you want to treat them
:as ordinary characters, you probably don't want to make an exception for
:a trailing newline, anyway.

So you _can_ recreate the original behaviour. Why did you just say you
can't?

:=head2 '/ms': combined '/m' and '/s'
:
:'/ms' still works as before. Internally, '/m' has taken over the job  of
:matching before a newline at the end of the string, simply because /$/m
:can match before I<any> newline.

Eh? Surely /$/ms would now only match _after_ the newline, or at end of
string, whereas before it would match before _or_ after any newline, or
at end of string?

:=head1 MIGRATION
:
:It's not unlikely that currently having /$/ in your regexes, is actually
:a bug in your script, but you don't care because the data won't ever
:make it visible.
:
:Therefore, I think it is not desirable to have the Perl5 To Perl6
:converter actually change your source code. A warning if /$/ is found in
:combination with a bare '/s' modifier, not combined with '/m', is
:probably all that is wanted.

This seems like a really bad idea. I think you have to assume people
are feeding you the code they want to run. At worst you should
generate a warning, but I think it is evil not to migrate things
properly.

I don't think changing /s is the right solution. I think this will
incline people to try and fix their problems by adding /s, without
realising that this changes the definition of every . in their
regexp as well. I like the idea of $$ better - this is a natural
and obvious extension to $, which adds a new capability without
messing with any existing capability. Furthermore people who find
that they have a problem in their existing regexp because $ does
not mean what they thought will not set themselves up for new and
different problems when they apply the obvious one-byte fix.

Hugo



Re: RFC 331 (v1) Consolidate the $1 and C<\1> notations

2000-09-28 Thread Hugo

:=item *
:/(foo)_$1_bar/
:
:=item *
:/(foo)_C<\1>_bar/

Please don't do this: write C</(foo)_\1_bar/> or /(foo)_\1_bar/, but
don't insert C<> in the middle: that makes it much more difficult to
read.

:mean different things:  the second will match 'foo_foo_bar', while the
:first will match 'foo[SOMETHING]bar' where [SOMETHING] is whatever was

should be: foo_[SOMETHING]_bar

:captured in the B<previous> match...which could be a long, long way away,
:possibly even in some module that you didn't even realize you were
:including (because it was included by a module that was included by a
:module that was included by a...). 

This seems a bit unfair. It is just another variable. Any variable
you include in a pattern, you are assumed to know that it contains
the intended value - there is nothing special about $1 in this regard.

:The key fact here is that, in the first section of a s/// you are supposed
:to use C<\1>, but in the second portion you are supposed to use $1.  If
:you understand the whole logical structure behind it and understand how an
:s/// works (i.e., the right hand side of an s/// is a double-quoted
:string, not a regex), you will understand the distinction.  For newbies,
:however, it is apt to be quite confusing.

I think the whole idea that the LHS of s/// is a pattern, but the
RHS is a string (modulo /e, of course) is apt to be confusing when
you first encounter it. You won't be able to make sense of any but
the simplest use of s/// until you understand it, I think, and the
documentation expresses it quite clearly.

:=item *
:${P1} means what $1 currently means (first match in last regex)

Do you understand that this is the same variable as $P1? Traditionally,
perl very rarely coopts variable names that start with alphanumerics,
and (off the top of my head) all the ones it does so coopt are letters
only (ARGV, AUTOLOAD, STDOUT etc). I think we need better reasons to
extend that to all $P1-style variables.

If you are suggesting that they should have a special meaning only
in regexps, and only if braced, then I'd find it even more confusing.
The use of braces is usually the easiest (and only?) way to split
out a variable from following alphanumerics:
  /foo${P1}bar/

:These changes eliminate a potential source of confusion, retain all
:functionality, provide an easy migration path for P526, and the last
:notation (${P1}) serves as a clear indicator that you are talking about
:something from outside the current regex.

What is the migration path for existing uses of $P1-style variables?

:=item *
:s/(bar)(bell)/${P1}$2/ # changes "barbell" to "foobell"

Note that in the current regexp engine, ${P1} has disappeared by the
time matching starts. Can you explain why we need to change this?
Note also that if you are sticking with ${P1} either we need to
rename all existing user variables of this form, or we can no longer
use the existing 'interpolate this string' (or eval, double-eval etc)
routines, and have to roll our own for this (these) as well.

:=head1 IMPLEMENTATION
:
:This may require significant changes to the regex engine, which is a topic
:on which I am not qualified to speak.  Could someone with more
:knowledge/experience please chime in?

Currently the regexp compiler is handed a string in which $variables
have already been interpolated. We'd need to avoid that and get either
the raw data for the string or some list that has undergone a
minimum of preparation. It is possible we need that anyway - it is
a prerequisite for some of the other proposed enhancements (such as
the meta-referred-to RFC 112) and would certainly make the regexp
engine more flexible - but it is certainly substantial work. I don't
know what gotchas may arise. In general it seems a shame to recreate
large parts of the existing string parsing/interpolation code, but
it may not be possible to avoid it.

Changing the lifetime of backreferences feels likely to be difficult,
but it isn't clear to me what you are trying to achieve here. I think
you at least need to add an example of how it would act under s///g
and s///ge.

:=head1 REFERENCES
:
:RFC 112: Assignment within a regex
:
:RFC 276: Localising Paren Counts in qr()s.

I didn't see a mention of these in the body of the proposal.

To me, the prime issue is with \1. The backslash is heavily overloaded
in perl, and that makes it difficult to suggest a consistent and
legible extension that would allow us to refer back to either variables
(RFC 112) or hash keys (RFC 150). I don't think switching to $1 is any
help for those, though.

Hugo



Re: is \1 vs $1 a necessary distinction?

2000-09-28 Thread Bart Lateur

On Wed, 27 Sep 2000 10:34:48 -0500, Jonathan Scott Duff wrote:

>If $1 could be made to work properly on the LHS of s///, I'd vote for
>that being The Way.

I disagree, because \1 is different from a variable $foo in at least two
ways:

 * $foo is compiled into /$foo/ before anything is matched. \1 is a
repetition of what was just matched; this is dynamic interpolation
instead of static.

 * if $foo contains metacharacters, they are treated as metacharacters.
for example, if $foo is "a.b", then /$foo/ can match "axb". /\1/, OTOH,
can only match the LITERAL string that $1 captured. With $foo='a.b', 

/($foo)!$foo/

and

/($foo)!\1/

will not match the same set of things.

"\1" is more like equivalent to "\Q$1\E". Therefore, I don't want $1 on
the LHS to be the standard syntax.
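A sketch of the second point, using the same $foo as above and two made-up subject strings:

```perl
use strict;
use warnings;

my $foo = 'a.b';

# /($foo)!$foo/ compiles to /(a.b)!a.b/: both halves are patterns,
# so each "." may match a different character.
my $interp_diff = ('axb!ayb' =~ /($foo)!$foo/) ? 1 : 0;

# /($foo)!\1/ demands a literal repeat of whatever group 1 matched.
my $backref_diff = ('axb!ayb' =~ /($foo)!\1/) ? 1 : 0;
my $backref_same = ('axb!axb' =~ /($foo)!\1/) ? 1 : 0;

print "interp_diff=$interp_diff backref_diff=$backref_diff ",
      "backref_same=$backref_same\n";
```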

-- 
Bart.



Re: RFC 331 (v1) Consolidate the $1 and C<\1> notations

2000-09-28 Thread Jonathan Scott Duff

On Thu, Sep 28, 2000 at 08:57:39PM -, Perl6 RFC Librarian wrote:
> ${P1} means what $1 currently means (first match in last regex)

I'm sorry that I don't have anything more constructive to say than
"ick", but ... Ick.

Well, maybe I do.   Forget $P1.  If the user wanted $1 from the
previous RE, then they should have saved it somewhere.  This would
eliminate the "major" RE-engine changes to make $P1 work.  But it
would require that the p52p6 translator make some really smart
modifications.

-Scott
-- 
Jonathan Scott Duff
[EMAIL PROTECTED]



RFC 332 (v1) Regex: Make /$/ equivalent to /\z/ under the '/s' modifier

2000-09-28 Thread Perl6 RFC Librarian

This and other RFCs are available on the web at
  http://dev.perl.org/rfc/

=head1 TITLE

Regex: Make /$/ equivalent to /\z/ under the '/s' modifier

=head1 VERSION

  Maintainer: Bart Lateur <[EMAIL PROTECTED]>
  Date: 28 Sep 2000
  Mailing List: [EMAIL PROTECTED]
  Number: 332
  Version: 1
  Status: Developing

=head1 ABSTRACT

To most Perlers, /$/ in a regex simply means "end of string". This is
only right, if you're absolutely sure your string doesn't end in a
newline, as is commonly the case in a large part of all textual data:
ordinary strings don't contain newlines. Lines coming from text files
can only contain a newline as the very last character. The '/s' modifier
is usually only used in combination with the former class of textual
data.

However, this situation is basically a bug hole.

This RFC proposes to change the '/s' modifier so that under '/s', /$/
will only match at the very end of a string, and not also before a
newline at the end of the string.

=head1 DESCRIPTION

To most Perl programmers, /^foo$/ is a regex that can only match the
string "foo". It's not, actually: it can match "foo\n", too. This
assumption is usually safe, because people know the kind of data they're
dealing with, and they "know" that it won't ever end in a newline.

However, this basically is a chance for bugs to creep in, if for some
reason this assumption about the data no longer holds.

To make matters worse, Perl doesn't even have a mechanism to prevent the
regex engine from matching /$/ at just before the last character if it's
a newline.

Originally, we had thought of adding Yet Another Regex Modifier; but to
be honest, having 2 modifiers just for the newline is already confusing
enough, for too many people. A third is definitely out.

Therefore, the proposal is instead to modify the behaviour of the '/s'
modifier.

Under '/s':

=over 2

=item *

/./ can match any character, including newline;

=item *

/$/ can match only at the very end of the string, not also in front of a
last character, if it happens to be a newline.

=back

This seems simple enough.

=head1 CONSIDERATIONS

=head2 Mnemonic value of '/s'

'/s' originally stood for "single line". This can no longer be true, the
mnemonic value of the "s" is thereby reduced to zero.

However, the mnemonic value wasn't that great to begin with, especially
if you consider that combining '/s' and '/m' is not only possible, but a
useful option, too. How can a string both be a single line and
multiline, at the same time?

So, to most Perl programmers, '/s' simply stands for

=over 2

=item

let /./ match a newline too

=back

which now gets turned into:

=over 2

=item

treat "\n" as an ordinary character

=back

The change isn't that big, so it is just as easy to remember. Or not.

=head2 The $* variable

'/s' and '/m' also have a lesser known side effect: they both override
the setting of the $* special variable, which controls multiline related
behaviour in regexes.

Use of this special variable has already been deprecated at least since
Perl5 first came out, more than 5 years ago. It is a very good candidate
to be removed from Perl6 altogether, which would result in fewer
gotchas in the language. That is a Good Thing.

Perlvar says:

Use of `$*' is deprecated in modern Perl, supplanted by the `/s'
and `/m' modifiers on pattern matching.


Therefore, any changing behaviour of '/s', with regards to $*, can
nowadays hardly be considered relevant, any more.

=head2 Getting the old behaviour back

You can't. Question is: do you really want to?

=over 2

=item *

If you know your data can contain newlines, and you want to treat them
as ordinary characters, you probably don't want to make an exception for
a trailing newline, anyway.

=item *

If you still want to ignore a trailing newline in the regex, you can
either adjust your regex so that it contains /\n?$/ or something like
it, instead of plain /$/; or you can chomp() your data, before doing the
match.

=item * 

And finally, there's still the option for simply not using '/s', and all
things will remain as they were before.  ;-)

=back
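The first two workarounds can be sketched in current Perl 5 by writing \z where the proposal would have /$/ under '/s' (an assumption made so that the example runs today; variable names are illustrative):

```perl
use strict;
use warnings;

# Option 1: allow an optional trailing newline explicitly.
my $optional_nl = qr/^foo\n?\z/;
my $plain  = ("foo"     =~ $optional_nl) ? 1 : 0;
my $one_nl = ("foo\n"   =~ $optional_nl) ? 1 : 0;
my $two_nl = ("foo\n\n" =~ $optional_nl) ? 1 : 0;

# Option 2: chomp() the data first, then match strictly.
my $line = "foo\n";
chomp $line;
my $chomped = ($line =~ /^foo\z/) ? 1 : 0;

print "plain=$plain one_nl=$one_nl two_nl=$two_nl chomped=$chomped\n";
```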

=head2 '/ms': combined '/m' and '/s'

'/ms' still works as before. Internally, '/m' has taken over the job  of
matching before a newline at the end of the string, simply because /$/m
can match before I<any> newline.

=head1 MIGRATION

It's not unlikely that currently having /$/ in your regexes, is actually
a bug in your script, but you don't care because the data won't ever
make it visible.

Therefore, I think it is not desirable to have the Perl5 To Perl6
converter actually change your source code. A warning if /$/ is found in
combination with a bare '/s' modifier, not combined with '/m', is
probably all that is wanted.

=head1 IMPLEMENTATION

Under '/s', make '$' behave as /\z/ does now.
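For reference, a sketch of how the three end anchors behave in current Perl 5, which is the difference this change would fold into '/s' (variable names are illustrative):

```perl
use strict;
use warnings;

my $with_nl = "foo\n";

# $ and \Z both accept a position just before a final newline.
my $dollar_nl = ($with_nl =~ /^foo$/)  ? 1 : 0;
my $big_z_nl  = ($with_nl =~ /^foo\Z/) ? 1 : 0;

# \z only accepts the very end of the string.
my $small_z_nl    = ($with_nl =~ /^foo\z/) ? 1 : 0;
my $small_z_plain = ("foo"    =~ /^foo\z/) ? 1 : 0;

print "dollar_nl=$dollar_nl big_z_nl=$big_z_nl ",
      "small_z_nl=$small_z_nl small_z_plain=$small_z_plain\n";
```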

=head1 REFERENCES

perlre, about '/s' and '/m'

perlvar, section about $*





RFC 331 (v1) Consolidate the $1 and C<\1> notations

2000-09-28 Thread Perl6 RFC Librarian

This and other RFCs are available on the web at
  http://dev.perl.org/rfc/

=head1 TITLE

Consolidate the $1 and C<\1> notations

=head1 VERSION

  Maintainer: David Storrs <[EMAIL PROTECTED]>
  Date: 28 Sep 2000
  Mailing List: [EMAIL PROTECTED]
  Number:  331
  Version: 1
  Status: Developing

=head1 ABSTRACT

Currently, C<\1> and $1 have only slightly different meanings within a
regex.  Let's consolidate them together, eliminate the differences, and
settle on $1 as the standard.

=head1 DESCRIPTION

Note:  For convenience, I am going to talk about C<\1> and $1 in this RFC.
In actuality, these notations extend indefinitely:  C<\1..\n> and
C<$1..$n>.  Take it as read that anything which applies to $1 also applies
to C<$2, $3>, etc.


In current versions of Perl, C<\1> means "whatever was matched by the
first set of grouping parens I<in this regex>."  $1 means "whatever
was matched by the first set of grouping parens I<in the previous regex>."  For example:

=over 4

=item *
/(foo)_$1_bar/

=item *
/(foo)_C<\1>_bar/

=back

mean different things:  the second will match 'foo_foo_bar', while the
first will match 'foo[SOMETHING]bar' where [SOMETHING] is whatever was
captured in the B<previous> match...which could be a long, long way away,
possibly even in some module that you didn't even realize you were
including (because it was included by a module that was included by a
module that was included by a...). 

Probably the primary reason for this distinction is the following:

=over 4

=item *
s/(foo)C<\1>/$1bar/ # changes "foofoo" to "foobar"

=back

The key fact here is that, in the first section of a s/// you are supposed
to use C<\1>, but in the second portion you are supposed to use $1.  If
you understand the whole logical structure behind it and understand how an
s/// works (i.e., the right hand side of an s/// is a double-quoted
string, not a regex), you will understand the distinction.  For newbies,
however, it is apt to be quite confusing.
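A sketch of the current Perl 5 behaviour described above. The strings are made up for the illustration, and ${1} is written with braces only to keep the variable name visually separate from the text that follows it:

```perl
use strict;
use warnings;

# In s///, \1 is used in the pattern and $1 in the replacement.
(my $text = "foofoo") =~ s/(foo)\1/${1}bar/;

# An earlier, unrelated match sets $1 to "x".
"prefix_x" =~ /_(\w)\z/ or die "setup match failed";

# $1 inside a pattern is interpolated before matching starts, so
# this is really /(foo)_x_bar/ and does not see the new group 1.
my $uses_previous = ("foo_foo_bar" =~ /(foo)_${1}_bar/) ? 1 : 0;

# \1 refers to group 1 of the match in progress.
my $uses_current = ("foo_foo_bar" =~ /(foo)_\1_bar/) ? 1 : 0;

print "text=$text uses_previous=$uses_previous ",
      "uses_current=$uses_current\n";
```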

Aside from this confusion is the fact that, in general, when you use a
backreference you want it to refer to something that you just
matched...i.e., something from this regex.

To resolve all these issues, let's remove the C<\1> notation and
consolidate meanings as follows:

=over 4

=item *
C<\1> goes away as a special form 

=item *
$1 means what C<\1> currently means (first match in this regex)

=item *
${1} is the same as $1 (first match in this regex)

=item *
${P1} means what $1 currently means (first match in last regex)

=back

These changes eliminate a potential source of confusion, retain all
functionality, and provide an easy migration path for P526.  The last
notation (${P1}) also serves as a clear indicator that you are talking
about something from outside the current regex.

Using this new syntax, you could then write:

=over 4

=item *
s/(foo)$1/$1bar/        # changes "foofoo" to "foobar"

=item *
s/(bar)(bell)/${P1}$2/  # changes "barbell" to "foobell"

=back
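
For comparison, the effect of the ${P1} example can be approximated in today's Perl 5 by copying the earlier capture into a named variable before the next match. This is only a sketch: the variable $prev1 is a hypothetical stand-in for the proposed ${P1}, not part of the proposal.

```perl
use strict;
use warnings;

# An earlier match captures 'foo'.
'foo' =~ /(foo)/ or die "setup match failed";
my $prev1 = $1;   # stand-in for the proposed ${P1}

my $word = 'barbell';

# $prev1 comes from the earlier match; $2 comes from this one.
$word =~ s/(bar)(bell)/$prev1$2/;

print "$word\n";   # prints "foobell"
```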

=head2 Updating $1...When should it happen?

After a regex is finished, it must update the ${Pn} variables so that the
next match can access them if desired.  (If we wanted to get really
pathological, we could have multidimensional access such as ${P2,2},
which is the second capture from the second-to-most-recent regex.  This
would seem to be a Bad Idea, however.)  The update should not happen until
after the statement containing the regex is finished, so that the $1
variables on the right hand side of an s/// will still refer to the
correct things.

=head1 IMPLEMENTATION

This may require significant changes to the regex engine, which is a topic
on which I am not qualified to speak.  Could someone with more
knowledge/experience please chime in?

=head1 REFERENCES

RFC 112: Assignment within a regex

RFC 276: Localising Paren Counts in qr()s.

perlre manpage






Re: RFC 308 (v1) Ban Perl hooks into regexes

2000-09-28 Thread Hugo

In <[EMAIL PROTECTED]>, Tom Christiansen writes:
:>I consider recursive regexps very useful:
:>
:> $a = qr{ (?> [^()]+ ) | \( (??{ $a }) \) };
:
:Yes, they're "useful", but darned tricky sometimes, and in
:ways other than simple regex-related stuff.  For example,
:consider what happens if you do
:
:my $regex = qr{ (?> [^()]+ ) | \( (??{ $regex }) \) };
:
:That doesn't work due to differing scopings on either side
:of the assignment.

Yes, this is a problem. But it bites people in other situations
as well:
  my $fib = sub { $_[0] < 2 ? 1 : &$fib($_[0] - 1) };

I haven't kept up with the non-regexp RFCs, but I hope someone
has suggested an alternative scoping that would permit these
cases to refer to the just-introduced variable. Perhaps we
should special-case qr{} and sub{} - I can't offhand think of
another area that suffers from this, and I don't think these
two areas would suffer from an inability to refer to the same-name
variable in an outlying scope.
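
For reference, the usual workaround in current Perl 5 is to declare the variable in one statement and assign to it in the next, so that the body can see it. The sketch below uses illustrative names ($fact, $balanced) and a slightly extended pattern; it is not the code quoted above.

```perl
use strict;
use warnings;

# Declare first, assign second: the closure body can now see $fact.
my $fact;
$fact = sub { $_[0] < 2 ? 1 : $_[0] * $fact->($_[0] - 1) };

# The same two-step idiom for a self-referential pattern that
# matches text with balanced parentheses.
my $balanced;
$balanced = qr{ (?: (?> [^()]+ ) | \( (??{ $balanced }) \) )* }x;

my $fact5 = $fact->(5);
my $ok    = 'a(b(c)d)e' =~ /\A$balanced\z/ ? 1 : 0;
my $bad   = 'a(b(c)d'   =~ /\A$balanced\z/ ? 1 : 0;

print "fact(5) = $fact5\n";            # prints "fact(5) = 120"
print "balanced: $ok, unbalanced: $bad\n";
```

Both variables end up referring to themselves, so the sub and the pattern stay alive for as long as the program runs; that is harmless in a short script but worth knowing about in long-lived code.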

A useful alternative might be a different special case. Plucking
random grammar, perhaps:
  my $regex = qr{ (?> [^()]+ ) | \( ^^ \) }x;

Certainly I think a simple self-reference is likely to be a
common enough use that it would help to avoid the full deferred
eval infrastructure, even when it works properly.

:And clearly a non-regex approach could be more legible for
:recursive parsing.

Like any aspect of programming, if you use it regularly it will
become easier to read. And comments are a wonderful thing.

Hugo



Re: RFC 308 (v1) Ban Perl hooks into regexes

2000-09-28 Thread Tom Christiansen

>I consider recursive regexps very useful:
>
> $a = qr{ (?> [^()]+ ) | \( (??{ $a }) \) };

Yes, they're "useful", but darned tricky sometimes, and in
ways other than simple regex-related stuff.  For example,
consider what happens if you do

my $regex = qr{ (?> [^()]+ ) | \( (??{ $regex }) \) };

That doesn't work due to differing scopings on either side
of the assignment.  And clearly a non-regex approach could
be more legible for recursive parsing.

--tom





Re: is \1 vs $1 a necessary distinction?

2000-09-28 Thread Piers Cawley

Dave Storrs <[EMAIL PROTECTED]> writes:

> On 27 Sep 2000, Piers Cawley wrote:
> 
> > >   Do we *want* to maintain \1?  Why have two notations to do the
> > 
> > I'm kind of curious about what happens when you want to do, say:
> > 
> >   if (m/(\S+)/) {
> >  $reg = qr{<(em|i|b)>($1)};
> >   }
> > 
> > where the $1 in the regex quote is referring to $1 from the previous
> > regex match.
> 
>   Well, how about this:
> 
>   $reg = qr{<(em|i|b)>(${P1})};
> NOTE:                    ^
> 
>   If you assume that $1 and ${1} are equivalent (which makes it
> possible to have as many backrefs as you want), then you could say that,
> if the first character after the { is a P, it means "in the previous regex
> match."

Oh good ghod. That is *vile*.

-- 
Piers