Re: regex and

2010-08-10 Thread Eirik Berg Hanssen

On Tue, Aug 10, 2010 at 9:00 PM,  wrote:

> Once the & operator is in rakudo, though... I gather I /could/ do something
> like the following
>
> ^ [ * &  ]  $
>
> And this would in effect ensure that the sequence "abc" doesn't exist
> anywhere across the match for 
>
>
> Is this correct?
>

  Not quite, I suspect – the  is still zero-width, so unless
quantified zero-width assertions are DWIMmier than what's healthy, this is
likely still equivalent to ^$.  I think the following should DWYW, though:

  ^ [ [  . ] * &  ] $

  ... though perhaps there is a shorter way to write [  . ]?  Feels
like there should be one ...

Eirik
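Until `&` is available, the idea in this reply can be tried out in any engine with lookahead. What follows is only a sketch in Python's `re` module (a stand-in, not Perl 6): the lookahead walks the whole string one character at a time and refuses any position where "abc" starts, and a second sub-pattern must then consume the same text. `NAME` and `is_name_without_abc` are hypothetical names invented for this illustration.

```python
import re

# Hypothetical multi-character sub-pattern standing in for a `name` token.
NAME = r"[A-Za-z]+(?:-[A-Za-z]+)*"

# The lookahead checks that no position up to the end of the string starts
# "abc"; NAME then has to consume that very same span (enforced by fullmatch).
NO_ABC_NAME = re.compile(r"(?=(?:(?!abc).)*\Z)" + NAME)

def is_name_without_abc(text):
    """True if text is a NAME and contains no "abc" anywhere."""
    return NO_ABC_NAME.fullmatch(text) is not None

if __name__ == "__main__":
    for sample in ("foo-bar", "fooabc-bar", "ab-c"):
        print(sample, is_name_without_abc(sample))
```

Because the lookahead is zero-width, it constrains the text without dictating how `NAME` divides it up, which is why it still works when the sub-pattern is longer than one character.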

RE: regex and

2010-08-10 Thread philippe.beauchamp

Back to your original advice...


> If you want to match an alphabetic string which does not include 'abc'
> anywhere, you can write this as
>
> ^ [   ]* $


I presume this only works here because  is one character... if instead 
of  I used anything more complicated 

(for example)

token name
{
<[A..Z]>*
}


And then tried to do ^ [   ]* $
This wouldn't work since there's a wildcard within name.



Once the & operator is in rakudo, though... I gather I /could/ do something 
like the following

^ [ * &  ]  $

And this would in effect ensure that the sequence "abc" doesn't exist anywhere 
across the match for  


Is this correct?

Re: regex and

2010-08-10 Thread Moritz Lenz

philippe.beauch...@bell.ca wrote:
> On the & operator... are you saying that it would operate basically as 
> expected... 
> allowing sets of rules to be and'ed rather than or'ed with the | ?

Yes, with the limitation that both parts separated by & have to match
the same length of string, so that for example

^ [ a+ & . ** 3 ]

could only match exactly 3 a's. If you don't want them tied to the
same length, use look-ahead assertions instead.

Cheers,
Moritz
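The same-length restriction can be modelled outside a regex engine. A rough sketch in Python, assuming nothing about how Rakudo will implement `&`: `conj_prefix` is a made-up helper that accepts a prefix only when both patterns match exactly that prefix, while `lookahead_prefix` shows the look-ahead form, where the two lengths are independent.

```python
import re

def conj_prefix(pattern_a, pattern_b, text):
    """Longest prefix of text that BOTH patterns match in full, else None."""
    for end in range(len(text), -1, -1):
        prefix = text[:end]
        if re.fullmatch(pattern_a, prefix) and re.fullmatch(pattern_b, prefix):
            return prefix
    return None

def lookahead_prefix(text):
    """Look-ahead form: `a+` only has to succeed at the start, any length."""
    m = re.match(r"(?=a+).{3}", text)
    return m.group() if m else None

if __name__ == "__main__":
    print(conj_prefix(r"a+", r".{3}", "aaaaa"))  # both must cover the same span
    print(conj_prefix(r"a+", r".{3}", "aab"))
    print(lookahead_prefix("aab"))
```

With the conjunction model, "aab" is rejected because no single prefix satisfies both `a+` and "any three characters"; with the look-ahead the first three characters are accepted as soon as one "a" is seen.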

RE: regex and

2010-08-10 Thread philippe.beauchamp

Great! That does it. Thanks. :)
I realized my error on the anchors after sending... but didn't think of the * 
on the grouping. 

On the & operator... are you saying that it would operate basically as 
expected... allowing sets of rules to be and'ed rather than or'ed with the | ?

--- Phil

-Original Message-
From: Moritz Lenz [mailto:mor...@faui2k3.org] 
Sent: August 10, 2010 2:09 PM
To: Beauchamp, Philippe (6009210)
Cc: perl6-language@perl.org
Subject: Re: regex and

Hi,

philippe.beauch...@bell.ca wrote:
> rule TOP
> {
> ^
> [
> & *
> & 
> ]
> $
> }

The & syntax is specced, but it's not yet implemented in Rakudo.

But note that  is a zero-width assertion, so your example regex
matches at the start of a string, if it does not begin with 'abc'.

Since you anchor it to the end of string too, it can only ever match the
empty string.

You can achieve the same with just ^$.

If you want to match an alphabetic string which does not include 'abc'
anywhere, you can write this as

^ [   ]* $

Cheers,
Moritz

Re: regex and

2010-08-10 Thread Moritz Lenz

Hi,

philippe.beauch...@bell.ca wrote:
> rule TOP
> {
> ^
> [
> & *
> & 
> ]
> $
> }

The & syntax is specced, but it's not yet implemented in Rakudo.

But note that  is a zero-width assertion, so your example regex
matches at the start of a string, if it does not begin with 'abc'.

Since you anchor it to the end of string too, it can only ever match the
empty string.

You can achieve the same with just ^$.

If you want to match an alphabetic string which does not include 'abc'
anywhere, you can write this as

^ [   ]* $

Cheers,
Moritz
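The difference between a lone zero-width assertion and one that is repeated together with a consuming atom can be seen in a Perl-5-style engine too. A sketch in Python's `re`, with `[A-Za-z]` as an ASCII-only approximation of an alphabetic class (an assumption made for brevity; the pattern names are invented here):

```python
import re

# A negative lookahead between two anchors consumes nothing,
# so the only string it can match is the empty one.
ONLY_EMPTY = re.compile(r"\A(?!abc)\Z")

# Repeating "not at the start of abc, then one letter" checks every position.
NO_ABC_ALPHA = re.compile(r"\A(?:(?!abc)[A-Za-z])*\Z")

def only_empty(text):
    return ONLY_EMPTY.match(text) is not None

def alpha_without_abc(text):
    return NO_ABC_ALPHA.match(text) is not None

if __name__ == "__main__":
    for sample in ("", "x", "abxc", "xabcx"):
        print(repr(sample), only_empty(sample), alpha_without_abc(sample))
```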

Re: regex and xml/html/*ml

2002-06-05 Thread Michel Rodriguez


On Wed, 5 Jun 2002 [EMAIL PROTECTED] wrote:

> Just read (skimmed) apocalypse 5, had one concern - it looks like we are on a
> serious collision course with parsing the various *mls.
> 
> before:
> 
> m#..etc#
> 
> after
> 
> m#\\\#
> 
> Also, the space being backslashed sort of bugs me. Surely there is going to be
> a 'non-x' modifier? And perhaps a modifier to change the character for logical
> tags from <> to something else (like <<>>, perhaps?)

Hey, if that makes people more reluctant to use regexes  to parse HTML or 
XML and leads them to use real parsers then this could be construed as a 
feature ;--)

Michel Rodriguez
Perl & XML
http://www.xmltwig.com

RE: regex and xml/html/*ml

2002-06-05 Thread Erik Steven Harrison


 
--

On Wed, 5 Jun 2002 13:21:39   
 Brent Dax wrote:
>[EMAIL PROTECTED]:
># Just read (skimmed) apocalypse 5, had one concern - it looks 
># like we are on a serious collision course with parsing the 
># various *mls.
># 
># before:
># 
># m#..etc#
># 
># after
># 
># m#\\\#
>
>That's intentional.  What will that regex do with this?
>
>   
>
>That's interpreted the same way, but typed a bit differently.  It won't
>match your regex.
>
>The moral of the story is that you should not try to parse the *MLs with
>regexen--use modules instead.
>
>--Brent Dax <[EMAIL PROTECTED]>
>@roles=map {"Parrot $_"} qw(embedding regexen Configure)
>
>Early in the series, Patrick Stewart came up to us and asked how warp
>drive worked.  We explained some of the hypothetical principles . . .
>"Nonsense," Patrick declared.  "All you have to do is say, 'Engage.'"
>--Star Trek: The Next Generation Technical Manual
>
>



RE: regex and xml/html/*ml

2002-06-05 Thread Brent Dax


[EMAIL PROTECTED]:
# Just read (skimmed) apocalypse 5, had one concern - it looks 
# like we are on a serious collision course with parsing the 
# various *mls.
# 
# before:
# 
# m#..etc#
# 
# after
# 
# m#\\\#

That's intentional.  What will that regex do with this?



That's interpreted the same way, but typed a bit differently.  It won't
match your regex.

The moral of the story is that you should not try to parse the *MLs with
regexen--use modules instead.
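To make the "use modules instead" advice concrete, here is a small sketch using Python's standard `html.parser` (chosen only because it is easy to run; the thread itself names no particular module). `StartTagCollector` and `start_tags` are hypothetical names. Two start tags that are typed differently come out as the same parsed result.

```python
from html.parser import HTMLParser

class StartTagCollector(HTMLParser):
    """Collect (tag name, attribute dict) for every start tag seen."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases names and strips the quotes from values.
        self.tags.append((tag, dict(attrs)))

def start_tags(markup):
    collector = StartTagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags

if __name__ == "__main__":
    print(start_tags('<a href="x">'))
    print(start_tags("<A\n   HREF='x' >"))
```

A hand-written regex would need to anticipate the case, quoting and line-break variations one by one; the parser already does.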

--Brent Dax <[EMAIL PROTECTED]>
@roles=map {"Parrot $_"} qw(embedding regexen Configure)

Early in the series, Patrick Stewart came up to us and asked how warp
drive worked.  We explained some of the hypothetical principles . . .
"Nonsense," Patrick declared.  "All you have to do is say, 'Engage.'"
--Star Trek: The Next Generation Technical Manual

Re: Regex and Matched Delimiters

2002-04-24 Thread Rafael Garcia-Suarez


Michael G Schwern wrote in perl.perl6.language :
> On Tue, Apr 23, 2002 at 11:11:28PM -0500, Me wrote:
>> Third, I was thinking that having perl 6 regexen have /s on
>> by default would be easy for perl 5 coders to understand;
>> not too hard to get used to; and have no negative effects
>> for existing coders beyond getting used to the change.
> 
> I'm jumping in the middle of a conversation here, but consider the
> problem of .* matching newlines by default and greediness.
> 
>  /(foo.*)$/,  /(foo.*)$/m  and  /(foo.*)$/s

This is so old-fashioned.

> when matching against something like "foo\nwiffle\nbarfoo\n" One matches the
> last line.  One matches the first line.  And one matches all three lines.

And by the way, there's the semantic inaccuracy of $ matching
transparently newlines, combined with the obscure variants \z and \Z.
This needs (IMHO) some reshaping.
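The "transparent newline" behaviour of $ is easy to observe. A minimal sketch in Python's `re`, which copies the Perl 5 rule for `$`; note that Python spells the strict end-of-string anchor `\Z`, which corresponds to Perl 5's `\z`. The function names are invented for the example.

```python
import re

def matches_dollar(text):
    # `$` succeeds at the very end or just before one trailing newline.
    return re.search(r"foo$", text) is not None

def matches_strict_end(text):
    # Python's \Z (Perl 5 writes \z) succeeds only at the very end.
    return re.search(r"foo\Z", text) is not None

if __name__ == "__main__":
    for sample in ("foo", "foo\n", "foo\n\n"):
        print(repr(sample), matches_dollar(sample), matches_strict_end(sample))
```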

-- 
Rafael Garcia-Suarez
I'll better skip() some releases until it is() ok() to use Test::More
without() going insane(). Any more than I already am, that is().
-- Tels in the perl-qa mailing list

Re: Regex and Matched Delimiters

2002-04-23 Thread Me


> when matching against something like "foo\nwiffle\nbarfoo\n"


>/(foo.*)$/ # matches the last line

/(foo[^\n]*)$/ # assuming perl 6 meaning of $, end of string


>/(foo.*)$/m # matches the first line

/(foo[^\n]*)$$/ # assuming perl 6 meaning of $$, end of line

or

/(foo.*?)$$/


>/(foo.*)$/s # matches all three lines

/(foo.*)$/


--
ralph

Re: Regex and Matched Delimiters

2002-04-23 Thread Michael G Schwern

On Tue, Apr 23, 2002 at 11:11:28PM -0500, Me wrote:
> Third, I was thinking that having perl 6 regexen have /s on
> by default would be easy for perl 5 coders to understand;
> not too hard to get used to; and have no negative effects
> for existing coders beyond getting used to the change.

I'm jumping in the middle of a conversation here, but consider the
problem of .* matching newlines by default and greediness.

   /(foo.*)$/,  /(foo.*)$/m  and  /(foo.*)$/s

when matching against something like "foo\nwiffle\nbarfoo\n" One matches the
last line.  One matches the first line.  And one matches all three lines.
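The three cases can be reproduced with any engine that follows Perl 5 semantics. A sketch in Python's `re`, where `re.MULTILINE` and `re.DOTALL` play the roles of /m and /s; `first_match` is a helper invented for the example.

```python
import re

TEXT = "foo\nwiffle\nbarfoo\n"

def first_match(flags=0):
    """Return (start offset, captured text) for /(foo.*)$/ under the flags."""
    m = re.search(r"(foo.*)$", TEXT, flags)
    return m.start(1), m.group(1)

if __name__ == "__main__":
    print(first_match())              # no flags: the "foo" on the last line
    print(first_match(re.MULTILINE))  # like /m: the "foo" on the first line
    print(first_match(re.DOTALL))     # like /s: everything from the first "foo"
```

Without flags the capture starts at offset 14 (inside "barfoo"), with /m it starts at offset 0 and stops at the first newline, and with /s the greedy `.*` runs from offset 0 to the end of the string.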

-- 

Michael G. Schwern   <[EMAIL PROTECTED]>http://www.pobox.com/~schwern/
Perl Quality Assurance  <[EMAIL PROTECTED]> Kwalitee Is Job One
Consistency?  I'm sorry, Sir, but you obviously chose the wrong door.
-- Jarkko Hietaniemi in <[EMAIL PROTECTED]>

Re: Regex and Matched Delimiters

2002-04-23 Thread Me


> > : I'd expect . to match newlines by default.

I forgot, fourth, this simplifies the rule for . -- it
would become period matches any char, period.

Fifth, it makes the writing of "match anything but
newline" into an explicit [^\n], which I consider a
good thing.

Of course, all this is minor stuff. But I can't get
my head around parse trees and grammars, so
I'll continue to fiddle around spraying a bit of
graffiti here and there on the bikeshed.

--
ralph

Re: Regex and Matched Delimiters

2002-04-23 Thread Me


> : I'd expect . to match newlines by default. For a . that
> : didn't match newlines, I'd expect to need to use [^\n].
> 
> But . has never matched newlines by default, not even in grep.

Perhaps. But:

First, I would have thought you *can't* make . match newlines
in grep, period. If so, then when perl is handling a multi-line
string, it is handling a case grep never encounters.

Second, I think the perl 5 default is the wrong one from the
point of view of a typical newbie's guess.

Third, I was thinking that having perl 6 regexen have /s on
by default would be easy for perl 5 coders to understand;
not too hard to get used to; and have no negative effects
for existing coders beyond getting used to the change.

--
ralph

Re: Regex and Matched Delimiters

2002-04-23 Thread Aaron Sherman

On Tue, 2002-04-23 at 12:48, Larry Wall wrote:
> Brent Dax writes:

> : # \t  also 
> : # \n  also  or  (latter matching
> : logical newline)
> : # \r  also 
> : # \f  also 
> : # \a  also 
> : # \e  also 
> : 
> : I can tell you right now that these are going to screw people up.
> : They'll try to use these in normal strings and be confused when it
> : doesn't work.  And you probably won't be able to emit a warning,
> : considering how much CGI Perl munches.
> 
> I can see pragmatic variants in which those *do* interpolate by default.
> And pragmatic variants where they don't.

If you put them in one, put them in the other, HOWEVER, there's a strong
pragmatic reason for neither that I can see.

HTML/XML/SGML

I hate to say it, but if <> interpolates in everything cleanly with no
overloading, the *ML camps will thank you deeply. How often I've
written:

qq{$content}

I cannot tell you, but it's large.

Why not use {} for this and add an {eval:code}?

> I'm just wondering how far I can drive the principle that {} is always
> a closure (even though it isn't).  I admit that it's probably overkill
> here, which is why there are question marks.

I like the idea, but I don't think it fits. On the other hand, if inside
all interpolating operators {} is the special thing that gets
interpolated (and NOTHING else), I could see liking the new look:

qq{a${x}b}  => qq{a{$x}b}
qr{a\Q${x}\Eb$} => qr{a{q:$x}b$}
qr{a${x}b$} => qr{a{$x}b$}
q{a}.eval($x).q{b}  => qq{a{e:$x}b} or qq{a{{$x}}b}
"ajs\@ajs.com"  => qq{[EMAIL PROTECTED]}
"ajs". @{ajs} .".com"   => qq{ajs{@ajs}.com}

I know it's a departure from your original idea, but it certainly
unifies the syntax nicely:

qq{Hello, World!{nl}}
qr{Hello, World!{nl}}

> With respect to Perl 5, I'm trying to unhijack curlies as much as possible.

Ooops :-)

Re: Regex and Matched Delimiters

2002-04-23 Thread Larry Wall


Brent Dax writes:
: Sorry to reply to the same message twice, but I just noticed something.
: 
: Larry Wall:
: # {n,m}   
: 
: Isn't that the only use of angle brackets as a quantifier?  That's going
: to make parsing more difficult...

How so?  It's just a one-character lookahead to see if it's a digit.

But we could actually use a more general syntax:



Larry

Re: Regex and Matched Delimiters

2002-04-23 Thread Larry Wall


Me writes:
: > /pat/i m:i/pat/ or // or even m ???
: 
: Why lose the modifier-following-final-delimiter
: syntax? Is this to avoid a parsing issue, or
: because it's linguistically odd to have a modifier
: at the end?

Haven't decided for sure to lose it, but it does have several problems.
First is the parsing issue, but there's also what in natural language
is called the "end weight" problem.  We often rearrange our sentences
in English so that the short things come first and the long things come
last.  That's why you choose indirect object syntax sometimes and not
others.  Try turning either of these to the other form:

I gave him a big, smelly tuna-fish and cucumber sandwich.
I gave the sandwich to a big, smelly tuna fisherman and his dog "Cucumber".

Now, options are always little, so it seems that they should come early.

: > /^pat$/m /^^pat$$/
: 
: What's the mnemonic here? It feels the wrong
: way round -- like a single ^ or $ should match
: at newlines, double ^ or $ should only match
: at start/end string.

Well, I thought of it as ^^ or $$ matching potentially multiple places
in the string.

: Ah. The newline matches between the ^^ or $$.
: That works.

Except that the newline doesn't match between the characters.  You could
say /$$\n^^/ for instance.
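The proposed ^^ and $$ behave like Perl 5's ^ and $ under /m, so the point that the newline sits between the two assertions rather than being consumed by them can be checked today. A sketch in Python's `re` with `re.MULTILINE` standing in for the new anchors (`LINE_BREAK` and `split_lines` are names made up here):

```python
import re

# End-of-line assertion, a real newline character, start-of-line assertion.
LINE_BREAK = re.compile(r"$\n^", re.MULTILINE)

def split_lines(text):
    """Split on newlines that sit between an end of line and a start of line."""
    return LINE_BREAK.split(text)

if __name__ == "__main__":
    print(LINE_BREAK.findall("a\nb\nc"))
    print(split_lines("a\nb\nc"))
```

Each match is exactly one newline character: the two anchors contribute no width of their own.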

: Then there's the PID issue. Hmm. How to save $$
: (it is nice for one liners)?

$PID is only two chars worse.  (The * of $*PID is optional.)

: Sorry if this is a dumb suggestion, but could you have
: just one assertion, say ^$, that alternates matching
: just before and just after a newline?

^$ matches a null string.  That aside, I don't think stateful assertions
would be unconfusing in the extreme.

: > /./s // or /<.>/ ???
: 
: I'd expect . to match newlines by default. For a . that
: didn't match newlines, I'd expect to need to use [^\n].

But . has never matched newlines by default, not even in grep.  Possibly
some editors do it that way, but if so, it's non-standard.

: > space  (or \h for "horizontal"?)
: 
: Can one quote a substring of a regex? In a later part you
: say that \Q...\E is going away, so it seems not. It would be
: nice to say something like:
: 
: /foo bar baz 'qux waldo' emerson/
: 
: and have the space between qux and waldo be literal.
: Similar arguments apply more broadly so that one
: could escape the usual meaning of metacharacters etc.

Well, <"qux waldo"> could be made to mean that, I suppose.  For that
matter, so might \q{qux waldo}.  Er, \q?

: > \Lstring\E \L
: > \Ustring\E \U
: 
: Maybe, if I wasn't too far off with the quote mark
: suggestion above, then  \L'string' would be more
: natural.

Maybe \L and \q are in the same class, in which case that would work.

: > (?#...) {"..."} :-)
: 
: Will plain # comments work in p6  regexen?

Yes, just as in /x.  And there's no ambiguity in the end delimiter
any more because we parse in one pass.
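For readers who have not used /x, this is roughly what plain # comments inside a pattern look like. A sketch in Python's `re.VERBOSE` mode, which follows the Perl 5 /x rules; the `DATE` and `TICKET` patterns are invented for the example.

```python
import re

DATE = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
""", re.VERBOSE)

# A literal hash has to be escaped, otherwise it starts a comment.
TICKET = re.compile(r"\#\d+", re.VERBOSE)

if __name__ == "__main__":
    print(DATE.fullmatch("2002-04-23").groups())
    print(TICKET.fullmatch("#42") is not None)
```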

: > (?:...) <:...>
: > (?=...) 
: > (?!...) 
: > (?<=...) 
: > (?
: > (?>...) 
: 
: Hmm. So <> are clustering just like ().

Yes, and you can quantify them where it makes sense.

: One difference is that () always capture whereas <>
: only do so sometimes. Oh, and {} can too.

Eh?  <> never capture.  None of those constructs above capture.
Nothing inside a {} can capture anything that influences the paren
count outside the {}, because any inner regex has its own paren count.

: () are no longer used for clever stuff, <> are instead.
: And {}.

Basically, yes.

: Hmm. Time for bed.

Why?  I just got up.  :-)

Larry

RE: Regex and Matched Delimiters

2002-04-23 Thread Brent Dax


Sorry to reply to the same message twice, but I just noticed something.

Larry Wall:
# {n,m} 

Isn't that the only use of angle brackets as a quantifier?  That's going
to make parsing more difficult...

--Brent Dax <[EMAIL PROTECTED]>
@roles=map {"Parrot $_"} qw(embedding regexen Configure)

#define private public
--Spotted in a C++ program just before a #include

Re: Regex and Matched Delimiters

2002-04-23 Thread Larry Wall


Aaron Sherman writes:
: On Mon, 2002-04-22 at 21:53, Larry Wall wrote:
: 
: > * Parens always capture.
: > * Braces are always closures.
: > * Square brackets are always character classes.
: > * Angle brackets are always metasyntax (along with backslash).
: > 
: > So a first whack at the differences might be:
: [...]
: > space(or \h for "horizontal"?)
: > {n,m}   
: > 
: > \t  also 
: 
: I want to know how he does this!!

Could have something to do with the fact that I've been banging my head
against this for a couple of months already...

: We sit around scratching our heads
: looking for a syntax that fits and refines and he jumps in with
: something that redefines and simplifies. Larry is wasted on Perl. He
: needs to run for office ;-)

Agh, no!  I'm okay at simplifying, but I'm terrible at oversimplifying.

: > \Lstring\E  \L
: > \Ustring\E  \U
: 
: This one boggles me. Wouldn't that be something like:
: 
:  or string # ;-)

Well,  makes sense only if <> works in ordinary double quotes.

: Seriously, it seems that "\L" would be confusing.

Potentially, except that you almost never use it on anything but variable
interpolations.  So \L<$foo> would be a better example.  The confusing thing
is that $foo would not be assumed to be a regular expression, whereas it
would in bare <$foo> (at least in a regex).

: > \Q$var\E  $var    always assumed literal, so $1 is literal backref
: > $var      <$var>  assumed to be regex
: 
: Very nice. I can get behind this, and a lot of people will thank you who
: have to maintain code.

Well, almost anything is an improvement over the current syntax.

: > =~ $re  =~ /<$re>/   ouch?
: 
: If $re is a regexp, wouldn't "$str =~ $re" turn into "$re.match($str)"?
: Perhaps "$re.m $str" which is no more typing and pretty clear to me.

Sure, but I was illustrating the situation of a non-qr string being
forced to be a regex.

: > Obviously the  and  syntaxes will be user extensible.
: > We have to be able to support full grammars.  I consider it a feature
: > that  looks like a non-terminal in standard BNF notation.  I do
: > not consider it a misfeature that  resembles an HTML or XML tag,
: > since most of those languages need to be matched with a fancy rule
: > named  anyway.
: 
: It's too bad that  would be messy with standard Perl //-enclosed
: regexes, as it would be a nice way to pass parameters to user-defined
: tags. It would also allow XML-like propagation of results:
: 
:   xyz

Gee, maybe we could make a way for people to use alternate delimiters
like they've always done with s///.  :-)

Larry

Re: Regex and Matched Delimiters

2002-04-23 Thread Larry Wall


Brent Dax writes:
: # ?pat?   // or even m ???
: 
: Whoa, those are moving to the front?!?

The problem with options in general is that they can't easily modify
parsing if they come in back.  Now in the particular case of /f and /i,
it probably doesn't matter.  But I was trying to see if there was some way
to do away with trailing options altogether.  This might even extend to
things like:

qq:s"$interpolates @doesn't %doesn't"

And that's definitely a situation where it changes the parse.  Hmm, if
strings have options, they're probably additive, so to add scalar
interpolation you'd want to base it on "q", not "qq":

q:s"$interpolates @doesn't %doesn't"

On the other hand, that doesn't work for the other things like "qr", so
maybe any of :s, :a, :h turn off default interpolations, so qr:a would
only interpolate arrays, for instance.

: # /pat/x  /pat/
: # /^pat$/m  /^^pat$$/
: 
: That's...odd.  Is $$ (the variable) going away?

Maybe.  It'd be $*PID if so, since it's truly global to the process.
But if not, we could special case $$ inside regexes, just as we already
special case $ itself.

: # \p{prop}  <+prop>  ???
: # \P{prop}  <-prop>  ???
: 
: Intriguing.

Yeah, especially when you start stacking them.  But maybe we're treading
on [...] territory.  It could be argued that <...> is just a generalized
form of POSIX's [:...:] construct

: # \t  also 
: # \n  also  or  (latter matching
: logical newline)
: # \r  also 
: # \f  also 
: # \a  also 
: # \e  also 
: 
: I can tell you right now that these are going to screw people up.
: They'll try to use these in normal strings and be confused when it
: doesn't work.  And you probably won't be able to emit a warning,
: considering how much CGI Perl munches.

I can see pragmatic variants in which those *do* interpolate by default.
And pragmatic variants where they don't.

: # \033      same
: # \x1B      same
: # \x{263a}  \x<263a> ???
: 
: Why?  Wouldn't we want the same thing to work in quoted strings?  (Or
: are those changing syntaxes too?)

I'm just wondering how far I can drive the principle that {} is always
a closure (even though it isn't).  I admit that it's probably overkill
here, which is why there are question marks.

: # \c[ same
: # \N{name}
: # \l  same
: # \u  same
: # \Lstring\E  \L
: # \Ustring\E  \U
: 
: So that's changed from whenever you talked about \q{} ?

Possibly.  Again, the question is whether {} more strongly imply
something that's not true.  But curlies were so overloaded in Perl 5
that I don't think people are going to necessarily expect them to do
only one thing.  Still, if <> are taking over the role of "unmarked
metasyntactic delimiters", maybe they belong here too.

: # \E  gone
: # [\040\t]     \h  plus any Unicode horizontal whitespace
: # [\r\n\ck]   \v  plus any Unicode vertical whitespace
: #
: # \b  same
: # \B  same
: 
: # \A  ^
: # \Z  same?
: # \z  $
: 
: Are you sure that optimizes for the common case?

No, I'm not sure, but we have to clean up the \A...\z mess somehow.

: # \G  , but assumed in nested patterns?
: #
: # \1  $1
: #
: # \Q$var\E  $var    always assumed literal, so $1 is literal
: backref
: 
: So these are reinterpolated every time you backtrack?  Are you *trying*
: to destroy regex performance?  :^)

They're not interpolated.  They're matched, as in string comparison, just
as backrefs are matched right now.

: # $var      <$var>  assumed to be regex
: 
: What if $var is a qr//ed object?

Then it's a pretty easy assumption that it's a regex.  :-)

: # =~ $re  =~ /<$re>/   ouch?
: 
: I don't see the win.

No difference if $re is qr//, but if it's not, that is the syntax for
forcing $re to be interpreted as a regex.

: # (??{$rule}) 
: # (?{ code }) { code } with failure semantics
: # (?#...) {"..."} :-)
: # (?:...) <:...>
: # (?=...)   
: # (?!...) 
: # (?<=...)  
: # (?
: 
: Cute.  (Wait a minute, aren't those reversed?)

Nope, I realized they were ambiguous depending on whether you think of
them as declarative or operational, but I settled on the declarative
reading because it works with their being assertions.  All the other
options I could think of are either really clunky or similarly ambigu

Re: Regex and Matched Delimiters

2002-04-23 Thread Aaron Sherman

On Tue, 2002-04-23 at 04:32, Ariel Scolnicov wrote:
> Larry Wall <[EMAIL PROTECTED]> writes:
> 
> [...]
> 
> > /pat/x  /pat/
> 
> How do I do a "no /x"?  I know that commented /x'ed regexps are easier
> reading (I even write them myself, I swear I do!), but having to
> escape whitespace is often very annoying.  Will I really have to
> escape all spaces (or use , below)?
> 

I'm not sure that that's a bad thing. Regular expressions are the
hairiest, ugliest thing in Perl. If they change in this way, I see them
getting a tad more verbose, and a whole lot more readable and
maintainable. Besides you can always do this:

$str = "COPYING file for more information";
/$str/

since scalars will be interpolated as quoted by default.
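In Perl 5 terms, "interpolated as quoted" is what \Q...\E does today. A sketch of the two behaviours in Python's `re`, where `re.escape` plays the role of quoting; `literal_search` and `pattern_search` are hypothetical helper names.

```python
import re

def literal_search(needle, haystack, flags=0):
    """Treat needle as plain text, like a quoted interpolation."""
    return re.search(re.escape(needle), haystack, flags) is not None

def pattern_search(source, haystack, flags=0):
    """Treat source as regex syntax, like an explicitly requested sub-pattern."""
    return re.search(source, haystack, flags) is not None

if __name__ == "__main__":
    print(literal_search("a.b", "axb"), pattern_search("a.b", "axb"))
    # Quoting also protects spaces when whitespace is otherwise ignored.
    print(literal_search("COPYING file", "see COPYING file", re.VERBOSE))
```

Quoting keeps the dot and the space literal, so the string can hold ordinary text without any escaping by hand.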

Re: Regex and Matched Delimiters

2002-04-23 Thread Luke Palmer


On Wed, 24 Apr 2002, Iain Truskett wrote:

> * Larry Wall ([EMAIL PROTECTED]) [23 Apr 2002 11:56]:
> 
> [...]
> > * Parens always capture.
> 
> Maybe I missed something in the rest of the details, but is anything
> going to replace non-capturing parens? It's just that I do find them
> quite useful.

Yes.

/indeed <:this>+ wont capture/
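For comparison, the Perl 5 spelling of a non-capturing cluster is (?:...), and it can be quantified in the same way. A small sketch in Python's `re`, which shares that syntax (`groups_of` is a helper made up for the example):

```python
import re

def groups_of(pattern, text):
    """Return the tuple of captured groups for a full match, or None."""
    m = re.fullmatch(pattern, text)
    return m.groups() if m else None

if __name__ == "__main__":
    print(groups_of(r"(?:ab)+(c)", "ababc"))  # the cluster adds no group
    print(groups_of(r"(ab)+(c)", "ababc"))    # parens always capture
```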

Re: Regex and Matched Delimiters

2002-04-23 Thread Iain Truskett


* Larry Wall ([EMAIL PROTECTED]) [23 Apr 2002 11:56]:

[...]
> * Parens always capture.

Maybe I missed something in the rest of the details, but is anything
going to replace non-capturing parens? It's just that I do find them
quite useful.

-- 
iain.

RE: Regex and Matched Delimiters

2002-04-23 Thread Luke Palmer


> # =~ $re  =~ /<$re>/   ouch?
> 
> I don't see the win.

Naturally =~ $re is a bit cleaner, but we can't do that because =~ is 
smart match, not regex match.


> # (?=...) 
> # (?!...) 
> # (?<=...)
> # (?
> 
> Cute.  (Wait a minute, aren't those reversed?)

Hehe. I thought that was cool. 

/foobar/
/ foobar/
 
You see, foobar before snafoo, which is what it is.
After snafoo, foobar.

It reads very nicely.
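The declarative reading ("foobar before snafoo", "after snafoo, foobar") corresponds to lookahead and lookbehind in Perl 5 notation. A sketch in Python's `re`, which uses the same (?=...) and (?<=...) syntax; the two function names are invented for the illustration.

```python
import re

def foobar_before_snafoo(text):
    """Span of a "foobar" that is followed by "snafoo", else None."""
    m = re.search(r"foobar(?=snafoo)", text)
    return m.span() if m else None

def foobar_after_snafoo(text):
    """Span of a "foobar" that is preceded by "snafoo", else None."""
    m = re.search(r"(?<=snafoo)foobar", text)
    return m.span() if m else None

if __name__ == "__main__":
    print(foobar_before_snafoo("foobarsnafoo"))
    print(foobar_after_snafoo("snafoofoobar"))
```

In both cases only "foobar" is part of the match; "snafoo" is asserted but not consumed.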



Luke

Re: Regex and Matched Delimiters

2002-04-23 Thread Aaron Sherman

On Mon, 2002-04-22 at 21:53, Larry Wall wrote:

> * Parens always capture.
> * Braces are always closures.
> * Square brackets are always character classes.
> * Angle brackets are always metasyntax (along with backslash).
> 
> So a first whack at the differences might be:
[...]
> space  (or \h for "horizontal"?)
> {n,m} 
> 
> \t  also 

I want to know how he does this!! We sit around scratching our heads
looking for a syntax that fits and refines and he jumps in with
something that redefines and simplifies. Larry is wasted on Perl. He
needs to run for office ;-)

> \Lstring\E  \L
> \Ustring\E  \U

This one boggles me. Wouldn't that be something like:

 or string # ;-)

Seriously, it seems that "\L" would be confusing.

> \Q$var\E  $var    always assumed literal, so $1 is literal backref
> $var  <$var>  assumed to be regex

Very nice. I can get behind this, and a lot of people will thank you who
have to maintain code.

> =~ $re  =~ /<$re>/   ouch?

If $re is a regexp, wouldn't "$str =~ $re" turn into "$re.match($str)"?
Perhaps "$re.m $str" which is no more typing and pretty clear to me.

> Obviously the  and  syntaxes will be user extensible.
> We have to be able to support full grammars.  I consider it a feature
> that  looks like a non-terminal in standard BNF notation.  I do
> not consider it a misfeature that  resembles an HTML or XML tag,
> since most of those languages need to be matched with a fancy rule
> named  anyway.

It's too bad that  would be messy with standard Perl //-enclosed
regexes, as it would be a nice way to pass parameters to user-defined
tags. It would also allow XML-like propagation of results:

xyz

Re: Regex and Matched Delimiters

2002-04-23 Thread Me


> /pat/i m:i/pat/ or // or even m ???

Why lose the modifier-following-final-delimiter
syntax? Is this to avoid a parsing issue, or
because it's linguistically odd to have a modifier
at the end?


> /^pat$/m /^^pat$$/

What's the mnemonic here? It feels the wrong
way round -- like a single ^ or $ should match
at newlines, double ^ or $ should only match
at start/end string.

Ah. The newline matches between the ^^ or $$.
That works.

Then there's the PID issue. Hmm. How to save $$
(it is nice for one liners)?

Sorry if this is a dumb suggestion, but could you have
just one assertion, say ^$, that alternates matching
just before and just after a newline?


> /./s // or /<.>/ ???

I'd expect . to match newlines by default. For a . that
didn't match newlines, I'd expect to need to use [^\n].


> space  (or \h for "horizontal"?)

Can one quote a substring of a regex? In a later part you
say that \Q...\E is going away, so it seems not. It would be
nice to say something like:

/foo bar baz 'qux waldo' emerson/

and have the space between qux and waldo be literal.
Similar arguments apply more broadly so that one
could escape the usual meaning of metacharacters etc.


> \Lstring\E \L
> \Ustring\E \U

Maybe, if I wasn't too far off with the quote mark
suggestion above, then  \L'string' would be more
natural.


> (?#...) {"..."} :-)

Will plain # comments work in p6  regexen?


> (?:...) <:...>
> (?=...) 
> (?!...) 
> (?<=...) 
> (?
> (?>...) 

Hmm. So <> are clustering just like ().

One difference is that () always capture whereas <>
only do so sometimes. Oh, and {} can too.

() are no longer used for clever stuff, <> are instead.
And {}.

Hmm. Time for bed.


--
ralph

Re: Regex and Matched Delimiters

2002-04-23 Thread Ariel Scolnicov


Larry Wall <[EMAIL PROTECTED]> writes:

[...]

> /pat/x  /pat/

How do I do a "no /x"?  I know that commented /x'ed regexps are easier
reading (I even write them myself, I swear I do!), but having to
escape whitespace is often very annoying.  Will I really have to
escape all spaces (or use , below)?

This also marks a significant departure from UN*X-style regexps.  One
reason learning Perl's regexp language was so convenient (to me) was
that most of what I knew of UN*X regexps was applicable.
Changing the behaviour of a rather useful character (like ASCII 32) is
going to produce many references to the FAQ "Why doesn't /a word/
match 'a word'?".  (Having to escape #s is not as bad, as they are
less common).
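The FAQ scenario can already be reproduced with /x in Perl 5. A sketch in Python's `re.VERBOSE` mode, which follows the same whitespace rules; `found` is a helper invented for the example.

```python
import re

def found(pattern, text):
    """True if pattern matches somewhere in text with whitespace ignored (/x)."""
    return re.search(pattern, text, re.VERBOSE) is not None

if __name__ == "__main__":
    print(found(r"a word", "a word"))    # the space in the pattern is ignored
    print(found(r"a word", "aword"))     # so this is what actually matches
    print(found(r"a[ ]word", "a word"))  # a character class keeps the space
    print(found(r"a\ word", "a word"))   # and so does a backslash
```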

[...]

-- 
Ariel Scolnicov|http://3w.compugen.co.il/~ariels
Compugen Ltd.  |[EMAIL PROTECTED]
72 Pinhas Rosen St.|Tel: +972-3-7658117  "fast, good, and cheap;
Tel-Aviv 69512, ISRAEL |Fax: +972-3-7658555   pick any two!"

RE: Regex and Matched Delimiters

2002-04-23 Thread Brent Dax


Piers Cawley:
# "Brent Dax" <[EMAIL PROTECTED]> writes:
# > Larry Wall:
# > That's...odd.  Is $$ (the variable) going away?
# >
# > # /./s  // or /<.>/ ???
# >
# > I think that . is too common a metacharacter to be 
# relegated to this.
# 
# I think you failed to notice that '/s' on the regex. In 
# general . will still mean . but if you want it to match 
# *anything* including a new line, you have to call it <.>. 
# Personally, I don't have a problem with that.

Ah, you're right.  My bad.

# > # space  (or \h for "horizontal"?)
# >
# > Same thinking as '.'.
# 
# The golfers aren't going to like it for sure. But most of the 
# time when I'm doing production code I have /x turned on 
# anyway, and in that context, if I want to match a space and 
# only a space, I have to do [ ] anyway. 
# 
# It might be nice if we could have m:X// mean 'space and hash 
# match themselves'. 

I was thinking that  would replace \s.  If that isn't the case, I
have no real complaint (if you can turn off /x).

# > # \t  also 
# > # \n  also  or  (latter matching
# > logical newline)
# > # \r  also 
# > # \f  also 
# > # \a  also 
# > # \e  also 
# >
# > I can tell you right now that these are going to screw people up. 
# > They'll try to use these in normal strings and be confused when it 
# > doesn't work.  And you probably won't be able to emit a warning, 
# > considering how much CGI Perl munches.
# 
# But assigning meaning to < and > is going to do that anyway. 

Not if the things are meaningless outside of regexes.  For example,
lookahead sequences make absolutely no sense in a quoted string.

--Brent Dax <[EMAIL PROTECTED]>
@roles=map {"Parrot $_"} qw(embedding regexen Configure)

#define private public
--Spotted in a C++ program just before a #include

Re: Regex and Matched Delimiters

2002-04-23 Thread Piers Cawley

"Brent Dax" <[EMAIL PROTECTED]> writes:
> Larry Wall:
> That's...odd.  Is $$ (the variable) going away?
>
> # /./s// or /<.>/ ???
>
> I think that . is too common a metacharacter to be relegated to
> this.

I think you failed to notice that '/s' on the regex. In general . will
still mean . but if you want it to match *anything* including a new
line, you have to call it <.>. Personally, I don't have a problem with
that.
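
For anyone who wants to see the Perl 5 side of this, here is a minimal sketch in Python, whose re module borrows Perl 5's syntax (re.DOTALL plays the role of /s). The sample text and variable names are made up for illustration; nothing here models the proposed <.> form.

```python
import re

text = "line one\nline two"

# Without a flag, "." stops at the newline.
plain = re.match(r".*", text).group(0)

# With re.DOTALL (the counterpart of Perl 5's /s), "." also matches the newline.
dotall = re.match(r".*", text, re.DOTALL).group(0)

print(repr(plain))
print(repr(dotall))
```

So the question in the thread is only about which spelling gets the "really anything" meaning, not about whether both behaviours remain available.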

> # space  (or \h for "horizontal"?)
>
> Same thinking as '.'.

The golfers aren't going to like it for sure. But most of the time
when I'm doing production code I have /x turned on anyway, and in that
context, if I want to match a space and only a space, I have to do [ ]
anyway. 
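
A small sketch of that point, using Python's re.VERBOSE as a stand-in for Perl 5's /x (the patterns and names below are illustrative assumptions, not anything from the proposal):

```python
import re

# Under re.VERBOSE (like /x), a bare space in the pattern is ignored...
ignored = re.compile(r"foo bar", re.VERBOSE)
# ...so a literal space has to be written as a character class (or escaped).
literal = re.compile(r"foo[ ]bar", re.VERBOSE)

matches_joined = ignored.fullmatch("foobar") is not None
matches_spaced = ignored.fullmatch("foo bar") is not None
class_matches_spaced = literal.fullmatch("foo bar") is not None

print(matches_joined, matches_spaced, class_matches_spaced)
```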

It might be nice if we could have m:X// mean 'space and hash match
themselves'. 

> # \t  also 
> # \n  also  or  (latter matching
> logical newline)
> # \r  also 
> # \f  also 
> # \a  also 
> # \e  also 
>
> I can tell you right now that these are going to screw people up.
> They'll try to use these in normal strings and be confused when it
> doesn't work.  And you probably won't be able to emit a warning,
> considering how much CGI Perl munches.

But assigning meaning to < and > is going to do that anyway. 

-- 
Piers

   "It is a truth universally acknowledged that a language in
possession of a rich syntax must be in need of a rewrite."
 -- Jane Austen?

Re: Regex and Matched Delimiters

2002-04-23 Thread Piers Cawley


Larry Wall <[EMAIL PROTECTED]> writes:
> /^pat$/m  /^^pat$$/

$$ is no longer the current PID? Or will we have to call that '${$}'
in a regex?
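
For comparison with the /^pat$/m row being discussed, this is roughly how the Perl 5 behaviour looks today, sketched in Python (re.MULTILINE corresponds to /m). The sample text is invented, and note that Python's \Z behaves like Perl 5's \z, matching only at the very end of the string.

```python
import re

text = "alpha\npat\nomega"

# Like Perl 5 /^pat$/m: ^ and $ match at every line boundary.
per_line = re.search(r"^pat$", text, re.MULTILINE) is not None

# Without the flag, ^ and $ anchor only at the ends of the string,
# so the middle line is not found.
whole_string = re.search(r"^pat$", text) is not None

# \A and \Z always mean start and end of string, whatever the flags.
strict = re.search(r"\Apat\Z", text, re.MULTILINE) is not None

print(per_line, whole_string, strict)
```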

-- 
Piers

   "It is a truth universally acknowledged that a language in
possession of a rich syntax must be in need of a rewrite."
 -- Jane Austen?

RE: Regex and Matched Delimiters

2002-04-22 Thread Brent Dax


Larry Wall:
# Me writes:
# : > Very nice (but, I assume you meant {$foo data})!
# : 
# : I didn't mean that (even if I should have).
# : 
# : Aiui, Mike's final suggestion was that parens end up
# : doing all the (ops data) tricks, and braces are used
# : purely to do code insertions. (I really liked that idea.)
# : 
# : So:
# : 
# : Perl 5        Perl 6
# : (data)        ( data)
# : (?opsdata)    (ops data)
# : ({})          {}
# 
# Hmm.  Let me spill a few beans about where I'm going with A5. 
#  I've been thinking similar thoughts about the problem of 
# overloading parens so heavily in Perl 5, but I'm going in a 
# slightly different direction with it.  The basic principles 
# for the new regexen are:
# 
# * Parens always capture.
# * Braces are always closures.
# * Square brackets are always character classes.
# * Angle brackets are always metasyntax (along with backslash).
# 
# So a first whack at the differences might be:
# 
# Old       New
# ---       ---
# //        //  ???
# ?pat?     // or even m ???

Whoa, those are moving to the front?!?

# /pat/x    /pat/
# /^pat$/m  /^^pat$$/

That's...odd.  Is $$ (the variable) going away?

# /./s  // or /<.>/ ???

I think that . is too common a metacharacter to be relegated to this.

# \p{prop}  <+prop>  ???
# \P{prop}  <-prop>  ???

Intriguing.

# space  (or \h for "horizontal"?)

Same thinking as '.'.

# {n,m} 

Ah, OK.

# \t  also 
# \n  also  or  (latter matching logical newline)
# \r  also 
# \f  also 
# \a  also 
# \e  also 

I can tell you right now that these are going to screw people up.
They'll try to use these in normal strings and be confused when it
doesn't work.  And you probably won't be able to emit a warning,
considering how much CGI Perl munches.

# \033  same
# \x1B  same
# \x{263a}  \x<263a> ???

Why?  Wouldn't we want the same thing to work in quoted strings?  (Or
are those changing syntaxes too?)

# \c[         same
# \N{name}  
# \l          same
# \u          same
# \Lstring\E  \L
# \Ustring\E  \U

So that's changed from whenever you talked about \q{} ?

# \E         gone
# [\040\t]   \h  plus any Unicode horizontal whitespace
# [\r\n\ck]  \v  plus any Unicode vertical whitespace
# 
# \b         same
# \B         same

# \A         ^
# \Z         same?
# \z         $

Are you sure that optimizes for the common case?

# \G, but assumed in nested patterns?
#  
# \1$1
# 
# \Q$var\E  $var  always assumed literal, so $1 is literal backref

So these are reinterpolated every time you backtrack?  Are you *trying*
to destroy regex performance?  :^)

# $var  <$var>  assumed to be regex

What if $var is a qr//ed object?

# =~ $re  =~ /<$re>/   ouch?

I don't see the win.

# (??{$rule})   
# (?{ code })   { code } with failure semantics
# (?#...)   {"..."} :-)
# (?:...)   <:...>
# (?=...)   
# (?!...)   
# (?<=...)  
# (?

Cute.  (Wait a minute, aren't those reversed?)

# (?>...)   
# (?(cond)t|f)  Not sure.  Could just use { if ... }

?

# Obviously the  and  syntaxes will be user 
# extensible. We have to be able to support full grammars.  I 
# consider it a feature that  looks like a non-terminal in 
# standard BNF notation.  I do not consider it a misfeature 
# that  resembles an HTML or XML tag, since most of those 
# languages need to be matched with a fancy rule named  anyway.

But that *does* make it harder to define the fancy rules.  I could see
someone defining rules like:

'gt' => qr/\ qr/\>/

just to get around backslashing everything in sight.

# An interesting idea would be that if you say
# 
# m
# 
# or
# 
# m{code}
# 
# it's as if you said
# 
# m//
# 
# or
# 
# m/{code}/

I don't know about that one.  I often use {} as delimiters on regexen
because it's a character that doesn't occur in data very often.  I think
the gain of two characters isn't as critical as the loss of options.
 
Understand, I'm not a regex Luddite.  I've been working with yacc and
lex a lot lately, so I have at least a hint of how powerful formal
parsing is--and I love all of these features.  However, I think that
syntactically a l

Re: Regex and Matched Delimiters

2002-04-22 Thread Luke Palmer


> (?=...)   
> (?!...)   
> (?<=...)  
> (?
> (?>...)   

Yummy :)
I'd say this is about perfect. The look(ahead|behind)s, er, 
look<:ahead|behind>s are used seldom enough that this is practical. And 
it's much clea[nr]er than that (?=...) crap. (Think I'm going 
overboard with this tregext?)

And are you going to reveal the method by which you define your own 
s, so we can overload it with personal ungrounded opinions? (On the 
other hand, it'd probably just stick and not move, because you said it.)

> Sorry if this is a bit delirious--I'm fighting off some kind of
> infection, and my nights have been shortchanged lately by the
> neighborhood panhandler who doesn't seem to understand either
> complicated concepts like "bedtime" or simple concepts like "no".

bed...what?


Luke

Re: Regex and Matched Delimiters

2002-04-22 Thread Larry Wall


Me writes:
: > Very nice (but, I assume you meant {$foo data})!
: 
: I didn't mean that (even if I should have).
: 
: Aiui, Mike's final suggestion was that parens end up
: doing all the (ops data) tricks, and braces are used
: purely to do code insertions. (I really liked that idea.)
: 
: So:
: 
: Perl 5        Perl 6
: (data)        ( data)
: (?opsdata)    (ops data)
: ({})          {}

Hmm.  Let me spill a few beans about where I'm going with A5.  I've
been thinking similar thoughts about the problem of overloading parens
so heavily in Perl 5, but I'm going in a slightly different direction
with it.  The basic principles for the new regexen are:

* Parens always capture.
* Braces are always closures.
* Square brackets are always character classes.
* Angle brackets are always metasyntax (along with backslash).

So a first whack at the differences might be:

Old             New
---             ---
//              //  ???
?pat?           // or even m ???
/pat/x          /pat/
/^pat$/m        /^^pat$$/
/./s            // or /<.>/ ???

\p{prop}        <+prop>  ???
\P{prop}        <-prop>  ???
space           (or \h for "horizontal"?)
{n,m}

\t              also
\n              also  or  (latter matching logical newline)
\r              also
\f              also
\a              also
\e              also
\033            same
\x1B            same
\x{263a}        \x<263a> ???
\c[             same
\N{name}
\l              same
\u              same
\Lstring\E      \L
\Ustring\E      \U
\E              gone
[\040\t]        \h  plus any Unicode horizontal whitespace
[\r\n\ck]       \v  plus any Unicode vertical whitespace

\b              same
\B              same
\A              ^
\Z              same?
\z              $
\G              , but assumed in nested patterns?

\1              $1

\Q$var\E        $var    always assumed literal, so $1 is literal backref
$var            <$var>  assumed to be regex
=~ $re          =~ /<$re>/   ouch?

(??{$rule})
(?{ code })     { code } with failure semantics
(?#...)         {"..."} :-)
(?:...)         <:...>
(?=...)
(?!...)
(?<=...)
(?
(?>...)
(?(cond)t|f)    Not sure.  Could just use { if ... }

Obviously the  and  syntaxes will be user extensible.
We have to be able to support full grammars.  I consider it a feature
that  looks like a non-terminal in standard BNF notation.  I do
not consider it a misfeature that  resembles an HTML or XML tag,
since most of those languages need to be matched with a fancy rule
named  anyway.

An interesting idea would be that if you say

m

or

m{code}

it's as if you said

m//

or

m/{code}/

The latter is particularly interesting to me in that I can see uses for
patterns that are Perl code at the top level rather than regex
literal.  Any closure within a regular expression has full access to
the current state object for the match.  So most of the RFCs proposing
ad hoc mechanisms for saving submatches in various kinds of variables
can be handled with closures.

/(...)(...)(...) { @array = .all } /

or

/(...) { $first  = $+ }
 (...) { $second = $+ }
 (...) { $third  = $+ }/

or

/ () () { .node = ["if",$1,$2] } /  # shades of yacc

or whatever.  Could have a <$foo=...> as syntactic sugar, perhaps.
But we need the general mechanism for building up parse trees of
arrays of hashes of arrays of arrays of hashes of arrays of hashes of...
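
As a rough illustration of the closure idea, here is a minimal Python sketch in which an action callback receives the match and stores or builds something from the captures. The helper name match_with_action and the "if ... then ..." input are hypothetical. One important difference from the proposal: the callback here runs once after the match succeeds, not in the middle of matching, so it cannot take part in backtracking.

```python
import re

def match_with_action(pattern, text, action):
    """Run action on the match object if the pattern matches; return its result."""
    m = re.match(pattern, text)
    return action(m) if m else None

# Collect all captures into a list, in the spirit of { @array = .all }.
captures = match_with_action(r"(...)(...)(...)", "abcdefghi",
                             lambda m: list(m.groups()))

# Build a small parse-tree node, in the spirit of { .node = ["if", $1, $2] }.
node = match_with_action(r"if\s+(\w+)\s+then\s+(\w+)", "if ready then go",
                         lambda m: ["if", m.group(1), m.group(2)])

print(captures)
print(node)
```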

I haven't decided yet whether matches embedded in the closure should
automatically pick up where the outer match is, or whether there should
be some explicit match op to mean that, much like \G only better.  I'm
thinking when the current topic is a match state, we automatically
continue where we left off, and require explicit =~ to start an unrelated
match.
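
The "continue where we left off" behaviour can be approximated today by passing an explicit start offset. A minimal sketch in Python, where Pattern.match(text, pos) anchors at pos much as \G does; the TOKEN pattern and tokenize function are illustrative assumptions:

```python
import re

TOKEN = re.compile(r"\s*(\d+|[-+*/()])")

def tokenize(text):
    """Split text into tokens, each match starting where the last one ended."""
    pos = 0
    tokens = []
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)   # anchored at pos, much like \G
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()                # the next match continues from here
    return tokens

print(tokenize("12 + (3*4)"))
```

The bookkeeping of pos is exactly the part that an implicit "current match state" topic would hide.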

I also haven't committed to any particular mechanism for defining a
set of related rules in a grammar.  Obviously it needs to be a good
enough mechanism to parse Perl and its variants, which means it
probably needs to be OO based, and you make new grammars by derivation
from the base grammar and overriding the rules you want to change.
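
To make the derivation idea concrete, here is a minimal sketch in Python, assuming each rule is a method that returns a pattern fragment and a derived grammar overrides only the rules it wants to change. The class and rule names are hypothetical and say nothing about the mechanism that will actually be chosen.

```python
import re

class BaseGrammar:
    """Each rule is a method returning a pattern fragment."""

    def number(self):
        return r"\d+"

    def operator(self):
        return r"[+-]"

    def expression(self):
        # Built from the other rules, so overrides are picked up automatically.
        return rf"{self.number()}(?:\s*{self.operator()}\s*{self.number()})*"

    def matches(self, text):
        return re.fullmatch(self.expression(), text) is not None

class HexGrammar(BaseGrammar):
    """Derived grammar: only the 'number' rule changes."""

    def number(self):
        return r"0x[0-9a-fA-F]+"

print(BaseGrammar().matches("1 + 2 - 3"))
print(HexGrammar().matches("0x1F + 0x2"))
```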

Sorry if this is a bit delirious--I'm fighting off some kind of
infection, and my nights have been shortchanged lately by the
neighborhood panhandler who doesn't seem to understand either
complicated concepts like "bedtime" or simple concepts like "no".

Larry

Re: Regex and Matched Delimiters

2002-04-22 Thread Aaron Sherman


On Mon, 2002-04-22 at 14:18, Me wrote:
> > Very nice (but, I assume you meant {$foo data})!
> 
> I didn't mean that (even if I should have).
> 
> Aiui, Mike's final suggestion was that parens end up
> doing all the (ops data) tricks, and braces are used
> purely to do code insertions. (I really liked that idea.)
> 
> So:
> 
> Perl 5        Perl 6
> (data)        ( data)
> (?opsdata)    (ops data)
> ({})          {}

I don't like that particular way of looking at things, but either way my
comments about subroutines and closures still hold.

Re: Regex and Matched Delimiters

2002-04-22 Thread Me


> Very nice (but, I assume you meant {$foo data})!

I didn't mean that (even if I should have).

Aiui, Mike's final suggestion was that parens end up
doing all the (ops data) tricks, and braces are used
purely to do code insertions. (I really liked that idea.)

So:

Perl 5        Perl 6
(data)        ( data)
(?opsdata)    (ops data)
({})          {}


--
ralph

Re: Regex and Matched Delimiters

2002-04-22 Thread Aaron Sherman

On Sat, 2002-04-20 at 14:33, Me wrote:

> [2c. What about ( data) or (ops data) normally means non-capturing,
> ($2 data) captures into $2, ($foo data) captures into $foo?]

Very nice (but, I assume you meant {$foo data})! This does add another
special case to the regexp parser's handling of "$", but it seems like
it would be worth it.

Makes me think of the even slightly hairier:

{&foo data}

or even more hair-full:

{&{$foo} data}

for references.

Where you capture into the usual positional, and then invoke foo with
the variable as parameter.

Would be pretty nice closure-wise:

sub match_with_alert($re,$id,$ops,$fac,$pri) {
    openlog $id,$ops,$fac;
    my $alert = sub ($match) {
        syslog $pri, "Matched regexp: $match";
    }
    return study /{&{$alert} $re}/;
}
my $m = match_with_alert('ROOT login',$0,0,LOG_USER,PRI_CRIT);
for <> -> $_ { /$m/ }

That would certainly be a handy thing that would set Perl apart from the
pack of advanced regexp languages that don't support closures.

Some other things come to mind as well, but I'm not sure how evil they
are. For example:

sub decrypt($data is rw) {
    $data = rot13($data);
}

print "The secret message is: ", /^Encrypted: {&decrypt .*}/,
  "\n";

RE: Regex and Matched Delimiters

2002-04-22 Thread Aaron Sherman


On Sat, 2002-04-20 at 05:06, Mike Lambert wrote:
> > He then went on to describe something I didn't understand at all.
> > Sorry.
> 
> Few corrections to what you wrote:
> 
> To avoid the problem of extending {} to support new features with a
> character 'x', without breaking stuff that might have an 'x' immediately
> after the '{', my proposal is to require one space after the { before the
> real regex appears.

I hope that you mean "one or more whitespace characters", not just a
space. The following would be correct, no?

/{|
.*
 }/

Anything else would seem rather confusing to the average Perl
programmer.

Re: Regex and Matched Delimiters

2002-04-20 Thread Me


> [2c. What about ( data) or (ops data) normally means non-capturing,
> ($2 data) captures into $2, ($foo data) captures into $foo?]

which is cool where being explicit simplifies things, but
ain't where implicit is simpler. So, maybe add an op ('$'?)
or switch that makes parens capturing by default, ie as
per perl5.

--
ralph

Re: Regex and Matched Delimiters

2002-04-20 Thread Me


Let me see if I understand the final version of your (Mike's)
suggestions and where it appears to be headed:

Backwards compatibility:
perl5 extended syntax still works in perl6 if one happens to use it.

Forward conversion:
Automatic conversion of relevant perl5 regex syntax to perl6 is simple.

New extension syntax:
1. Syntax is (ops data).
2. There are a bunch of built-in ops, but user can define new ones.

[2c. What about ( data) or (ops data) normally means non-capturing,
($2 data) captures into $2, ($foo data) captures into $foo?]

Rationalized ops syntax:
Ops string consists of arbitrarily ordered individual op characters.
(eg '<' signifies a look behind, '!<' signifies fail if look behind
match.)

Embedded code:
Code is inserted using {} with something other than digits in them.

(Other stuff, such as sexegers, ignored.)

--
ralph

RE: Regex and Matched Delimiters

2002-04-20 Thread Mike Lambert

> He then went on to describe something I didn't understand at all.
> Sorry.

Few corrections to what you wrote:

To avoid the problem of extending {} to support new features with a
character 'x', without breaking stuff that might have an 'x' immediately
after the '{', my proposal is to require one space after the { before the
real regex appears.

So to correct the example I wrote of /{a|b|c}+/, it would become
/{ a|b|c}+/. It looks a bit weird if you're accustomed to perl5's behavior
of (?:). { \ } would then match a single space. {  } would do nothing,
since the second space falls under the whitespace-insensitive regex rule.

Now, since we require a space, all the characters before this space
now become 'special' in some form. This fact allows us to add new
special characters and map them to functionality, if perl doesn't
already do that.

For example, I would register | to be:
sub zerowidth ($regex) {
  return <<"EOF";
  push \@pos, pos();
  regex_run $( qr/$regex/ );
  pos() = pop \@pos;
  EOF
}

And conversely, _ would be written as:
sub regularwidth ($r) {
  return "regex_run $( qr/$r/ )"
}

This would allow me to do whacky things, like register these:
sub plus ($r) {return "\$level++;regex_run $( qr/$r/ )"}
sub minus($r) {return "\$level--;check();regex_run $( qr/$r/ )"}
sub check {assert($level>0)}
{ {+ \(} | {- \)} | . } ({ check() })
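
Read as ordinary code, the plus/minus/check handlers amount to a nesting counter. A minimal sketch in Python, under the assumption that the intent is to reject a ")" that has no matching "("; it therefore tests for a negative level rather than the level>0 written above, so that a balanced pair passes. The function name is hypothetical.

```python
def parens_never_close_early(text):
    """Scan text, tracking nesting depth as the plus/minus handlers would."""
    level = 0
    for ch in text:
        if ch == "(":
            level += 1          # the "plus" handler
        elif ch == ")":
            level -= 1          # the "minus" handler
            if level < 0:       # the check: give up on an unmatched ")"
                return False
    return True

print(parens_never_close_early("(a(b)c)"))
print(parens_never_close_early("a)b("))
```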

Brent and I also disagreed on the use of sexegers. japhy has done more
thinking about this than either of us have, so perhaps we should just let
him weigh in on the issue. I proposed that {< be a sexeger, whereas he
prefers {< be a lookbehind. I'll use the former for the rest of this
discussion, since on IRC we had to agree to disagree on it.

Regardless, having support for sexegers supports all of the behavior of
lookbehinds, since lookbehinds are just a constant-string, and could never
be a regex in Perl5. I still like the way lookbehinds work, and am not
suggesting that they disappear entirely, but rather that they be changed
into an underlying sexeger form.

sub b ($reg) {
  my $ger = reverse $reg;
  return "run_regex qr/{<|= \Q$ger\E}/"
}

The following perl5 regex:
/(?<=foo)bar/
is now equivalent to:
/(b foo)bar/
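
The equivalence being relied on can be checked with today's tools. A minimal sketch in Python: the first function uses an ordinary lookbehind, the second reverses both the text and the pattern and uses a lookahead instead. Function names and the sample string are illustrative assumptions; this shows the reversal trick only, not the proposed (b ...) syntax.

```python
import re

def lookbehind_positions(text):
    """Start offsets of 'bar' preceded by 'foo', via an ordinary lookbehind."""
    return [m.start() for m in re.finditer(r"(?<=foo)bar", text)]

def sexeger_positions(text):
    """Same offsets, found by reversing both the text and the pattern."""
    rev = text[::-1]
    n = len(text)
    # In the reversed text, 'bar' preceded by 'foo' becomes 'rab' followed by 'oof'.
    hits = [n - m.end() for m in re.finditer(r"rab(?=oof)", rev)]
    return sorted(hits)

sample = "foobar bar xfoobar"
print(lookbehind_positions(sample))
print(sexeger_positions(sample))
```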

> The only major drawback I can see to that is the naïve user might type
> {.*?}+ expecting a bunch of text in bold tags and getting a

Sorry I forgot to make that clearer. The above regex would have to be
written as { .*}+ to work properly, specifying that there are no
special tokens.

> Here's how it works:
>   -If the code returns undef, we backtrack.
>   -If the code returns the empty string, we move on.
>   -If the code returns anything else, we interpolate that into the
> regex.
>
> So, we now just have ({}).

({print "hello"}) will, unfortunately, be really weird. Since print returns 1,
the block will return 1. We'd have to force-specify a return value of "".
While simplifying the set of operators is good, and I want to do a bunch of
that myself, we should probably offer a way to perform 'execute with
no interpolated regex' behavior of before, somehow built up on top of
the existing ({}) operator.

Reflecting on it all a bit, if we're willing to make a larger sacrifice
in backwards compatibility, it might make things make more sense.
- {} would be the code operator, which was specified up above as ({}).
  This makes more sense, imo, since {} is traditionally used for
  blocks.
- () would have all the special semantics described for {} in this
  thread.

The default for () could still be capturing, so ( a*) performs capturing
on /a*/. We'd then have to define another pair of symbols for turning
capturing on and off. All instances of Perl5's (blah) would convert to
( blah), and all instances of the special operators in perl5 a la
(?@#blah) would translate as they did before, but also specifying the
'don't capture within these parens' special identifier.

Basically, I'm trying to propose a system which makes all the regex stuff
become orthogonal. Rather than creating a bunch of hardcoded types of (?>=
regex operators, instead define small functionalities which can be
combined in ways to emulate these tried and true constructs.

Brent, let me know if I'm still spouting gibberish on this email. :)

Mike Lambert

RE: Regex and Matched Delimiters

2002-04-19 Thread Brent Dax


Mike Lambert:
(a bunch of stuff about regexes)

No offense intended, but I had trouble understanding that, and I helped
come up with the thing.  :^)  So, I'll try to interpret.

In Perl 5, we came up against the problem of simply running out of
characters in regexes.  To deal with this, Larry came up with the
(?_regex) syntax, where _ is some character.  Although a clever use of
an otherwise impossible sequence, it's also gratuitously ugly.

Consider the many roles (?_) plays:

Non-capturing parentheses: (?:)
Look(ahead|behind)s: (?=), (?!), (?<=), (?)

Obviously, this is getting out of hand--using more than one or two of
those constructs makes your regex much harder to read.

Let's first tackle non-capturing parentheses and lookarounds.  If we
think about what metacharacters are around, we can realize that {} is
only legal with numbers inside it.  [0]  That means that we can probably
reuse it.  If we think about it, we can derive a few basic categories:

-consuming (_) or not (|) [1]
Reasoning: _ is fat, | is skinny
-positive (=) or negative (!)
Reasoning: same as in Perl 5
-forwards (>) or backwards (<)
Reasoning: same as in Perl 5

The characters in parentheses are prefix characters that indicate which
is to be used.  A simple mapping of the five things this section covers
follows:

Perl 5  Perl 6
--  --
(?:regex)   {_=>regex}
(?=regex)   {|=>regex}
(?!regex)   {|!>regex}
(?<=regex)  {|=.
So here's a map of what you're more likely to see in a regex:

Perl 5  Perl 6
--  --
(?:regex)   {regex}
(?=regex)   {|regex}
(?!regex)   {|!regex}
(?<=regex)  {|regex)  --  Nonsensical.
{_=.*?}+ expecting a bunch of text in bold tags and getting a
lookbehind instead--so it may be wise to leave the | and _ specifiers
out of this altogether, and come up with a better way.  I'll address
that point shortly.
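
Since the mapping table above did not survive intact, here is a minimal sketch of how the three axes (consuming or not, positive or negative, forwards or backwards) line up with the five existing Perl 5 constructs. It is written in Python purely as a lookup; the function name is hypothetical, and it deliberately returns only the Perl 5 prefixes, not the proposed brace spellings.

```python
def perl5_group(consuming, positive, forwards):
    """Return the Perl 5 group prefix for one combination of the three axes."""
    if consuming:
        # Only the plain positive, forwards case exists as a consuming group.
        if positive and forwards:
            return "(?:"
        raise ValueError("no Perl 5 construct for this combination")
    table = {
        (True, True): "(?=",     # positive lookahead
        (False, True): "(?!",    # negative lookahead
        (True, False): "(?<=",   # positive lookbehind
        (False, False): "(?<!",  # negative lookbehind
    }
    return table[(positive, forwards)]

print(perl5_group(False, True, False))
```

Three binary axes give eight combinations, but Perl 5 only fills five of them, which is why some entries in a full table come out as nonsensical.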

In the meantime, let's consider some of the other syntaxes.  The inline
code things are a good opportunity for improvement--and they have a good
alternative.  In Perl 5, ({ ought not to be legal, but it is--it's
hacked in to be the same as (\{.  So, we can drop a question mark from
each of the block forms, getting ({code}) and (?{code}).  However, we can
go even further by combining the two.

Here's how it works:
-If the code returns undef, we backtrack.
-If the code returns the empty string, we move on.
-If the code returns anything else, we interpolate that into the
regex.

So, we now just have ({}).
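
The three-way rule can be written down as a tiny dispatcher. A minimal sketch in Python, with None standing in for Perl's undef; the function name and the outcome labels are hypothetical.

```python
def interpret_code_result(result):
    """Map a code block's return value onto the three proposed outcomes.

    Python's None stands in for Perl's undef here.
    """
    if result is None:
        return ("backtrack", None)
    if result == "":
        return ("continue", None)
    return ("interpolate", str(result))

print(interpret_code_result(None))
print(interpret_code_result(""))
print(interpret_code_result(1))
```

The last call shows the catch with this design: a block whose final statement happens to return 1 ends up with "1" interpolated into the regex unless it explicitly returns the empty string.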

Comments can go, since Larry has said that /x will be on by default
anyway.

That leaves conditionals, non-backtracking sections, inline modifiers,
and (maybe) non-capturing parens.  We now have three characters that
aren't valid in these places: *, +, and ?.

My suggestion is this:

Thing             Syntax        Logic
-----             ------        -----
Conditionals      (?()|)        The question mark makes sense for a conditional.
Inline Modifiers  (?imsx-imsx)  Might as well be a little bit compatible.
Non-backtracking  (+)           + requires more than * does.
Non-capturing     (*)           Suggestions welcome. :^)

So, my final suggestions are:

Perl 5      Perl 6
------      ------
(?:)        (*)
(?=)        {}
(?!)        {!}
(?<=)       {<}
(?)(+)
(?{})       ({}) returning empty string
(??{})      ({}) returning a string or regex
(?#)        N/A--obsolete

Please feel free to comment on these.

[0] Perl won't be the first tool to take advantage of this--lex uses
something similar for named subexpressions.

[1] Neither of these characters is ideal, however.  | looks like !, and
_ might reasonably be at the beginning of this sort of thing anyway.
Better suggestions are welcome.

[2] Mike originally had all the backwards matches as sexegers.  I think
this is a bad idea, but feel obligated to mention that.

[3] This seems a bit useless to me too.  It's probably more useful to
have a /r modifier on the entire regex.

[4] I changed the ordering for this one to avoid an ambiguity.

--Brent Dax <[EMAIL PROTECTED]>
@roles=map {"Parrot $_"} qw(embedding regexen Configure)

#define private public
--Spotted in a C++ program just before a #include
