Re: regex and
On Tue, Aug 10, 2010 at 9:00 PM, wrote:
> Once the & operator is in rakudo, though... I gather I /could/ do something
> like the following
>
> ^ [ * & ] $
>
> And this would in effect ensure that the sequence "abc" doesn't exist
> anywhere across the match for
>
> Is this correct?

Not quite, I suspect – the is still zero-width, so unless quantified zero-width assertions are DWIMmier than what's healthy, this is likely still equivalent to ^$. I think the following should DWYW, though:

^ [ [ . ] * & ] $

... though perhaps there is a shorter way to write [ . ]? Feels like there should be one ...

Eirik
RE: regex and
Back to your original advice...

> If you want to match an alphabetic string which does not include 'abc'
> anywhere, you can write this as
>
> ^ [ ]* $

I presume this only works here because is one character... if instead of I used anything more complicated (for example)

token name { <[A..Z]>* }

And then tried to do

^ [ ]* $

This wouldn't work since there's a wildcard within name.

Once the & operator is in rakudo, though... I gather I /could/ do something like the following

^ [ * & ] $

And this would in effect ensure that the sequence "abc" doesn't exist anywhere across the match for

Is this correct?
Re: regex and
philippe.beauch...@bell.ca wrote:
> On the & operator... are you saying that it would operate basically as
> expected...
> allowing sets of rules and'ed rather than or's with the | ?

Yes, with the limitation that both parts separated by & have to match the same length of string, so that for example

^ [ a+ & . ** 3 ]

could only match exactly 3 a's. If you don't want them tied to the same length, you can use look-ahead assertions instead.

Cheers,
Moritz
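The same-length restriction described above can be sketched outside Perl 6. What follows is a minimal Python illustration, not Rakudo code: `conj_fullmatch` is a hypothetical helper that treats `a & b` as "both patterns must match the same span", simplified here to the whole input string, and the look-ahead variant shows how the two conditions come apart.

```python
import re

def conj_fullmatch(pat_a: str, pat_b: str, text: str) -> bool:
    """Hypothetical stand-in for `a & b`: both patterns must match
    the *same* span, simplified here to the whole of `text`."""
    return (re.fullmatch(pat_a, text) is not None
            and re.fullmatch(pat_b, text) is not None)

# Analogue of [ a+ & . ** 3 ]: only a run of exactly three a's
# satisfies both sides at once.
candidates = ("aa", "aaa", "aaaa", "aab")
three_as = [s for s in candidates if conj_fullmatch(r"a+", r".{3}", s)]

# With a look-ahead the conditions are no longer tied to one length:
# "at least one a starts here" and, separately, "consume any three characters".
lookahead = re.match(r"(?=a+).{3}", "aabc")
```

Here only "aaa" survives the conjunction, while the look-ahead form happily consumes "aab" because the `a+` check is zero-width.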
RE: regex and
Great! That does it. Thanks. :) I realized my error on the anchors after sending... but didn't think of the * on the grouping. On the & operator... are you saying that it would operate basically as expected... allowing sets of rules and'ed rather than or's with the | ? --- Phil -Original Message- From: Moritz Lenz [mailto:mor...@faui2k3.org] Sent: August 10, 2010 2:09 PM To: Beauchamp, Philippe (6009210) Cc: perl6-language@perl.org Subject: Re: regex and Hi, philippe.beauch...@bell.ca wrote: > rule TOP > { > ^ > [ > & * > & > ] > $ > } The & syntax is specced, but it's not yet implemented in Rakudo. But note that is a zero-width assertion, so your example regex matches at the start of a string, if it does not begin with 'abc'. Since you anchor it to the end of string too, it can only ever match the empty string. You can achieve the same with just ^$. If you want to match an alphabetic string which does not include 'abc' anywhere, you can write this as ^ [ ]* $ Cheers, Moritz
Re: regex and
Hi, philippe.beauch...@bell.ca wrote: > rule TOP > { > ^ > [ > & * > & > ] > $ > } The & syntax is specced, but it's not yet implemented in Rakudo. But note that is a zero-width assertion, so your example regex matches at the start of a string, if it does not begin with 'abc'. Since you anchor it to the end of string too, it can only ever match the empty string. You can achieve the same with just ^$. If you want to match an alphabetic string which does not include 'abc' anywhere, you can write this as ^ [ ]* $ Cheers, Moritz
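The point about zero-width assertions carries over to other regex engines. As a rough sketch in Python's `re` (an analogue only; the names `anchored_only` and `no_abc` are made up for illustration), a negative look-ahead placed between two anchors consumes nothing, whereas repeating "assertion plus one letter" checks the condition at every position:

```python
import re

# Zero-width negative look-ahead between two anchors: nothing is consumed,
# so only the empty string can match (the "same as ^$" case).
anchored_only = re.compile(r"^(?!abc)$")

# Re-check the assertion before each letter, so "abc" cannot start anywhere
# inside the matched alphabetic string.
no_abc = re.compile(r"^(?:(?!abc)[A-Za-z])*$")

samples = ("", "xyzab", "abcxyz", "xxabcxx")
results = {s: bool(no_abc.match(s)) for s in samples}
```

The quantifier sits on the group rather than on the assertion itself, which is what makes the check apply across the whole match.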
Re: regex and xml/html/*ml
On Wed, 5 Jun 2002 [EMAIL PROTECTED] wrote: > Just read (skimmed) apocalypse 5, had one concern - it looks like we are on a > serious collision course with parsing the various *mls. > > before: > > m#..etc# > > after > > m#\\\# > > Also, the space being backslashed sort of bugs me. Surely there is going to be > a 'non-x' modifier? And perhaps a modifier to change the character for logical > tags from <> to something else (like <<>>, perhaps?) Hey, if that makes people more reluctant to use regexes to parse HTML or XML and leads them to use real parsers then this could be construed as a feature ;--) Michel Rodriguez Perl & XML http://www.xmltwig.com
RE: regex and xml/html/*ml
-- On Wed, 5 Jun 2002 13:21:39 Brent Dax wrote: >[EMAIL PROTECTED]: ># Just read (skimmed) apocalypse 5, had one concern - it looks ># like we are on a serious collision course with parsing the ># various *mls. ># ># before: ># ># m#..etc# ># ># after ># ># m#\\\# > >That's intentional. What will that regex do with this? > > > >That's interpreted the same way, but typed a bit differently. It won't >match your regex. > >The moral of the story is that you should not try to parse the *MLs with >regexen--use modules instead. > >--Brent Dax <[EMAIL PROTECTED]> >@roles=map {"Parrot $_"} qw(embedding regexen Configure) > >Early in the series, Patrick Stewart came up to us and asked how warp >drive worked. We explained some of the hypothetical principles . . . >"Nonsense," Patrick declared. "All you have to do is say, 'Engage.'" >--Star Trek: The Next Generation Technical Manual
RE: regex and xml/html/*ml
[EMAIL PROTECTED]: # Just read (skimmed) apocalypse 5, had one concern - it looks # like we are on a serious collision course with parsing the # various *mls. # # before: # # m#..etc# # # after # # m#\\\# That's intentional. What will that regex do with this? That's interpreted the same way, but typed a bit differently. It won't match your regex. The moral of the story is that you should not try to parse the *MLs with regexen--use modules instead. --Brent Dax <[EMAIL PROTECTED]> @roles=map {"Parrot $_"} qw(embedding regexen Configure) Early in the series, Patrick Stewart came up to us and asked how warp drive worked. We explained some of the hypothetical principles . . . "Nonsense," Patrick declared. "All you have to do is say, 'Engage.'" --Star Trek: The Next Generation Technical Manual
Re: Regex and Matched Delimiters
Michael G Schwern wrote in perl.perl6.language :
> On Tue, Apr 23, 2002 at 11:11:28PM -0500, Me wrote:
>> Third, I was thinking that having perl 6 regexen have /s on
>> by default would be easy for perl 5 coders to understand;
>> not too hard to get used to; and have no negative effects
>> for existing coders beyond getting used to the change.
>
> I'm jumping in the middle of a conversation here, but consider the
> problem of .* matching newlines by default and greediness.
>
> /(foo.*)$/, /(foo.*)$/m and /(foo.*)$/s

This is so old-fashioned.

> when matching against something like "foo\nwiffle\nbarfoo\n" One matches the
> last line. One matches the first line. And one matches all three lines.

And by the way, there's the semantic inaccuracy of $ transparently matching newlines, combined with the obscure variants \z and \Z. This needs (IMHO) some reshaping.

--
Rafael Garcia-Suarez
I'll better skip() some releases until it is() ok() to use Test::More without() going insane(). Any more than I already am, that is(). -- Tels in the perl-qa mailing list
Re: Regex and Matched Delimiters
> when matching against something like "foo\nwiffle\nbarfoo\n"

>/(foo.*)$/ # matches the last line

/(foo[^\n]*)$/ # assuming perl 6 meaning of $, end of string

>/(foo.*)$/m # matches the first line

/(foo[^\n]*)$$/ # assuming perl 6 meaning of $$, end of line

or

/(foo.*?)$$/

>/(foo.*)$/s # matches all three lines

/(foo.*)$/

--
ralph
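For readers who want to see the three Perl 5 behaviours side by side, here is a small sketch using Python's `re`, whose plain, MULTILINE and DOTALL modes behave like Perl 5's unmodified, /m and /s forms for this particular input. It is only an analogue; the variable names are invented for the example.

```python
import re

text = "foo\nwiffle\nbarfoo\n"

plain = re.search(r"(foo.*)$", text)          # like /(foo.*)$/
multi = re.search(r"(foo.*)$", text, re.M)    # like /(foo.*)$/m
dotall = re.search(r"(foo.*)$", text, re.S)   # like /(foo.*)$/s

# Where each capture starts and ends in the 18-character input.
spans = {
    "plain": plain.span(1),
    "multi": multi.span(1),
    "dotall": dotall.span(1),
}
```

The captures land on the last line, the first line, and the whole three-line string respectively, which is the ambiguity the explicit `[^\n]` and `$$` spellings are meant to remove.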
Re: Regex and Matched Delimiters
On Tue, Apr 23, 2002 at 11:11:28PM -0500, Me wrote: > Third, I was thinking that having perl 6 regexen have /s on > by default would be easy for perl 5 coders to understand; > not too hard to get used to; and have no negative effects > for existing coders beyond getting used to the change. I'm jumping in the middle of a conversation here, but consider the problem of .* matching newlines by default and greediness. /(foo.*)$/, /(foo.*)$/m and /(foo.*)$/s when matching against something like "foo\nwiffle\nbarfoo\n" One matches the last line. One matches the first line. And one matches all three lines. -- Michael G. Schwern <[EMAIL PROTECTED]>http://www.pobox.com/~schwern/ Perl Quality Assurance <[EMAIL PROTECTED]> Kwalitee Is Job One Consistency? I'm sorry, Sir, but you obviously chose the wrong door. -- Jarkko Hietaniemi in <[EMAIL PROTECTED]>
Re: Regex and Matched Delimiters
> > : I'd expect . to match newlines by default. I forgot, fourth, this simplifies the rule for . -- it would become period matches any char, period. Fifth, it makes the writing of "match anything but newline" into an explicit [^\n], which I consider a good thing. Of course, all this is minor stuff. But I can't get my head around parse trees and grammars, so I'll continue to fiddle around spraying a bit of graffiti here and there on the bikeshed. -- ralph
Re: Regex and Matched Delimiters
> : I'd expect . to match newlines by default. For a . that > : didn't match newlines, I'd expect to need to use [^\n]. > > But . has never matched newlines by default, not even in grep. Perhaps. But: First, I would have thought you *can't* make . match newlines in grep, period. If so, then when perl is handling a multi-line string, it is handling a case grep never encounters. Second, I think the perl 5 default is the wrong one from the point of view of a typical newbie's guess. Third, I was thinking that having perl 6 regexen have /s on by default would be easy for perl 5 coders to understand; not too hard to get used to; and have no negative effects for existing coders beyond getting used to the change. -- ralph
Re: Regex and Matched Delimiters
On Tue, 2002-04-23 at 12:48, Larry Wall wrote:
> Brent Dax writes:
> : # \t also
> : # \n also or (latter matching
> : logical newline)
> : # \r also
> : # \f also
> : # \a also
> : # \e also
> :
> : I can tell you right now that these are going to screw people up.
> : They'll try to use these in normal strings and be confused when it
> : doesn't work. And you probably won't be able to emit a warning,
> : considering how much CGI Perl munches.
>
> I can see pragmatic variants in which those *do* interpolate by default.
> And pragmatic variants where they don't.

If you put them in one, put them in the other, HOWEVER, there's a strong pragmatic reason for neither that I can see. HTML/XML/SGML

I hate to say it, but if <> interpolates in everything cleanly with no overloading, the *ML camps will thank you deeply. How often I've written:

qq{$content}

I cannot tell you, but it's large. Why not use {} for this and add an {eval:code}?

> I'm just wondering how far I can drive the principle that {} is always
> a closure (even though it isn't). I admit that it's probably overkill
> here, which is why there are question marks.

I like the idea, but I don't think it fits. On the other hand, if inside all interpolating operators {} is the special thing that gets interpolated (and NOTHING else), I could see liking the new look:

qq{a${x}b} => qq{a{$x}b}
qr{a\Q${x}\Eb$} => qr{a{q:$x}b$}
qr{a${x}b$} => qr{a{$x}b$}
q{a}.eval($x).q{b} => qq{a{e:$x}b} or qq{a{{$x}}b}
"ajs\@ajs.com" => qq{[EMAIL PROTECTED]}
"ajs". @{ajs} .".com" => qq{ajs{@ajs}.com}

I know it's a departure from your original idea, but it certainly unifies the syntax nicely:

qq{Hello, World!{nl}}
qr{Hello, World!{nl}}

> With respect to Perl 5, I'm trying to unhijack curlies as much as possible.

Ooops :-)
Re: Regex and Matched Delimiters
Brent Dax writes: : Sorry to reply to the same message twice, but I just noticed something. : : Larry Wall: : # {n,m} : : Isn't that the only use of angle brackets as a quantifier? That's going : to make parsing more difficult... How so? It's just a one-character lookahead to see if it's a digit. But we could actually use a more general syntax: Larry
Re: Regex and Matched Delimiters
Me writes: : > /pat/i m:i/pat/ or // or even m ??? : : Why lose the modifier-following-final-delimiter : syntax? Is this to avoid a parsing issue, or : because it's linguistically odd to have a modifier : at the end? Haven't decided for sure to lose it, but it does have several problems. First is the parsing issue, but there's also what in natural language is called the "end weight" problem. We often rearrange our sentences in English so that the short things come first and the long things come last. That's why you choose indirect object syntax sometimes and not others. Try turning either of these to the other form: I gave him a big, smelly tuna-fish and cucumber sandwich. I gave the sandwich to a big, smelly tuna fisherman and his dog "Cucumber". Now, options are always little, so it seems that they should come early. : > /^pat$/m /^^pat$$/ : : What's the mnemonic here? It feels the wrong : way round -- like a single ^ or $ should match : at newlines, double ^ or $ should only match : at start/end string. Well, I thought of it as ^^ or $$ matching potentially multiple places in the string. : Ah. The newline matches between the ^^ or $$. : That works. Except that the newline doesn't match between the characters. You could say /$$\n^^/ for instance. : Then there's the PID issue. Hmm. How to save $$ : (it is nice for one liners)? $PID is only two chars worse. (The * of $*PID is optional.) : Sorry if this is a dumb suggestion, but could you have : just one assertion, say ^$, that alternates matching : just before and just after a newline? ^$ matches a null string. That aside, I don't think stateful assertions would be unconfusing in the extreme. : > /./s // or /<.>/ ??? : : I'd expect . to match newlines by default. For a . that : didn't match newlines, I'd expect to need to use [^\n]. But . has never matched newlines by default, not even in grep. Possibly some editors do it that way, but if so, it's non-standard. : > space (or \h for "horizontal"?)
: : Can one quote a substring of a regex? In a later part you : say that \Q...\E is going away, so it seems not. It would be : nice to say something like: : : /foo bar baz 'qux waldo' emerson/ : : and have the space between qux and waldo be literal. : Similar arguments apply more broadly so that one : could escape the usual meaning of metacharacters etc. Well, <"qux waldo"> could be made to mean that, I suppose. For that matter, so might \q{qux waldo}. Er, \q? : > \Lstring\E \L : > \Ustring\E \U : : Maybe, if I wasn't too far off with the quote mark : suggestion above, then \L'string' would be more : natural. Maybe \L and \q are in the same class, in which case that would work. : > (?#...) {"..."} :-) : : Will plain # comments work in p6 regexen? Yes, just as in /x. And there's no ambiguity in the end delimiter any more because we parse in one pass. : > (?:...) <:...> : > (?=...) : > (?!...) : > (?<=...) : > (? : > (?>...) : : Hmm. So <> are clustering just like (). Yes, and you can quantify them where it makes sense. : One difference is that () always capture whereas <> : only do so sometimes. Oh, and {} can too. Eh? <> never capture. None of those constructs above capture. Nothing inside a {} can capture anything that influences the paren count outside the {}, because any inner regex has its own paren count. : () are no longer used for clever stuff, <> are instead. : And {}. Basically, yes. : Hmm. Time for bed. Why? I just got up. :-) Larry
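The split between string anchors and line anchors proposed in this message can be tried out in an existing engine. The sketch below uses Python's `re` as an approximation: `\A` and `\Z` play the role suggested for plain ^ and $, and ^ and $ under `re.MULTILINE` play the role suggested for ^^ and $$. The mapping is an assumption made for illustration, not Perl 6 behaviour.

```python
import re

text = "one\ntwo\nthree"

# Line anchors (the role proposed for ^^ and $$): can match at several places.
per_line = re.findall(r"^\w+$", text, re.MULTILINE)

# String anchors (the role proposed for plain ^ and $): one place each,
# so a pattern that cannot cross a newline finds nothing here.
whole = re.findall(r"\A\w+\Z", text)

# The newline sits between an end-of-line and a start-of-line position,
# in the spirit of the /$$\n^^/ example.
boundaries = [m.start() for m in re.finditer(r"$\n^", text, re.MULTILINE)]
```

The last pattern shows that the newline is matched by `\n` itself and not by the anchors, which only assert positions on either side of it.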
RE: Regex and Matched Delimiters
Sorry to reply to the same message twice, but I just noticed something. Larry Wall: # {n,m} Isn't that the only use of angle brackets as a quantifier? That's going to make parsing more difficult... --Brent Dax <[EMAIL PROTECTED]> @roles=map {"Parrot $_"} qw(embedding regexen Configure) #define private public --Spotted in a C++ program just before a #include
Re: Regex and Matched Delimiters
Aaron Sherman writes: : On Mon, 2002-04-22 at 21:53, Larry Wall wrote: : : > * Parens always capture. : > * Braces are always closures. : > * Square brackets are always character classes. : > * Angle brackets are always metasyntax (along with backslash). : > : > So a first whack at the differences might be: : [...] : > space (or \h for "horizontal"?) : > {n,m} : > : > \t also : : I want to know how he does this!! Could have something to do with the fact that I've been banging my head against this for a couple of months already... : We sit around scratching out heads : looking for a syntax that fits and refines and he jumps in with : something that redefines and simplifies. Larry is wasted on Perl. He : needs to run for office ;-) Agh, no! I'm okay at simplifying, but I'm terrible at oversimplifying. : > \Lstring\E \L : > \Ustring\E \U : : This one boggles me. Wouldn't that be something like: : : or string # ;-) Well, makes sense only if <> works in ordinary double quotes. : Seriously, it seems that "\L" would be confusing. Potentially, except that you almost never use it on anything but variable interpolations. So \L<$foo> would be a better example. The confusing thing is that $foo would not be assumed to be a regular expression, whereas it would in bare <$foo> (at least in a regex). : > \Q$var\E $var always assumed literal, so $1 is literal : backref : > $var <$var> assumed to be regex : : Very nice. I can get behind this, and a lot of people will thank you who : have to maintain code. Well, almost anything is an improvement over the current syntax. : > =~ $re =~ /<$re>/ ouch? : : If $re is a regexp, wouldn't "$str =~ $re" turn into "$re.match($str)"? : Perhaps "$re.m $str" which is no more typing and pretty clear to me. Sure, but I was illustrating the situation of a non-qr string being forced to be a regex. : > Obviously the and syntaxes will be user extensible. : > We have to be able to support full grammars.
I consider it a feature : > that looks like a non-terminal in standard BNF notation. I do : > not consider it a misfeature that resembles an HTML or XML tag, : > since most of those languages need to be matched with a fancy rule : > named anyway. : : It's too bad that would be messy with standard Perl //-enclosed : regexes, as it would be a nice way to pass parameters to user-defined : tags. It would also allow XML-like propagation of results: : : xyz Gee, maybe we could make a way for people to use alternate delimiters like they've always done with s///. :-) Larry
Re: Regex and Matched Delimiters
Brent Dax writes: : # ?pat? // or even m ??? : : Whoa, those are moving to the front?!? The problem with options in general is that they can't easily modify parsing if they come in back. Now in the particular case of /f and /i, it probably doesn't matter. But I was trying to see if there was some way to do away with trailing options altogether. This might even extend to things like: qq:s"$interpolates @doesn't %doesn't" And that's definitely a situation where it changes the parse. Hmm, if strings have options, they're probably additive, so to add scalar interpolation you'd want to base it on "q", not "qq": q:s"$interpolates @doesn't %doesn't" On the other hand, that doesn't work for the other things like "qr", so maybe any of :s, :a, :h turn off default interpolations, so qr:a would only interpolate arrays, for instance. : # /pat/x /pat/ : # /^pat$/m /^^pat$$/ : : That's...odd. Is $$ (the variable) going away? Maybe. It'd be $*PID if so, since it's truly global to the process. But if not, we could special case $$ inside regexes, just as we already special case $ itself. : # \p{prop} <+prop> ??? : # \P{prop} <-prop> ??? : : Intriguing. Yeah, especially when you start stacking them. But maybe we're treading on [...] territory. It could be argued that <...> is just a generalized form of POSIX's [:...:] construct : # \t also : # \n also or (latter matching : logical newline) : # \r also : # \f also : # \a also : # \e also : : I can tell you right now that these are going to screw people up. : They'll try to use these in normal strings and be confused when it : doesn't work. And you probably won't be able to emit a warning, : considering how much CGI Perl munches. I can see pragmatic variants in which those *do* interpolate by default. And pragmatic variants where they don't. : # \033 same : # \x1B same : # \x{263a} \x<263a> ??? : : Why? Wouldn't we want the same thing to work in quoted strings? (Or : are those changing syntaxes too?)
I'm just wondering how far I can drive the principle that {} is always a closure (even though it isn't). I admit that it's probably overkill here, which is why there are question marks. : # \c[ same : # \N{name} : # \l same : # \u same : # \Lstring\E \L : # \Ustring\E \U : : So that's changed from whenever you talked about \q{} ? Possibly. Again, the question is whether {} more strongly imply something that's not true. But curlies were so overloaded in Perl 5 that I don't think people are going to necessarily expect them to do only one thing. Still, if <> are taking over the role of "unmarked metasyntactic delimiters", maybe they belong here too. : # \E gone : # [\040\t] \h plus any Unicode horizontal whitespace : # [\r\n\ck] \v plus any Unicode vertical whitespace : # : # \b same : # \B same : : # \A ^ : # \Z same? : # \z $ : : Are you sure that optimizes for the common case? No, I'm not sure, but we have to clean up the \A...\z mess somehow. : # \G , but assumed in nested patterns? : # : # \1 $1 : # : # \Q$var\E $var always assumed literal, so $1 is literal : backref : : So these are reinterpolated every time you backtrack? Are you *trying* : to destroy regex performance? :^) They're not interpolated. They're matched, as in string comparison, just as backrefs are matched right now. : # $var <$var> assumed to be regex : : What if $var is a qr//ed object? Then it's a pretty easy assumption that it's a regex. :-) : # =~ $re =~ /<$re>/ ouch? : : I don't see the win. No difference if $re is qr//, but if it's not, that is the syntax for forcing $re to be interpreted as a regex. : # (??{$rule}) : # (?{ code }) { code } with failure semantics : # (?#...) {"..."} :-) : # (?:...) <:...> : # (?=...) : # (?!...) : # (?<=...) : # (? : : Cute. (Wait a minute, aren't those reversed?)
Nope, I realized they were ambiguous depending on whether you think of them as declarative or operational, but I settled on the declarative reading because it works with their being assertions. All the other options I could think of are either really clunky or similarly ambigu
Re: Regex and Matched Delimiters
On Tue, 2002-04-23 at 04:32, Ariel Scolnicov wrote: > Larry Wall <[EMAIL PROTECTED]> writes: > > [...] > > > /pat/x /pat/ > > How do I do a "no /x"? I know that commented /x'ed regexps are easier > reading (I even write them myself, I swear I do!), but having to > escape whitespace is often very annoying. Will I really have to > escape all spaces (or use , below)? > I'm not sure that that's a bad thing. Regular expressions are the hairiest, ugliest thing in Perl. If they change in this way, I see them getting a tad more verbose, and a whole lot more readable and maintainable. Besides you can always do this: $str = "COPYING file for more information"; /$str/ since scalars will be interpolated as quoted by default.
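The distinction drawn here, a variable interpolated "as quoted" versus one interpolated as a pattern, has a direct counterpart in engines that make you escape by hand. A minimal Python sketch, in which the strings `needle` and `text` are made-up examples chosen to contain metacharacters:

```python
import re

# Hypothetical example data: the needle contains regex metacharacters.
needle = "COPYING file (v1.0)"
text = "See the COPYING file (v1.0) for more information"

# Interpolated "as quoted": every character of the needle is literal.
literal = re.search(re.escape(needle), text)

# Interpolated as a pattern: the parentheses become a group and the dot a
# wildcard, so the literal "(" in the text is never matched.
as_regex = re.search(needle, text)
```

Making the literal reading the default, as described above, would mean the first form needs no extra call and the second has to be asked for explicitly.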
Re: Regex and Matched Delimiters
On Wed, 24 Apr 2002, Iain Truskett wrote: > * Larry Wall ([EMAIL PROTECTED]) [23 Apr 2002 11:56]: > > [...] > > * Parens always capture. > > Maybe I missed something in the rest of the details, but is anything > going to replace non-capturing parens? It's just that I do find them > quite useful. Yes. /indeed <:this>+ wont capture/
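For comparison, this is how grouping without capturing looks in an engine that keeps the Perl 5 spelling. The Python sketch below is only an analogue of the proposed `<:...>` form; the sample string is invented.

```python
import re

# (?:...) groups "this" so it can be quantified, but does not create a capture;
# only the parenthesised \w+ is numbered.
m = re.match(r"indeed (?:this)+ (\w+)", "indeed thisthis works")
captured = m.groups()
```

Because the repeated group does not count, the first capture is the trailing word, which is the behaviour the reply promises for the new syntax.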
Re: Regex and Matched Delimiters
* Larry Wall ([EMAIL PROTECTED]) [23 Apr 2002 11:56]: [...] > * Parens always capture. Maybe I missed something in the rest of the details, but is anything going to replace non-capturing parens? It's just that I do find them quite useful. -- iain.
RE: Regex and Matched Delimiters
> # =~ $re =~ /<$re>/ ouch? > > I don't see the win. Naturally =~ $re is a bit cleaner, but we can't do that because =~ is smart match, not regex match. > # (?=...) > # (?!...) > # (?<=...) > # (? > > Cute. (Wait a minute, aren't those reversed?) Hehe. I thought that was cool. /foobar/ / foobar/ You see, foobar before snafoo, which is what it is. After snafoo, foobar. It reads very nicely. Luke
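The "foobar before snafoo" and "foobar after snafoo" readings correspond to look-ahead and look-behind in existing engines. A small Python sketch of that declarative reading follows; it uses Perl 5 style syntax, and the sample text is invented for illustration.

```python
import re

text = "foobar snafoo foobar"

# foobar *before* snafoo: look-ahead, the assertion follows the match.
before = [m.start() for m in re.finditer(r"foobar(?= snafoo)", text)]

# foobar *after* snafoo: look-behind, the assertion precedes the match.
after = [m.start() for m in re.finditer(r"(?<=snafoo )foobar", text)]
```

Each pattern picks out a different occurrence of the same word, and neither consumes the neighbouring "snafoo".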
Re: Regex and Matched Delimiters
On Mon, 2002-04-22 at 21:53, Larry Wall wrote: > * Parens always capture. > * Braces are always closures. > * Square brackets are always character classes. > * Angle brackets are always metasyntax (along with backslash). > > So a first whack at the differences might be: [...] > space (or \h for "horizontal"?) > {n,m} > > \talso I want to know how he does this!! We sit around scratching out heads looking for a syntax that fits and refines and he jumps in with something that redefines and simplifies. Larry is wasted on Perl. He needs to run for office ;-) > \Lstring\E\L > \Ustring\E\U This one boggles me. Wouldn't that be something like: or string # ;-) Seriously, it seems that "\L" would be confusing. > \Q$var\E $varalways assumed literal, so $1 is literal backref > $var <$var> assumed to be regex Very nice. I can get behind this, and a lot of people will thank you who have to maintain code. > =~ $re=~ /<$re>/ ouch? If $re is a regexp, wouldn't "$str =~ $re" turn into "$re.match($str)"? Perhaps "$re.m $str" which is no more typing and pretty clear to me. > Obviously the and syntaxes will be user extensible. > We have to be able to support full grammars. I consider it a feature > that looks like a non-terminal in standard BNF notation. I do > not consider it a misfeature that resembles an HTML or XML tag, > since most of those languages need to be matched with a fancy rule > named anyway. It's too bad that would be messy with standard Perl //-enclosed regexes, as it would be a nice way to pass parameters to user-defined tags. It would also allow XML-like propagation of results: xyz
Re: Regex and Matched Delimiters
> /pat/i m:i/pat/ or // or even m ??? Why lose the modifier-following-final-delimiter syntax? Is this to avoid a parsing issue, or because it's linguistically odd to have a modifier at the end? > /^pat$/m /^^pat$$/ What's the mnemonic here? It feels the wrong way round -- like a single ^ or $ should match at newlines, double ^ or $ should only match at start/end string. Ah. The newline matches between the ^^ or $$. That works. Then there's the PID issue. Hmm. How to save $$ (it is nice for one liners)? Sorry if this is a dumb suggestion, but could you have just one assertion, say ^$, that alternates matching just before and just after a newline? > /./s // or /<.>/ ??? I'd expect . to match newlines by default. For a . that didn't match newlines, I'd expect to need to use [^\n]. > space (or \h for "horizontal"?) Can one quote a substring of a regex? In a later part you say that \Q...\E is going away, so it seems not. It would be nice to say something like: /foo bar baz 'qux waldo' emerson/ and have the space between qux and waldo be literal. Similar arguments apply more broadly so that one could escape the usual meaning of metacharacters etc. > \Lstring\E \L > \Ustring\E \U Maybe, if I wasn't too far off with the quote mark suggestion above, then \L'string' would be more natural. > (?#...) {"..."} :-) Will plain # comments work in p6 regexen? > (?:...) <:...> > (?=...) > (?!...) > (?<=...) > (? > (?>...) Hmm. So <> are clustering just like (). One difference is that () always capture whereas <> only do so sometimes. Oh, and {} can too. () are no longer used for clever stuff, <> are instead. And {}. Hmm. Time for bed. -- ralph
Re: Regex and Matched Delimiters
Larry Wall <[EMAIL PROTECTED]> writes: [...] > /pat/x /pat/ How do I do a "no /x"? I know that commented /x'ed regexps are easier reading (I even write them myself, I swear I do!), but having to escape whitespace is often very annoying. Will I really have to escape all spaces (or use , below)? This also marks a significant departure from UN*X-style regexps. One reason learning Perl's regexp language was so convenient (to me) was that most of what I knew of UN*X regexps was applicable. Changing the behaviour of a rather useful character (like ASCII 32) is going to produce many references to the FAQ "Why doesn't /a word/ match 'a word'?". (Having to escape #s is not as bad, as they are less common). [...] -- Ariel Scolnicov|http://3w.compugen.co.il/~ariels Compugen Ltd. |[EMAIL PROTECTED] 72 Pinhas Rosen St.|Tel: +972-3-7658117 "fast, good, and cheap; Tel-Aviv 69512, ISRAEL |Fax: +972-3-7658555 pick any two!"
RE: Regex and Matched Delimiters
Piers Cawley: # "Brent Dax" <[EMAIL PROTECTED]> writes: # > Larry Wall: # > That's...odd. Is $$ (the variable) going away? # > # > # /./s // or /<.>/ ??? # > # > I think that . is too common a metacharacter to be # relegated to this. # # I think you failed to notice that '/s' on the regex. In # general . will still mean . but if you want it to match # *anything* including a new line, you have to call it <.>. # Personally, I don't have a problem with that. Ah, you're right. My bad. # > # space (or \h for "horizontal"?) # > # > Same thinking as '.'. # # The golfers aren't going to like it for sure. But most of the # time when I'm doing production code I have /x turned on # anyway, and in that context, if I want to match a space and # only a space, I have to do [ ] anyway. # # It might be nice if we could have m:X// mean 'space and hash # match themselves'. I was thinking that would replace \s. If that isn't the case, I have no real complaint (if you can turn off /x). # > # \talso # > # \nalso or (latter matching # > logical newline) # > # \ralso # > # \falso # > # \aalso # > # \ealso # > # > I can tell you right now that these are going to screw people up. # > They'll try to use these in normal strings and be confused when it # > doesn't work. And you probably won't be able to emit a warning, # > considering how much CGI Perl munches. # # But assigning meaning to < and > is going to do that anyway. Not if the things are meaningless outside of regexes. For example, lookahead sequences make absolutely no sense in a quoted string. --Brent Dax <[EMAIL PROTECTED]> @roles=map {"Parrot $_"} qw(embedding regexen Configure) #define private public --Spotted in a C++ program just before a #include
Re: Regex and Matched Delimiters
"Brent Dax" <[EMAIL PROTECTED]> writes: > Larry Wall: > That's...odd. Is $$ (the variable) going away? > > # /./s// or /<.>/ ??? > > I think that . is too common a metacharacter to be relegated to > this. I think you failed to notice that '/s' on the regex. In general . will still mean . but if you want it to match *anything* including a new line, you have to call it <.>. Personally, I don't have a problem with that. > # space(or \h for "horizontal"?) > > Same thinking as '.'. The golfers aren't going to like it for sure. But most of the time when I'm doing production code I have /x turned on anyway, and in that context, if I want to match a space and only a space, I have to do [ ] anyway. It might be nice if we could have m:X// mean 'space and hash match themselves'. > # \t also > # \n also or (latter matching > logical newline) > # \r also > # \f also > # \a also > # \e also > > I can tell you right now that these are going to screw people up. > They'll try to use these in normal strings and be confused when it > doesn't work. And you probably won't be able to emit a warning, > considering how much CGI Perl munches. But assigning meaning to < and > is going to do that anyway. -- Piers "It is a truth universally acknowledged that a language in possession of a rich syntax must be in need of a rewrite." -- Jane Austen?
Re: Regex and Matched Delimiters
Larry Wall <[EMAIL PROTECTED]> writes: > /^pat$/m /^^pat$$/ $$ is no longer the current PID? Or will we have to call that '${$}' in a regex? -- Piers "It is a truth universally acknowledged that a language in possession of a rich syntax must be in need of a rewrite." -- Jane Austen?
RE: Regex and Matched Delimiters
Larry Wall:
# Me writes:
# : > Very nice (but, I assume you meant {$foo data})!
# :
# : I didn't mean that (even if I should have).
# :
# : Aiui, Mike's final suggestion was that parens end up
# : doing all the (ops data) tricks, and braces are used
# : purely to do code insertions. (I really liked that idea.)
# :
# : So:
# :
# :   Perl 5        Perl6
# :   (data)        ( data)
# :   (?opsdata)    (ops data)
# :   ({})          {}
#
# Hmm. Let me spill a few beans about where I'm going with A5.
# I've been thinking similar thoughts about the problem of
# overloading parens so heavily in Perl 5, but I'm going in a
# slightly different direction with it. The basic principles
# for the new regexen are:
#
#   * Parens always capture.
#   * Braces are always closures.
#   * Square brackets are always character classes.
#   * Angle brackets are always metasyntax (along with backslash).
#
# So a first whack at the differences might be:
#
#   Old             New
#   ---             ---
#   //              // ???
#   ?pat?           // or even m ???

Whoa, those are moving to the front?!?

#   /pat/x          /pat/
#   /^pat$/m        /^^pat$$/

That's...odd. Is $$ (the variable) going away?

#   /./s            // or /<.>/ ???

I think that . is too common a metacharacter to be relegated to this.

#   \p{prop}        <+prop> ???
#   \P{prop}        <-prop> ???

Intriguing.

#   space           (or \h for "horizontal"?)

Same thinking as '.'.

#   {n,m}

Ah, OK.

#   \t              also
#   \n              also or (latter matching logical newline)
#   \r              also
#   \f              also
#   \a              also
#   \e              also

I can tell you right now that these are going to screw people up. They'll try to use these in normal strings and be confused when it doesn't work. And you probably won't be able to emit a warning, considering how much CGI Perl munches.

#   \033            same
#   \x1B            same
#   \x{263a}        \x<263a> ???

Why? Wouldn't we want the same thing to work in quoted strings? (Or are those changing syntaxes too?)

#   \c[             same
#   \N{name}
#   \l              same
#   \u              same
#   \Lstring\E      \L
#   \Ustring\E      \U

So that's changed from whenever you talked about \q{} ?

#   \E              gone
#   [\040\t]        \h      plus any Unicode horizontal whitespace
#   [\r\n\ck]       \v      plus any Unicode vertical whitespace
#
#   \b              same
#   \B              same
#   \A              ^
#   \Z              same?
#   \z              $

Are you sure that optimizes for the common case?

#   \G              , but assumed in nested patterns?
#
#   \1              $1
#
#   \Q$var\E        $var    always assumed literal, so $1 is literal backref

So these are reinterpolated every time you backtrack? Are you *trying* to destroy regex performance? :^)

#   $var            <$var>  assumed to be regex

What if $var is a qr//ed object?

#   =~ $re          =~ /<$re>/      ouch?

I don't see the win.

#   (??{$rule})
#   (?{ code })     { code } with failure semantics
#   (?#...)         {"..."} :-)
#   (?:...)         <:...>
#   (?=...)
#   (?!...)
#   (?<=...)
#   (?

Cute. (Wait a minute, aren't those reversed?)

#   (?>...)
#   (?(cond)t|f)    Not sure. Could just use { if ... } ?
# Obviously the and syntaxes will be user
# extensible. We have to be able to support full grammars. I
# consider it a feature that looks like a non-terminal in
# standard BNF notation. I do not consider it a misfeature
# that resembles an HTML or XML tag, since most of those
# languages need to be matched with a fancy rule named anyway.

But that *does* make it harder to define the fancy rules. I could see someone defining rules like:

    'gt' => qr/\ qr/\>/

just to get around backslashing everything in sight.

# An interesting idea would be that if you say
#
#   m
#
# or
#
#   m{code}
#
# it's as if you said
#
#   m//
#
# or
#
#   m/{code}/

I don't know about that one. I often use {} as delimiters on regexen because it's a character that doesn't occur in data very often. I think the gain of two characters isn't as critical as the loss of options.

Understand, I'm not a regex Luddite. I've been working with yacc and lex a lot lately, so I have at least a hint of how powerful formal parsing is--and I love all of these features. However, I think that syntactically a l
Re: Regex and Matched Delimiters
> (?=...)
> (?!...)
> (?<=...)
> (?
> (?>...)

Yummy :) I'd say this is about perfect. The look(ahead|behind)s, er, look<:ahead|behind>s are used seldom enough that this is practical. And it's I much clea[nr]er than that (?=...) crap. (Think I'm going overboard with this tregext?)

And are you going to reveal the method by which you define your own s, so we can overload it with personal ungrounded opinions? (On the other hand, it'd probably just stick and not move, because you said it.)

> Sorry if this is a bit delirious--I'm fighting off some kind of
> infection, and my nights have been shortchanged lately by the
> neighborhood panhandler who doesn't seem to understand either
> complicated concepts like "bedtime" or simple concepts like "no".

bed...what?

Luke
Re: Regex and Matched Delimiters
Me writes:
: > Very nice (but, I assume you meant {$foo data})!
:
: I didn't mean that (even if I should have).
:
: Aiui, Mike's final suggestion was that parens end up
: doing all the (ops data) tricks, and braces are used
: purely to do code insertions. (I really liked that idea.)
:
: So:
:
:   Perl 5        Perl6
:   (data)        ( data)
:   (?opsdata)    (ops data)
:   ({})          {}

Hmm. Let me spill a few beans about where I'm going with A5. I've been thinking similar thoughts about the problem of overloading parens so heavily in Perl 5, but I'm going in a slightly different direction with it. The basic principles for the new regexen are:

    * Parens always capture.
    * Braces are always closures.
    * Square brackets are always character classes.
    * Angle brackets are always metasyntax (along with backslash).

So a first whack at the differences might be:

    Old             New
    ---             ---
    //              // ???
    ?pat?           // or even m ???
    /pat/x          /pat/
    /^pat$/m        /^^pat$$/
    /./s            // or /<.>/ ???
    \p{prop}        <+prop> ???
    \P{prop}        <-prop> ???
    space           (or \h for "horizontal"?)
    {n,m}
    \t              also
    \n              also or (latter matching logical newline)
    \r              also
    \f              also
    \a              also
    \e              also
    \033            same
    \x1B            same
    \x{263a}        \x<263a> ???
    \c[             same
    \N{name}
    \l              same
    \u              same
    \Lstring\E      \L
    \Ustring\E      \U
    \E              gone
    [\040\t]        \h      plus any Unicode horizontal whitespace
    [\r\n\ck]       \v      plus any Unicode vertical whitespace

    \b              same
    \B              same
    \A              ^
    \Z              same?
    \z              $
    \G              , but assumed in nested patterns?

    \1              $1

    \Q$var\E        $var    always assumed literal, so $1 is literal backref
    $var            <$var>  assumed to be regex
    =~ $re          =~ /<$re>/      ouch?
    (??{$rule})
    (?{ code })     { code } with failure semantics
    (?#...)         {"..."} :-)
    (?:...)         <:...>
    (?=...)
    (?!...)
    (?<=...)
    (?
    (?>...)
    (?(cond)t|f)    Not sure. Could just use { if ... }

Obviously the and syntaxes will be user extensible. We have to be able to support full grammars. I consider it a feature that looks like a non-terminal in standard BNF notation. I do not consider it a misfeature that resembles an HTML or XML tag, since most of those languages need to be matched with a fancy rule named anyway.

An interesting idea would be that if you say

    m

or

    m{code}

it's as if you said

    m//

or

    m/{code}/

The latter is particularly interesting to me in that I can see uses for patterns that are Perl code at the top level rather than regex literal.

Any closure within a regular expression has full access to the current state object for the match. So most of the RFCs proposing ad hoc mechanisms for saving submatches in various kinds of variables can be handled with closures.

    /(...)(...)(...) { @array = .all } /

or

    /(...) { $first = $+ } (...) { $second = $+ } (...) { $third = $+ }/

or

    / () () { .node = ["if",$1,$2] } /    # shades of yacc

or whatever. Could have a <$foo=...> as syntactic sugar, perhaps. But we need the general mechanism for building up parse trees of arrays of hashes of arrays of arrays of hashes of arrays of hashes of...

I haven't decided yet whether matches embedded in the closure should automatically pick up where the outer match is, or whether there should be some explicit match op to mean that, much like \G only better. I'm thinking when the current topic is a match state, we automatically continue where we left off, and require explicit =~ to start an unrelated match.

I also haven't committed to any particular mechanism for defining a set of related rules in a grammar. Obviously it needs to be a good enough mechanism to parse Perl and its variants, which means it probably needs to be OO based, and you make new grammars by derivation from the base grammar and overriding the rules you want to change.

Sorry if this is a bit delirious--I'm fighting off some kind of infection, and my nights have been shortchanged lately by the neighborhood panhandler who doesn't seem to understand either complicated concepts like "bedtime" or simple concepts like "no".

Larry
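For comparison, the idea of saving submatches from closures inside the pattern can already be approximated in Perl 5 with `(?{ code })` blocks and `$^N` (the text of the most recently closed group). The following is only a rough Perl 5 analogue of the `{ $first = $+ }` form, not the proposed Perl 6 syntax; the variable names and the sample string are made up for the sketch.

```perl
use strict;
use warnings;

# Package variables keep the embedded code blocks simple on older perls.
our ($first, $second, $third);

# Each (?{ ... }) block runs as soon as the group before it has matched;
# $^N holds the text of the most recently closed group.
"abcdefghi" =~ /(...)(?{ $first = $^N })(...)(?{ $second = $^N })(...)(?{ $third = $^N })/
    or die "no match";

print "$first $second $third\n";
```

Because the blocks run during matching, they may fire more than once if the engine backtracks; a pattern this simple never does.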
Re: Regex and Matched Delimiters
On Mon, 2002-04-22 at 14:18, Me wrote:
> > Very nice (but, I assume you meant {$foo data})!
>
> I didn't mean that (even if I should have).
>
> Aiui, Mike's final suggestion was that parens end up
> doing all the (ops data) tricks, and braces are used
> purely to do code insertions. (I really liked that idea.)
>
> So:
>
>   Perl 5        Perl6
>   (data)        ( data)
>   (?opsdata)    (ops data)
>   ({})          {}

I don't like that particular way of looking at things, but either way my comments about subroutines and closures still hold.
Re: Regex and Matched Delimiters
> Very nice (but, I assume you meant {$foo data})!

I didn't mean that (even if I should have).

Aiui, Mike's final suggestion was that parens end up doing all the (ops data) tricks, and braces are used purely to do code insertions. (I really liked that idea.)

So:

    Perl 5        Perl6
    (data)        ( data)
    (?opsdata)    (ops data)
    ({})          {}

-- ralph
Re: Regex and Matched Delimiters
On Sat, 2002-04-20 at 14:33, Me wrote:
> [2c. What about ( data) or (ops data) normally means non-capturing,
> ($2 data) captures into $2, ($foo data) captures into $foo?]

Very nice (but, I assume you meant {$foo data})! This does add another special case to the regexp parser's handling of "$", but it seems like it would be worth it.

Makes me think of the even slightly hairier:

    {&foo data}

or even more hair-full:

    {&{$foo} data}

for references. Where you capture into the usual positional, and then invoke foo with the variable as parameter. Would be pretty nice closure-wise:

    sub match_with_alert($re,$id,$ops,$fac,$pri) {
        openlog $id,$ops,$fac;
        my $alert = sub ($match) {
            syslog $pri, "Matched regexp: $match";
        }
        return study /{&{$alert} $re}/;
    }
    my $m = match_with_alert('ROOT login',$0,0,LOG_USER,PRI_CRIT);
    for <> -> $_ {
        /$m/
    }

That would certainly be a handy thing that would set Perl apart from the pack of advanced regexp languages that don't support closures.

Some other things come to mind as well, but I'm not sure how evil they are. For example:

    sub decrypt($data is rw) {
        $data = rot13($data);
    }
    print "The secret message is: ", /^Encrypted: {&decrypt .*}/, "\n";
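For comparison, the decrypt example can be approximated in Perl 5 by capturing first and then calling the function on the capture, rather than invoking it from inside the pattern. This is only a sketch of that workaround: `rot13` is a hypothetical helper defined here for the example, and the input line is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: rotate ASCII letters by 13 places.
sub rot13 {
    my ($text) = @_;
    $text =~ tr/A-Za-z/N-ZA-Mn-za-m/;
    return $text;
}

my $line = "Encrypted: Uryyb, jbeyq";

# Capture first, then hand the captured text to the callback.
my $secret;
if ($line =~ /^Encrypted: (.*)/) {
    $secret = rot13($1);
}

print "The secret message is: $secret\n" if defined $secret;
```

The proposed `{&decrypt .*}` form would fold the capture and the call into the pattern itself; the two-step version above is the closest direct equivalent in plain Perl 5.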
RE: Regex and Matched Delimiters
On Sat, 2002-04-20 at 05:06, Mike Lambert wrote:
> > He then went on to describe something I didn't understand at all.
> > Sorry.
>
> Few corrections to what you wrote:
>
> To avoid the problem of extending {} to support new features with a
> character 'x', without breaking stuff that might have an 'x' immediately
> after the '{', my proposal is to require one space after the { before the
> real regex appears.

I hope that you mean "one or more whitespace characters", not just a space. The following would be correct, no?

    /{| .* }/

Anything else would seem rather confusing to the average Perl programmer.
Re: Regex and Matched Delimiters
> [2c. What about ( data) or (ops data) normally means non-capturing,
> ($2 data) captures into $2, ($foo data) captures into $foo?]

which is cool where being explicit simplifies things, but ain't where implicit is simpler. So, maybe add an op ('$'?) or switch that makes parens capturing by default, ie as per perl5.

-- ralph
Re: Regex and Matched Delimiters
Let me see if I understand the final version of your (Mike's) suggestions and where it appears to be headed:

Backwards compatibility:
    perl5 extended syntax still works in perl6 if one happens to use it.

Forward conversion:
    Automatic conversion of relevant perl5 regex syntax to perl6 is simple.

New extension syntax:
    1. Syntax is (ops data).
    2. There are a bunch of built-in ops, but user can define new ones.
    [2c. What about ( data) or (ops data) normally means non-capturing, ($2 data) captures into $2, ($foo data) captures into $foo?]

Rationalized ops syntax:
    Ops string consists of arbitrarily ordered individual op characters. (eg '<' signifies a look behind, '!<' signifies fail if look behind match.)

Embedded code:
    Code is inserted using {} with something other than digits in them.

(Other stuff, such as sexegers, ignored.)

-- ralph
RE: Regex and Matched Delimiters
> He then went on to describe something I didn't understand at all.
> Sorry.

Few corrections to what you wrote:

To avoid the problem of extending {} to support new features with a character 'x', without breaking stuff that might have an 'x' immediately after the '{', my proposal is to require one space after the { before the real regex appears. So to correct the example I wrote of /{a|b|c}+/, it would become /{ a|b|c}+/. It looks a bit weird if you're accustomed to perl5's behavior of (?:).

{ \ } would then match a single space. { } would do nothing, since the second space falls under the whitespace-insensitive regex rule.

Now, since we require a space, all the characters before this space now become 'special' in some form. This fact allows us to add new special characters and map them to functionality, if perl doesn't already do that. For example, I would register | to be:

    sub zerowidth ($regex) {
        return <<"EOF";
    push \@pos, pos();
    regex_run $( qr/$regex/ );
    pos() = pop \@pos;
    EOF
    }

And conversely, _ would be written as:

    sub regularwidth ($r) { return "regex_run $( qr/$r/ )" }

This would allow me to do whacky things, like register these:

    sub plus ($r) {return "\$level++;regex_run $( qr/$r/ )"}
    sub minus($r) {return "\$level--;check();regex_run $( qr/$r/ )"}
    sub check {assert($level>0)}

    { {+ \(} | {- \)} | . } ({ check() })

Brent and I also disagreed on the use of sexegers. japhy has done more thinking about this than either of us have, so perhaps we should just let him weigh in on the issue. I proposed that {< be a sexeger, whereas he prefers {< be a lookbehind. I'll use the former for the rest of this discussion, since on IRC we had to agree to disagree on it. Regardless, having support for sexegers supports all of the behavior of lookbehinds, since lookbehinds are just a constant-string, and could never be a regex in Perl5.

I still like the way lookbehinds work, and am not suggesting that they disappear entirely, but rather that they be changed into an underlying sexeger form.

    sub b ($reg) {
        my $ger = reverse $reg;
        return "run_regex qr/{<|= \Q$ger\E}/"
    }

The following perl5 regex:

    /(?<=foo)bar/

is now equivalent to:

    /(b foo)bar/

> The only major drawback I can see to that is the naïve user might type
> {.*?}+ expecting a bunch of text in bold tags and getting a

Sorry I forgot to make that clearer. The above regex would have to be written as { .*}+ to work properly, specifying that there are no special tokens.

> Here's how it works:
> -If the code returns undef, we backtrack.
> -If the code returns the empty string, we move on.
> -If the code returns anything else, we interpolate that into the
> regex.
>
> So, we now just have ({}).

({print "hello"}) will, unfortunately, be really weird. Since it returns 1, the block will return 1. We'd have to force-specify a return value of "". While simplifying the set of operators is good, and I want to do a bunch of that myself, we should probably offer a way to perform the 'execute with no interpolated regex' behavior of before, somehow built up on top of the existing ({}) operator.

Reflecting on it all a bit, if we're willing to make a larger sacrifice in backwards compatibility, it might make things make more sense.

- {} would be the code operator, which was specified up above as ({}). This makes more sense, imo, since {} is traditionally used for blocks.
- () would have all the special semantics described for {} in this thread. The default for () could still be capturing, so ( a*) performs capturing on /a*/. We'd then have to define another pair of symbols for turning capturing on and off.

All instances of Perl5's (blah) would convert to ( blah), and all instances of the special operators in perl5 a la (?@#blah) would translate as they did before, but also specifying the 'dont capture within these parens' special identifier.

Basically, I'm trying to propose a system which makes all the regex stuff become orthogonal. Rather than creating a bunch of hardcoded types of (?>= regex operators, instead define small functionalities which can be combined in ways to emulate these tried and true constructs.

Brent, let me know if I'm still spouting gibberish on this email. :)

Mike Lambert
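The relationship between a lookbehind and its sexeger form can be sketched in plain Perl 5: reverse the string, reverse the pattern, and the lookbehind turns into a lookahead. This only illustrates the equivalence claimed above and is not the proposed Perl 6 syntax; the sample string is made up.

```perl
use strict;
use warnings;

my $text = "foobar bazbar";

# Perl 5 lookbehind: "bar" only where "foo" comes immediately before it.
my @direct = $text =~ /(?<=foo)bar/g;

# Sexeger form of the same test: reverse the string, reverse the pattern,
# and the lookbehind becomes a lookahead.
my $reversed = reverse $text;
my @sexeger  = $reversed =~ /rab(?=oof)/g;

printf "lookbehind hits: %d, sexeger hits: %d\n", scalar @direct, scalar @sexeger;
```

Both forms find the same single occurrence; the reversed form has no fixed-width restriction, which is the attraction of sexegers mentioned above.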
RE: Regex and Matched Delimiters
Mike Lambert:
(a bunch of stuff about regexes)

No offense intended, but I had trouble understanding that, and I helped come up with the thing. :^) So, I'll try to interpret.

In Perl 5, we came up against the problem of simply running out of characters in regexes. To deal with this, Larry came up with the (?_regex) syntax, where _ is some character. Although a clever use of an otherwise impossible sequence, it's also gratuitously ugly. Consider the many roles (?_) plays:

    Non-capturing parentheses: (?:)
    Look(ahead|behind)s: (?=), (?!), (?<=), (?)

Obviously, this is getting out of hand--using more than one or two of those constructs makes your regex much harder to read.

Let's first tackle non-capturing parentheses and lookarounds. If we think about what metacharacters are around, we can realize that {} is only legal with numbers inside it. [0] That means that we can probably reuse it. If we think about it, we can derive a few basic categories:

    -consuming (_) or not (|) [1]
        Reasoning: _ is fat, | is skinny
    -positive (=) or negative (!)
        Reasoning: same as in Perl 5
    -forwards (>) or backwards (<)
        Reasoning: same as in Perl 5

The characters in parentheses are prefix characters that indicate which is to be used. A simple mapping of the five things this section covers follows:

    Perl 5          Perl 6
    --              --
    (?:regex)       {_=>regex}
    (?=regex)       {|=>regex}
    (?!regex)       {|!>regex}
    (?<=regex)      {|=

So here's a map of what you're more likely to see in a regex:

    Perl 5          Perl 6
    --              --
    (?:regex)       {regex}
    (?=regex)       {|regex}
    (?!regex)       {|!regex}
    (?<=regex)      {|regex) -- Nonsensical.

{_=.*?}+ expecting a bunch of text in bold tags and getting a lookbehind instead--so it may be wise to leave the | and _ specifiers out of this altogether, and come up with a better way. I'll address that point shortly.

In the mean time, let's consider some of the other syntaxes. The inline code things are a good opportunity for improvement--and they have a good alternative. In Perl 5, ({ ought not to be legal, but it is--it's hacked in to be the same as (\{. So, we can drop a question mark from each of the block forms, getting ({code}) and (?{code}). However, we can go even further by combining the two. Here's how it works:

    -If the code returns undef, we backtrack.
    -If the code returns the empty string, we move on.
    -If the code returns anything else, we interpolate that into the regex.

So, we now just have ({}).

Comments can go, since Larry has said that /x will be on by default anyway.

That leaves conditionals, non-backtracking sections, inline modifiers, and (maybe) non-capturing parens. We now have three characters that aren't valid in these places: *, +, and ?. My suggestion is this:

    Thing               Syntax          Logic
    -                   --              -
    Conditionals        (?()|)          The question mark makes sense for a conditional.
    Inline Modifiers    (?imsx-imsx)    Might as well be a little bit compatible.
    Non-backtracking    (+)             + requires more than * does.
    Non-capturing       (*)             Suggestions welcome. :^)

So, my final suggestions are:

    Perl 5          Perl 6
    --              --
    (?:)            (*)
    (?=)            {}
    (?!)            {!}
    (?<=)           {<}
    (?)             (+)
    (?{})           ({}) returning empty string
    (??{})          ({}) returning a string or regex
    (?#)            N/A--obsolete

Please feel free to comment on these.

[0] Perl won't be the first tool to take advantage of this--lex uses something similar for named subexpressions.
[1] Neither of these characters is ideal, however. | looks like !, and _ might reasonably be at the beginning of this sort of thing anyway. Better suggestions are welcome.
[2] Mike originally had all the backwards matches as sexegers. I think this is a bad idea, but feel obligated to mention that.
[3] This seems a bit useless to me too. It's probably more useful to have a /r modifier on the entire regex.
[4] I changed the ordering for this one to avoid an ambiguity.

--Brent Dax <[EMAIL PROTECTED]>
@roles=map {"Parrot $_"} qw(embedding regexen Configure)

#define private public
    --Spotted in a C++ program just before a #include
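As a reminder of what the Perl 5 column in the tables above actually does, here is a small Perl 5 sketch exercising the constructs that survive in this copy of the message. The sample string and variable names are made up for the illustration; nothing here is the proposed Perl 6 syntax.

```perl
use strict;
use warnings;

my $s = "price: 100 USD";

# (?:...) groups without capturing, so the first capture is the number.
my ($amount) = $s =~ /(?:price|cost): (\d+)/;

# (?=...) positive lookahead: digits followed by " USD", without consuming it.
my ($usd) = $s =~ /(\d+)(?= USD)/;

# (?!...) negative lookahead: digits not followed by another digit or " EUR".
my ($not_eur) = $s =~ /(\d+)(?!\d| EUR)/;

# (?<=...) positive lookbehind: a word preceded by "100 ".
my ($currency) = $s =~ /(?<=100 )(\w+)/;

# (?>...) non-backtracking group: \d+ keeps every digit it grabbed, so the
# following "0" can never match and the whole pattern fails.
my $atomic = ($s =~ /(?>\d+)0/) ? 1 : 0;

print "$amount $usd $not_eur $currency $atomic\n";
```

Without the `(?>...)` wrapper, `/\d+0/` would match "100" by giving back the final digit, which is the backtracking the non-backtracking form forbids.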