Re: pattern alternation (was Re: How are ...)
Darren Duncan wrote: David Green wrote: On 2010-08-05, at 8:27 am, Aaron Sherman wrote: On Thu, Aug 5, 2010 at 7:55 AM, Carl Mäsak cma...@gmail.com wrote: I see this particular thinko a lot, though. Maybe some Perl 6 lint tool or another will detect when you have a regex containing ^ at its start, $ at the end, | somewhere in the middle, and no [] to disambiguate. I think conceptually the beginning and the end of a string feels like a bracketing construct (only without symmetrical symbols). At least that seems to be my instinct. Well, it doesn't in / ^foo | ^bar | ^qux /, but in something like /^ foo|bar $/, the context immediately implies a higher precedence for ^ and $. Maybe something like // foo|bar // could work as a bracketing version? Personally, I had always considered the ^ and $ to be the lowest precedence things in a pattern. Meta characters don't have a precedence on their own; concatenation does. Cheers, Moritz
pattern alternation (was Re: How are ...)
Carl Mäsak wrote: Darren (): Read what I said again. I was proposing that the namespace comprised of names matching a pattern like this: /^ [A..Z]+ | [a..z]+ $/ /^ [[A..Z]+ | [a..z]+] $/ Are the square brackets necessary when the pattern doesn't contain anything other than the alternatives? I would have thought them optional in the case I mentioned. Rather, they would just be necessary in a case like this: /^ foo [[A..Z]+ | [a..z]+] bar $/ -- Darren Duncan
Re: pattern alternation (was Re: How are ...)
On Thu, Aug 05, 2010 at 12:29:38AM -0700, Darren Duncan wrote: Carl Mäsak wrote: Darren (): Read what I said again. I was proposing that the namespace comprised of names matching a pattern like this: /^ [A..Z]+ | [a..z]+ $/ /^ [[A..Z]+ | [a..z]+] $/ Are the square brackets necessary when the pattern doesn't contain anything other than the alternatives? In this case yes -- the original pattern without the square brackets would act like: / [^ [A..Z]+] | [[a..z]+ $] / In other words, the original pattern says starting with uppercase or ending with lowercase. Pm
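The precedence point above is not specific to Perl 6. As a rough analogue only, the sketch below uses Python's re module (not Perl 6 syntax; the variable names and sample strings are made up for illustration) to show that an unbracketed alternation attaches each anchor to a single branch:

```python
import re

# In Python's re, as in the patterns discussed above, alternation binds
# more loosely than concatenation, so each anchor belongs to one branch.
unbracketed = re.compile(r"^[A-Z]+|[a-z]+$")    # (^[A-Z]+) or ([a-z]+$)
bracketed = re.compile(r"^(?:[A-Z]+|[a-z]+)$")  # ^ (either branch) $

# Illustrative sample strings, not taken from the thread.
samples = ["ABC", "abc", "ABC123", "123abc", "123"]
for s in samples:
    print(f"{s!r:10} "
          f"unbracketed={bool(unbracketed.search(s))} "
          f"bracketed={bool(bracketed.search(s))}")
```

With the unbracketed form, "ABC123" matches because it starts with uppercase letters, and "123abc" matches because it ends with lowercase letters; the bracketed form rejects both.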
Re: pattern alternation (was Re: How are ...)
Darren (), Carl (), Darren (), Patrick (): Read what I said again. I was proposing that the namespace comprised of names matching a pattern like this: /^ [A..Z]+ | [a..z]+ $/ /^ [[A..Z]+ | [a..z]+] $/ Are the square brackets necessary when the pattern doesn't contain anything other than the alternatives? In this case yes -- the original pattern without the square brackets would act like: / [^ [A..Z]+] | [[a..z]+ $] / In other words, the original pattern says starting with uppercase or ending with lowercase. I see this particular thinko a lot, though. Maybe some Perl 6 lint tool or another will detect when you have a regex containing ^ at its start, $ at the end, | somewhere in the middle, and no [] to disambiguate. // Carl
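The lint check Carl describes could be approximated with a very small heuristic. The sketch below is written in Python purely for illustration; looks_ambiguous is a hypothetical helper, not part of any existing Perl 6 tool, and it works on the pattern text as written in this thread (square brackets as grouping), ignoring escapes and quoting:

```python
def looks_ambiguous(pattern: str) -> bool:
    """Flag patterns that start with ^, end with $, and contain a |
    that is not enclosed in any [...] group (hypothetical lint rule)."""
    text = pattern.strip()
    if not (text.startswith("^") and text.endswith("$")):
        return False
    depth = 0  # current nesting level of [...] groups
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        elif ch == "|" and depth == 0:
            return True  # alternation at top level: anchors split up
    return False


print(looks_ambiguous("^ [A..Z]+ | [a..z]+ $"))    # the pattern from the thread
print(looks_ambiguous("^ [[A..Z]+ | [a..z]+] $"))  # the bracketed version
```

A real lint tool would need to work on the parsed regex rather than on raw text; this only shows the shape of the check.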
Re: pattern alternation (was Re: How are ...)
On Thu, Aug 5, 2010 at 7:55 AM, Carl Mäsak cma...@gmail.com wrote: Darren (), Carl (), Darren (), Patrick (): In this case yes -- the original pattern without the square brackets would act like: / [^ [A..Z]+] | [[a..z]+ $] / In other words, the original pattern says starting with uppercase or ending with lowercase. I see this particular thinko a lot, though. Maybe some Perl 6 lint tool or another will detect when you have a regex containing ^ at its start, $ at the end, | somewhere in the middle, and no [] to disambiguate. You know, this problem would go away, almost entirely, if we had a :f[ull] adverb for regex matching that imposed ^[...]$ around the entire match. Then your code becomes: m:f/[A..Z]+|[a..z]+/ for grins, :f[ull]l[ine] could use ^^ and $$. I suspect :full would almost always be associated with TOP, in fact. Boy am I tired of typing ^ and $ in TOP ;-) -- Aaron Sherman Email or GTalk: a...@ajs.com http://www.ajs.com/~ajs
Re: pattern alternation (was Re: How are ...)
Aaron Sherman wrote: You know, this problem would go away, almost entirely, if we had a :f[ull] adverb for regex matching that imposed ^[...]$ around the entire match. Then your code becomes: m:f/[A..Z]+|[a..z]+/ for grins, :f[ull]l[ine] could use ^^ and $$. I suspect :full would almost always be associated with TOP, in fact. Boy am I tired of typing ^ and $ in TOP ;-) The regex counterpart of C<say $x> vs. C<print $x\n>. Yes, this would indeed solve a lot of problems. It also reflects a tendency in some regular expression engines out there to automatically impose full string matching (i.e., an implicit ^ at the start and $ at the end). That said: for mnemonic purposes, I'd be inclined to have :f do /^[$pattern]$/, while :ff does /^^[$pattern]$$/. -- Jonathan Dataweaver Lang
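Python's re module is one example of an engine that offers whole-string matching as a separate entry point. The sketch below is an analogue of the proposed :f and :ff, not an implementation of them; the per_line name and the sample strings are assumptions made for illustration:

```python
import re

pattern = r"[A-Z]+|[a-z]+"

# search() scans for a match anywhere in the string.
print(re.search(pattern, "123abc456") is not None)

# fullmatch() requires the whole string to match, as if the pattern had
# been wrapped in ^[...]$, so the alternation needs no extra grouping.
print(re.fullmatch(pattern, "abc") is not None)
print(re.fullmatch(pattern, "ABC123") is not None)

# A per-line analogue of the suggested :ff: MULTILINE makes ^ and $
# match at line boundaries, and the group keeps the anchors outside
# the alternation.
per_line = re.compile(rf"^(?:{pattern})$", re.MULTILINE)
print(per_line.findall("ABC\n123\nabc"))
```

The explicit (?:...) group in per_line is the same disambiguation the thread asks for with square brackets; fullmatch() simply does that grouping implicitly.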
Re: pattern alternation (was Re: How are ...)
On 2010-08-05, at 8:27 am, Aaron Sherman wrote: On Thu, Aug 5, 2010 at 7:55 AM, Carl Mäsak cma...@gmail.com wrote: I see this particular thinko a lot, though. Maybe some Perl 6 lint tool or another will detect when you have a regex containing ^ at its start, $ at the end, | somewhere in the middle, and no [] to disambiguate. I think conceptually the beginning and the end of a string feels like a bracketing construct (only without symmetrical symbols). At least that seems to be my instinct. Well, it doesn't in / ^foo | ^bar | ^qux /, but in something like /^ foo|bar $/, the context immediately implies a higher precedence for ^ and $. Maybe something like // foo|bar // could work as a bracketing version? You know, this problem would go away, almost entirely, if we had a :f[ull] adverb for regex matching that imposed ^[...]$ around the entire match. I was thinking of that too. I suspect :full would almost always be associated with TOP, in fact. Boy am I tired of typing ^ and $ in TOP ;-) Does it make sense for ^[...]$ to be assumed in TOP by default? (Though not necessary if there's a shortcut like //...//.) -David
Re: pattern alternation (was Re: How are ...)
On Thu, Aug 05, 2010 at 10:27:50AM -0400, Aaron Sherman wrote: On Thu, Aug 5, 2010 at 7:55 AM, Carl Mäsak cma...@gmail.com wrote: I see this particular thinko a lot, though. Maybe some Perl 6 lint tool or another will detect when you have a regex containing ^ at its start, $ at the end, | somewhere in the middle, and no [] to disambiguate. You know, this problem would go away, almost entirely, if we had a :f[ull] adverb for regex matching that imposed ^[...]$ around the entire match. Then your code becomes: m:f/[A..Z]+|[a..z]+/ There's a version of this already. Matching against an explicit 'regex', 'token', or 'rule' automatically anchors it on both ends. Thus: $string ~~ regex { [A..Z]+ | [a..z]+ } is equivalent to $string ~~ regex { ^ [ [A..Z]+ | [a..z]+ ] $ } Pm
Re: pattern alternation (was Re: How are ...)
On Thu, Aug 5, 2010 at 11:09 AM, Patrick R. Michaud pmich...@pobox.com wrote: On Thu, Aug 05, 2010 at 10:27:50AM -0400, Aaron Sherman wrote: On Thu, Aug 5, 2010 at 7:55 AM, Carl Mäsak cma...@gmail.com wrote:

I see this particular thinko a lot, though. Maybe some Perl 6 lint tool or another will detect when you have a regex containing ^ at its start, $ at the end, | somewhere in the middle, and no [] to disambiguate.

You know, this problem would go away, almost entirely, if we had a :f[ull] adverb for regex matching that imposed ^[...]$ around the entire match. Then your code becomes: m:f/[A..Z]+|[a..z]+/

There's a version of this already. Matching against an explicit 'regex', 'token', or 'rule' automatically anchors it on both ends. Thus: $string ~~ regex { [A..Z]+ | [a..z]+ } is equivalent to $string ~~ regex { ^ [ [A..Z]+ | [a..z]+ ] $ }

While that's a nifty special case (I'm sure it will surprise me someday, and I'll spend a half hour debugging before I remember this mail), it doesn't help in the general case (see my example grammar, below).

After doing some more thinking and comparing this to other languages (Python, for example, has match, which matches only at the start of a string), it seems to me that there is a sort of out-of-band need to have a more general solution at match time.

Here's my second pass suggestion:

    m:r  / m:rooted      -- Match is rooted on both ends (^...$)
    m:rs / m:rootedstart -- Match is rooted at the start of string (^, ala Python re.match)
    m:re / m:rootedend   -- Match is rooted at the end of string ($)
    m:rn / m:rootednone  -- Match is not rooted (default)
    m:o  / m:oneline     -- Modify :r and friends to use ^^/$$

Here's one way I can see that being routinely used:

    # Simplistic shell scripts
    rule TOP :r {stmt*}           # Match the whole script
    rule stmt :r :o { cmd arg* }  # One statement per line

The other way to go about that would be with parameterized adverbs. I'm not sure how comfy people are with those, but they're in the spec. So this:

    m:r / m:rooted -- Match is rooted (default is ^...$)
    Parameters:
      :s / :start   -- Match is rooted only at start (^)
      :e / :end     -- Match is rooted only at end ($)
                       [note: :s :e should produce a warning]
      :n / :none    -- Match is not rooted (null modifier)
                       [note: combining :n with :s or :e should warn]
      :o / :oneline -- Use ^^ and $$ instead of ^ and $
                       [note: combining :o with :n should warn?]

So our statement matching grammar becomes:

    rule TOP :r {stmt*}
    rule stmt :r(:o) { cmd arg* }

The clown nose is just a side benefit ;-) Seriously, though, I prefer :r(:o) because :r:o looks like it should be the opposite of :rw (there is no :ro, as far as I know).

PS: I see no reason that any of this is needed for 6.0.0

-- Aaron Sherman Email or GTalk: a...@ajs.com http://www.ajs.com/~ajs
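To make the proposed rooting options concrete, here is a minimal Python sketch that models them as keyword arguments. The rooted helper and its parameter names are hypothetical, chosen only to mirror the :start, :end and :oneline ideas above; it is not how Perl 6 would implement the adverbs:

```python
import re

def rooted(pattern, *, start=True, end=True, oneline=False):
    """Compile `pattern` with the requested anchoring (illustrative only).

    start/end choose which ends are rooted; oneline switches from
    string anchors (\\A, \\Z) to line anchors (^, $ with MULTILINE).
    """
    head = ("^" if oneline else r"\A") if start else ""
    tail = ("$" if oneline else r"\Z") if end else ""
    flags = re.MULTILINE if oneline else 0
    # The non-capturing group keeps the anchors outside any alternation.
    return re.compile(f"{head}(?:{pattern}){tail}", flags)


both = rooted(r"[A-Z]+|[a-z]+")          # like the proposed m:r
start_only = rooted(r"GET ", end=False)  # like m:rs, or Python's re.match
end_only = rooted(r"\.txt", start=False) # like m:re
per_line = rooted(r"\w+ \w+", oneline=True)

print(bool(both.search("ABC123")))
print(bool(start_only.search("GET /index.html")))
print(bool(end_only.search("notes.txt.bak")))
print(per_line.findall("123\nls dir\n"))
```

Passing start=False and end=False gives the unrooted default, which corresponds to the proposed :rn / :none.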
Re: pattern alternation (was Re: How are ...)
On Thu, Aug 5, 2010 at 12:28 PM, Aaron Sherman a...@ajs.com wrote:

While that's a nifty special case (I'm sure it will surprise me someday, and I'll spend a half hour debugging before I remember this mail), it doesn't help in the general case (see my example grammar, below).

In the general case, no. In the case of your grammar, and all grammars, it does help. All regex routines, when called standalone, are anchored to the beginning and end of the string. So, having ^ and $ at the beginning and end of your TOP is a no-op unless some other rule calls it as a subrule.

S05 says:

    In general, the anchoring of any subrule call is controlled by its calling context. When a regex, token, or rule method is called as a subrule, the front is anchored to the current position (as with :p), while the end is not anchored, since the calling context will likely wish to continue parsing. However, when such a method is smartmatched directly, it is automatically anchored on both ends to the beginning and end of the string.

and that

    The basic rule of thumb is that the keyword-defined methods never do implicit .*?-like scanning, while the m// and s// quotelike forms do such scanning in the absence of explicit anchoring.

Given that the Grammar.parse is specified to create a new Grammar object and directly match its TOP (or the value of the :rule adverb) method, without any specification that it does implicit .*?-like scanning, I think that Grammar.parse should always anchor.

This doesn't appear to work quite properly in Rakudo currently. It anchors to the beginning but not to the end. I'm about to check if there's a rakudobug for this already, and submit it if not.

After doing some more thinking and comparing this to other languages (python, for example has match which matches only at the start of a string), it seems to me that there is a sort of out-of-band need to have a more general solution at match time.

Here's my second pass suggestion:

    m:r  / m:rooted      -- Match is rooted on both ends (^...$)
    m:rs / m:rootedstart -- Match is rooted at the start of string (^, ala Python re.match)
    m:re / m:rootedend   -- Match is rooted at the end of string ($)
    m:rn / m:rootednone  -- Match is not rooted (default)
    m:o  / m:oneline     -- Modify :r and friends to use ^^/$$

Here's one way I can see that being routinely used:

    # Simplistic shell scripts
    rule TOP :r {stmt*}           # Match the whole script
    rule stmt :r :o { cmd arg* }  # One statement per line

:oneline or similar might be useful. I'm not sure about :rootedend and :rootedstart. :rooted is useful only in one situation: when implicitly matching against the topic. You could do m:r/ foo /; to match against the topic, but regex { foo }; would not do what you want (I think). I don't know if doing an anchored match against the topic is really important enough to justify an adverb just so you don't have to do $_ ~~ regex { foo }.

The other way to go about that would be with parameterized adverbs. I'm not sure how comfy people are with those, but they're in the spec. So this:

    m:r / m:rooted -- Match is rooted (default is ^...$)
    Parameters:
      :s / :start   -- Match is rooted only at start (^)
      :e / :end     -- Match is rooted only at end ($)
                       [note: :s :e should produce a warning]
      :n / :none    -- Match is not rooted (null modifier)
                       [note: combining :n with :s or :e should warn]
      :o / :oneline -- Use ^^ and $$ instead of ^ and $
                       [note: combining :o with :n should warn?]

So our statement matching grammar becomes:

    rule TOP :r {stmt*}
    rule stmt :r(:o) { cmd arg* }

The clown nose is just a side benefit ;-) Seriously, though, I prefer :r(:o) because :r:o looks like it should be the opposite of :rw (there is no :ro, as far as I know).

PS: I see no reason that any of this is needed for 6.0.0

-- Aaron Sherman Email or GTalk: a...@ajs.com http://www.ajs.com/~ajs

-- Tyler Curtis
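The S05 distinction quoted above (a subrule is anchored at the current position but free at the end, while the top-level match must cover the whole string) has a rough counterpart in Python's pattern.match(text, pos). The sketch below is an analogy under that assumption, not Perl 6 semantics; WORD, SPACE and parse_words are names invented for the example:

```python
import re

WORD = re.compile(r"[a-z]+")
SPACE = re.compile(r"\s+")

def parse_words(text):
    """Return the words in `text`, or None unless the whole string parses."""
    pos, words = 0, []
    while pos < len(text):
        # Like a subrule call: the match must begin exactly at pos,
        # but it may stop anywhere, leaving the rest to the caller.
        m = WORD.match(text, pos)
        if m is None:
            return None  # top level: nothing may be left unparsed
        words.append(m.group())
        pos = m.end()
        s = SPACE.match(text, pos)  # continue from where the word ended
        if s is not None:
            pos = s.end()
    return words


print(parse_words("ls dir file"))
print(parse_words("ls 42"))
```

Here the loop plays the role of the calling context: each piece is anchored only at its front, and the requirement that the whole string is consumed is enforced once, at the top.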
Re: pattern alternation (was Re: How are ...)
David Green wrote: On 2010-08-05, at 8:27 am, Aaron Sherman wrote: On Thu, Aug 5, 2010 at 7:55 AM, Carl Mäsak cma...@gmail.com wrote: I see this particular thinko a lot, though. Maybe some Perl 6 lint tool or another will detect when you have a regex containing ^ at its start, $ at the end, | somewhere in the middle, and no [] to disambiguate. I think conceptually the beginning and the end of a string feels like a bracketing construct (only without symmetrical symbols). At least that seems to be my instinct. Well, it doesn't in / ^foo | ^bar | ^qux /, but in something like /^ foo|bar $/, the context immediately implies a higher precedence for ^ and $. Maybe something like // foo|bar // could work as a bracketing version? Personally, I had always considered the ^ and $ to be the lowest precedence things in a pattern. But I can understand the flexibility one gains from that not being so, having seen David's example here, which it never occurred to me before was possible. -- Darren Duncan
Re: pattern alternation (was Re: How are ...)
On Thu, Aug 5, 2010 at 2:43 PM, Tyler Curtis ekir...@gmail.com wrote: On Thu, Aug 5, 2010 at 12:28 PM, Aaron Sherman a...@ajs.com wrote: While that's a nifty special case (I'm sure it will surprise me someday, and I'll spend a half hour debugging before I remember this mail), it doesn't help in the general case (see my example grammar, below). In the general case, no. In the case of your grammar, and all grammars, it does help. All regex routines, when called standalone, are anchored to the beginning and end of the string. So, having ^ and $ at the beginning and end of your TOP is a no-op unless some other rule calls it as a subrule. There's something deeply disturbing to me in that... but I can't fully express what it is. It just feels like I'm going to end up debugging mountains of code, written by people who didn't understand that that was the case. Several times over the past few weeks, I've mentioned something on this list only to find that, buried somewhere deep in a synopsis, there was a special case I was unaware of. The sheer volume of silent special cases in Perl 6 appears to be dwarfing that of Perl 5, but perhaps that's just because I know Perl 5 far better than I know Perl 6. Mind you, I'm not complaining, so much as working out how I feel out loud. Am I the only one who feels this way at this point? :oneline or similar might be useful. I'm not sure about :rootedend and :rootedstart. Are you saying that you can't think of examples of where you want to root a regex only to the start or end, or that you just don't think you need an adverb to do it? If the former, then I submit the 1536 examples of matching only at the end of strings in my local Perl library (mostly for matching whitespace or filename extensions, it looks like) and the 3199 examples of matching only at the start, which include headers of all types (RFC2822 and friends, HTTP, CPAN configs, etc.), whitespace, command sequence matching (e.g. /^GET /) and so on. 
If the latter, then I guess you and I just have a different take, here, and that's fine. I respect your opinion, but in this case, I happen to disagree. PS: You can also search through any typical python install for \.match which will yield quite a lot of additional examples. I don't know Ruby or Java very well, or I'd go looking for examples there too. -- Aaron Sherman Email or GTalk: a...@ajs.com http://www.ajs.com/~ajs