Re: comprehensive list of perl6 rule tokens
Further woes, arguments, questions:

In regards to <@array>, A5 says "A leading @ matches like a bare
array..." but this is an over-generalization. A leading '@' merely
indicates that the rule is found in an array. <@array[3]> would be the
same as <$fourth_element_of_array>, assuming those two values are
identical.

Next, about <before RULE> and <after RULE>. What is the justification
for that syntax? There is no other example of a <-sequence with
whitespace, at least that I can see. It would appear "RULE" is an
argument of sorts to the 'before' and 'after' rules, but how do they
access that argument? How do I write a rule that takes an argument?

-- 
Jeff "japhy" Pinyan         %  How can we ever be the sold short or
RPI Acacia Brother #734     %  the cheated, we who for every service
http://japhy.perlmonk.org/  %  have long ago been overpaid?
http://www.perlmonks.org/   %    -- Meister Eckhart
Re: comprehensive list of perl6 rule tokens
On May 26, Patrick R. Michaud said:

>>   <commit>      N   backtracking fails completely
>>   <cut>         N   remove what matched up to this point from the string
>>   <after P>     N   we must be after the pattern P
>>   <!after P>    N   we must NOT be after the pattern P
>>   <before P>    N   we must be before the pattern P
>>   <!before P>   N   we must NOT be before the pattern P
>
> As with ':words', etc., I'm not sure that these qualify as "tokens"
> when parsing the regex -- the tokens are actually "<" or "

I'm curious if and "capture" anything. They don't start with '?', so
following the guidelines, it would appear they capture, but that
doesn't make sense. Should they be written as and , or is the fact that
they capture silently ignored because they're not consuming anything?

Same thing with and . And with and . It should be assumed that doesn't
capture because it can only capture if P matches, in which case fails.

So, what's the deal?
Re: comprehensive list of perl6 rule tokens
In regards to
  http://www.nntp.perl.org/group/perl.perl6.language/21120
which discusses character class syntax in Perl 6, I have some comments
to make.

First, I've been very interested in seeing proper set notation for char
classes in Perl 5. I was pretty vocal about it during TPC in 2002, I
think, and have since added some features that are in Perl 5 now that
allow you to define your own Unicode properties with not only + and -
and ! but & as well.

If we want to treat character classes as sets, then we should try to
use notation that reads properly. I don't see how '+' and '|' are any
different in this case: <+Foo +Bar> and should produce the same results
always. I suppose the + is helpful in distinguishing a character class
assertion from any other, though.

To *complement* a character class, I think the character ~ is
appropriate. Intersection should be done with &. Subtraction can be
provided with -, although it's really just a shorthand: A - B is really
A & ~B... but I suppose Huffman encoding tells us we should provide the
- sign.

Here are some examples, then:

  <+alpha -vowels>       all alphabetic characters except vowels
  <+alpha & ~vowels>     same thing
  <[a..z] -[aeiou]>      all characters 'a' through 'z' minus vowels
  <[a..z] & ~[aeiou]>    same thing
  <~(X & Y) | Z>         all characters not in X-and-Y, or in Z

The last example shows <~ which is currently unclaimed as far as
assertions go. Since I'd be advocating the removal of a unary - in
character classes (to be replaced by ~), I think this would be ok. The
allowance for a unary + in character classes has already been
justified.

For the people who are really going to use it, the notation won't be
foreign. And I'd expect most people who'd use it would actually
abstract a good portion of it away into their own property definitions,
so that <~(X & Y) | Z> would actually just be <+My_XYZ_Property> which
would be defined elsewhere.

What say you?
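The set reading proposed in this message can be checked with ordinary set operations. What follows is only a sketch in Python, not Perl 6 syntax: the universe is restricted to ASCII letters so that a complement is finite, and the names UNIVERSE, ALPHA, VOWELS, LOWER and complement are made up for the illustration.

```python
import string

# Assumption for the sketch: the universe is ASCII letters only.
# Real character classes would range over all of Unicode.
UNIVERSE = frozenset(string.ascii_letters)
ALPHA = UNIVERSE
VOWELS = frozenset("aeiouAEIOU")
LOWER = frozenset(string.ascii_lowercase)


def complement(cls):
    """~cls: every character of the universe that is not in cls."""
    return UNIVERSE - cls


# <+alpha -vowels> versus <+alpha & ~vowels>
minus_form = ALPHA - VOWELS
and_not_form = ALPHA & complement(VOWELS)

# <[a..z] -[aeiou]> versus <[a..z] & ~[aeiou]>
lower_consonants = LOWER - frozenset("aeiou")
lower_consonants_alt = LOWER & complement(frozenset("aeiou"))

# '+' between two classes read as a union, the same as '|'
foo, bar = frozenset("abc"), frozenset("cde")
union = foo | bar

# <~(X & Y) | Z>: everything not in both X and Y, plus whatever is in Z
x, y, z = frozenset("abcd"), frozenset("cdef"), frozenset("c")
not_both_or_z = complement(x & y) | z
```

In this model the "-" forms and the "& ~" forms produce identical sets, which is the sense in which subtraction is only a shorthand.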
Re: comprehensive list of perl6 rule tokens
On May 26, Patrick R. Michaud said:

> On Tue, May 24, 2005 at 08:25:03PM -0400, Jeff 'japhy' Pinyan wrote:
>> I have looked through the latest revisions of Apo05 and Syn05 (from
>> Dec 2004) and come up with the following list:
>>
>>   http://japhy.perlmonk.org/perl6/rules.txt
>
> I'll review the list below, but it's also worthwhile to read
>
>   http://www.nntp.perl.org/group/perl.perl6.language/21120
>
> which is Larry's latest missive on character classes, and
>
>   http://www.nntp.perl.org/group/perl.perl6.language/20985
>
> which describes the capturing semantics (but be sure to note the
> lengthy threads that follow concerning changes in the indexing from
> $1, $2, ... to $0, $1, ... ).

I'll check them out. Right now, I'm really only concerned with syntax
rather than implementation. Perl6::Rule::Parser will only parse the
rule into a tree structure.

>> &   a&b           N   conjunction
>>     &var          N   subroutine
>
> I'm not sure that "&var" means subroutine anymore. A05 does mention

Ok. If it goes away, I'm fine with that.

>>     x**{n..m}     N   previous atom n..m times
>
> Keeping in mind that the "n..m" can actually be any sort of closure

Yeah, I know.

>> (   (x)           Y   capture 'x'
>> )                 Y   must match opening '('
>
> It may be worth noting that parens not only capture, they also
> introduce a new scope for any nested subpattern and subrule captures.

Ok. I don't think that'll affect me right now.

>>     :ignorecase   N   case insensitivity
>>     :i
>>     :global       N   match globally
>>     :g
>>     :continue     N   start scanning after previous match
>>     :c
>>     ...etc
>
> I'm not sure these are "tokens" in the sense of "single unit of
> purpose" in your original message. I think these are all adverbs, and
> the "token" is just the initial C<:> at the beginning of a group.

I understand, but that set is particularly important to me, because as
far as I am concerned, the rule /abc/ is the object
Perl6::Rule::Parser::exact->new('abc'), whereas the rule /:i abc/ is
the object Perl6::Rule::Parser::exactf->new('abc') -- this is using
node terminology from Perl 5, where "exactf" means "exact with case
folding".

>>     :keepall      N   all rules and invoked rules remember everything
>
> That's now ":parsetree" according to Damian's proposed capture rules.

Ok. I haven't seen those yet.

>>     <commit>      N   backtracking fails completely
>>     <cut>         N   remove what matched up to this point from the string
>>     <after P>     N   we must be after the pattern P
>>     <!after P>    N   we must NOT be after the pattern P
>>     <before P>    N   we must be before the pattern P
>>     <!before P>   N   we must NOT be before the pattern P
>
> As with ':words', etc., I'm not sure that these qualify as "tokens"
> when parsing the regex -- the tokens are actually "<" or "

I understand. Luckily this new syntax will enable me to abstract things
in the parser.

  my $obj = $S->object(assertion => $name, $neg);
  # where $name is the part after the < or

Since there's no longer different prefixes for every type of assertion,
I no longer need to make specific classes of objects.

>>     <ws>          N   match whitespace by :w rules
>>     <sp>          N   match a space character (chr 32 ONLY)
>
> Here the token is "

Right.

>>     <$rule>       N   indirect rule
>>     <::$rulename> N   indirect symbolic rule
>>     <@rules>      N   like '@rules'
>>     <%rules>      N   like '%rules'
>>     <{ code }>    N   code produces a rule
>>     <&foo()>      N   subroutine returns rule
>>     <( code )>    N   code must return true or backtracking ensues
>
> Here the leading tokens are actually "<$", "<::$", "<@", "<%", "<{",
> "<&", and "<(", and I suspect we have "

Per your second message, <[EMAIL PROTECTED]> would mean >, right?

> Of course, one could claim that these are really separated as in "<",
> "?", and "$" tokens, but PGE's parser currently treats them as a unit
> to make it easier to jump directly into the correct handler for what
> follows.

Yes, so does mine. :)

>>     <[a-z]>       N   character class
>>     <+alpha>      N   character class
>>     <-[a-z]>      N   complemented character class
>
> The tokens for character class manipulation are currently "<[", "<+",
> and "&
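The two parser ideas in this message (an adverb such as :i selecting a different node type, and a single assertion object carrying a negation flag instead of one class per prefix) can be sketched as below. This is Python standing in for the Perl 5 module, and every name in it (Exact, ExactF, Assertion, literal_node, assertion_node) is hypothetical; it is not the Perl6::Rule::Parser API.

```python
from dataclasses import dataclass


@dataclass
class Exact:
    """Literal text matched case-sensitively (the "exact" node)."""
    text: str


@dataclass
class ExactF:
    """Literal text matched with case folding (the "exactf" node)."""
    text: str


@dataclass
class Assertion:
    """One class for every named assertion; negation is only a flag."""
    name: str
    negated: bool


def literal_node(text, ignorecase=False):
    # The :i adverb changes which node is built, not the text it carries.
    return ExactF(text) if ignorecase else Exact(text)


def assertion_node(source):
    # Hypothetical helper: expects the full "<...>" text of an assertion.
    if not (source.startswith("<") and source.endswith(">")):
        raise ValueError("not an assertion: %r" % source)
    body = source[1:-1]
    negated = body.startswith("!")
    if negated:
        body = body[1:]          # name is whatever follows "<" or "<!"
    return Assertion(name=body, negated=negated)
```

With this shape, assertion_node("<!after P>") and assertion_node("<after P>") differ only in the negated field, which is the abstraction the new syntax is said to allow.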
Re: comprehensive list of perl6 rule tokens
On May 25, Mark A. Biggar said:

> Jonathan Scott Duff wrote:
>> On Tue, May 24, 2005 at 11:24:50PM -0400, Jeff 'japhy' Pinyan wrote:
>>> I wish was allowed. I don't see why has to be confined to
>>> zero-width assertions.
>>
>> I don't either actually. One thing that occurred to me while
>> responding to your original email was that might have slightly wrong
>> huffmanization. Is zero-width the common case? If not, we could use
>> character doubling for emphasis: consumes, while is zero-width.
>
> Now is a character class just like <+digit> and so under the new
> character class syntax, would probably be written <+prop X> or if the
> white space is a problem, then maybe <+prop:X> (or <+prop(X)> as
> Larry gets the colon :-), but that is a pretty adverbial case so ':'
> may be okay) with the complemented case being <-prop:X>. Actually the
> 'prop' may be unnecessary at all, as we know we're in the character
> class sub-language because we saw the '<+', '<-' or '<[', so we could
> just define the various Unicode character property codes (i.e., Lu,
> Ll, Zs, etc) as pre-defined character class names just like 'digit'
> or 'letter'.

Yeah, that was going to be my next step, except that the unknowing
person might make a sub-rule of their own called, say, "Zs", and then
which would take precedence? Perhaps is a good way of writing it.

> BTW, as a matter of terminology, <-digit> should probably be called
> the complement of <+digit> instead of the negation so as not to
> confuse it with the negative zero-width assertion case.

Yeah, I just wrote that in my recent reply to Scott. I realized the
nomenclature would be a point of confusion.
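The name clash raised above (a user-defined "Zs" versus the Unicode property code Zs) can be made concrete with a small lookup sketch. It is written in Python and uses the standard unicodedata module for general categories. The thread does not settle which name wins, so the order used here (user-defined names shadow bare property codes, and a "prop:" prefix always forces the property reading) is purely an assumption for the illustration, and has_property and lookup are hypothetical names.

```python
import unicodedata


def has_property(ch, code):
    """True if ch's Unicode general category starts with code (L, Lu, Zs, ...)."""
    return unicodedata.category(ch).startswith(code)


def lookup(name, user_classes):
    """Resolve a class name to a one-character predicate.

    Assumed precedence (NOT decided in the thread): user-defined names
    shadow bare property codes; "prop:" bypasses user definitions.
    """
    if name.startswith("prop:"):
        code = name[len("prop:"):]
        return lambda ch: has_property(ch, code)
    if name in user_classes:
        return user_classes[name]
    return lambda ch: has_property(ch, name)


# A user who "unknowingly" defines a class called Zs.
user_classes = {"Zs": lambda ch: ch == "z"}

bare = lookup("Zs", user_classes)           # resolves to the user's class
explicit = lookup("prop:Zs", user_classes)  # resolves to the Unicode property
```

Under this assumption a bare Zs silently stops meaning "space separator" once the user defines their own, which is exactly the confusion the explicit prefix avoids.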
Re: comprehensive list of perl6 rule tokens
On May 25, Jonathan Scott Duff said:

> On Tue, May 24, 2005 at 11:24:50PM -0400, Jeff 'japhy' Pinyan wrote:
>> I wish was allowed. I don't see why has to be confined to zero-width
>> assertions.
>
> I don't either actually. One thing that occurred to me while
> responding to your original email was that might have slightly wrong
> huffmanization. Is zero-width the common case? If not, we could use
> character doubling for emphasis: consumes, while is zero-width.

But that's not even the point. The ! in is not what makes a zero-width
assertion, it's the 'after' that does that. All the ! does is negate
the boolean sense of the assertion, which seems like a useful thing to
have.

Hrm, but I think I see the problem. How does one define "negation" for
an arbitrary assertion? Is saying "if matches, fail"? Because then
doesn't make mean the same as <-prop X>. We don't want negation, we
want complement.

I guess '!' is only well-defined for zero-width assertions. When you
want to say , I guess > or > is the proper way to go.
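The difference between negating an assertion and complementing a character class can be seen with an existing regex engine. The sketch below uses Python's re module, not Perl 6 rules: the class \D stands in for a complemented class and the negative lookahead (?!\d) stands in for a negated zero-width assertion. Both are analogies chosen for the illustration.

```python
import re

text = "a1"

# Complement of a class: consumes exactly one character that is not a digit.
complement = re.compile(r"\D")
# Negated zero-width assertion: succeeds wherever a digit does NOT follow,
# and consumes nothing.
negated = re.compile(r"(?!\d)")

m1 = complement.match(text)   # consumes "a"
m2 = negated.match(text)      # succeeds at offset 0 with an empty match

# At the end of the string the two differ in outcome, not just in width:
end_complement = complement.match(text, 2)   # no character left to consume
end_negated = negated.match(text, 2)         # still succeeds: no digit follows
```

So "if it matches, fail" and "match one character outside the class" are different operations, which is why '!' is only straightforward for assertions that are already zero-width.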
Re: comprehensive list of perl6 rule tokens
On May 24, Jonathan Scott Duff said:

> On Tue, May 24, 2005 at 08:25:03PM -0400, Jeff 'japhy' Pinyan wrote:
>>   http://japhy.perlmonk.org/perl6/rules.txt
>
> That looks completish to me. (At least I didn't think, "hey! where's
> such and such?")

Oh, frabjous day!

> One thing that I noticed and had to look up was <-prop X> though.
> Because ...

I wish was allowed. I don't see why has to be confined to zero-width
assertions.

>> The part which needs a bit of clarification right now, in my
>> opinion, is character classes. From what I can gather, these are
>> character classes:
>>
>>   <[a-z] +>
>>   <+ -[aeiouAEIOU]>
>
> I believe that Larry blessed Pm's idea to allow
>
>   <[a..z]+digit>
>   <+alpha-[aeiouAEIOU]>

Ok, that's news to me. (I have yet to peruse the archives.) That's
nice, not requiring you to <>-ize property names inside a character
class assertion. I'd think whitespace would be permitted in between
parts of a character class, but perhaps I'm wrong. That would kinda go
against the whole "whitespace for readability" idea of Perl 6 rules,
though.

> which implies to me that assertions starting with one of "<[", "<-"
> or "<+" should be treated as character classes. This doesn't seem to
> play well with <-prop X>. Maybe it does though.

Considering the Unicode properties are like char class macro-things
(like \w and \d), I don't see a problem, except for the fact that
there's more than one "word" (chunk of non-whitespace) associated with
them. Maybe Unicode properties retain their enclosing <>'s?

> Also, I think that it's [a..z] now rather than [a-z] but I'm not
> entirely sure. At least that's how PGE implements it.

Ok. I'll wait for a message from On High about that. It's a minor
detail.

>> but I want to be sure. I'm also curious about whitespace. Is "<["
>> one token, or can I write "< [a-z] >" and have it be a character
>> class?
>
> I think you need to write "<["

I expected as much.
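As a rough illustration of the character class forms discussed in this message, the sketch below splits an assertion such as <[a..z]+digit> into signed components. It is written in Python and rests on assumptions taken from the thread rather than from any settled specification: the assertion must begin with "<[", "<+" or "<-" with no space after "<", whitespace between components is tolerated, and an unsigned component is read as "+". The names COMPONENT and split_char_class are hypothetical.

```python
import re

# One component: optional sign, then either a [bracketed set] or a bare name.
COMPONENT = re.compile(r"\s*([+-]?)\s*(\[[^\]]*\]|\w+)")


def split_char_class(source):
    """Split a character class assertion into (sign, item) pairs."""
    # "<[", "<+" and "<-" are treated as the only valid openers, so
    # "< [a-z] >" is rejected.
    if not re.match(r"<[\[+-]", source) or not source.endswith(">"):
        raise ValueError("not a character class assertion: %r" % source)
    body = source[1:-1]
    parts, pos = [], 0
    while pos < len(body):
        m = COMPONENT.match(body, pos)
        if m is None:
            if body[pos:].strip() == "":
                break                       # only trailing whitespace left
            raise ValueError("cannot parse at offset %d: %r" % (pos, body))
        sign, item = m.groups()
        parts.append((sign or "+", item))   # a missing sign is read as "+"
        pos = m.end()
    return parts
```

For example, split_char_class("<+alpha-[aeiouAEIOU]>") yields the pairs ("+", "alpha") and ("-", "[aeiouAEIOU]"), while "< [a-z] >" raises ValueError.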
comprehensive list of perl6 rule tokens
I'm working on a Perl 5 module that will allow for the parsing of a
Perl 6 rule into a tree structure -- specifically, I'm
subclassing/extending Regexp::Parser into Perl6::Rule::Parser. This
module is designed ONLY to PARSE the contents of a rule; it is not
concerned with the implementation of all the new things Perl 6 rules
will offer, merely their syntax. Once this module is done, I'll work on
a slightly broader one which will concern itself with the exterior of
the rule (the m:xyz:abc('def')/.../ part, rather than the contents of
the rule itself).

To do this effectively, I need an exhaustive list of all tokens that
can appear in a Perl 6 rule. By "token", I mean a single unit of
purpose, such as ^^ and and **{3..6}.

I have looked through the latest revisions of Apo05 and Syn05 (from Dec
2004) and come up with the following list:

  http://japhy.perlmonk.org/perl6/rules.txt

The list is split up by leading character. I think it's complete, but
I'm probably wrong, which is why I need more eyes to look over it and
tell me what I've missed. I just got an email back from Damian which
will help me move in the right direction, but I'd like this to be open
to as many knowledgeable minds as possible.

The part which needs a bit of clarification right now, in my opinion,
is character classes. From what I can gather, these are character
classes:

  <[a-z] +>
  <+ -[aeiouAEIOU]>

but I want to be sure. I'm also curious about whitespace. Is "<[" one
token, or can I write "< [a-z] >" and have it be a character class?

Thanks for your help. Unless you're difficult.
explicit laws about whitespace in rules
I'd like to know where EXACTLY whitespace is permitted in rules. Is it
legal to write

  \c [CHARACTER NAME]

or must I write

  \c[CHARACTER NAME]