Re: some newbie questions about synopsis 5
Patrick clarified:

> At any rate, I find that having a subpattern capture base its index on the highest index of all of the previous alternation branches is easy to understand and works well in practice. It can also be easily changed with another alias if needed.

I strongly agree, and would be unhappy to see it work any other way.

>> * If a subrule appears two (or more) times in the same lexical scope (i.e. twice within the same subpattern and alternation), or if the subrule is quantified anywhere within the entire rule, then its corresponding hash entry is always assigned a reference to an array of Match objects, rather than a single Match object.
>>
>> Maybe you're not the right person to ask, but is there a particular reason for the "entire rule" bit?
>>
>>     / (<foo>|<null>) <foo> (<foo>) /
>>
>> Here we get three Matches $0<foo> (possibly undefined), $<foo>, and $1<foo>. At least, I think so.
>>
>>     / (<foo>?) <foo> (<foo>) /
>>
>> Now, we suddenly get three more or less unrelated arrays with lengths 0..1, 1, and 1. Of course, I admit this example is a bit artificial.
>
> Oh, I hadn't caught that particular clause (or hadn't read it as you just did). PGE certainly doesn't implement things that way. I think the "entire rule" clause was intended to cover cases like
>
>     / [ <foo> ]* /
>
> where <foo> is indirectly quantified and therefore is an array of match objects. We should probably reword it, or get a clarification of what is intended. (Damian, @Larry: can you confirm or clarify this for us?)

Sorry, you're correct that it's not what was intended. I was specifically trying to address the case where the same subrule appears with different quantifications in different alternations in the same scope. That is, the difference between:

    m/ bar <foo> | baz <foo> /     # $<foo> always contains a scalar

and:

    m/ bar <foo> | baz <foo>* /    # $<foo> always contains an array ref

Is this clearer:

    * If a subrule appears two (or more) times in any branch of a lexical scope (i.e. twice within the same subpattern and alternation), or if the subrule is quantified anywhere within a given scope, then its corresponding hash entry is always assigned a reference to an array of Match objects, rather than a single Match object.

???

If so, I'd be happy if someone wanted to update the Synopsis that way. Note, however, that this question suggests that we need a more overt statement about what constitutes a scope within a regex. I'll work on providing that when I take my next pass through the Synopses (probably next week).

>> Furthermore, I think "within the same subpattern and alternation" is not quite correct, at least it wouldn't apply to something like
>>
>>     / (<foo> [ <foo> | ... ]) /
>>
>> unless we consider the (...) sequence as a kind of single branch alternation. And why are alternation branches considered to be lexical scopes, anyway?
>
> In the example you give, $0<foo> is indeed an array of match objects. The "same alternation" in this case is the subpattern... compare to
>
>     / (<foo> [ <foo> | ... ]) | <foo> /
>
> $0<foo> is an array, $<foo> is a single match object. Alternation branches don't create new lexical scopes, they just affect quantification and subpattern numbering. In both of the following examples
>
>     / abc <foo> def <foo> /
>     / ghi <foo> | jkl <foo> /
>
> each <foo> has the same lexical scope ($<foo>), but in the "abc" example $<foo> is an array of match objects, while in the "ghi" example $<foo> is a single match object.

Patrick is spot-on here. In simplest terms, the only things that create a scope are the regex delimiters (which delimit the outermost lexical scope), and any pair of capturing parentheses (which delimit some nested scope).

>> My second question is why adding a ? or ?? to an unquantified subrule which would otherwise result in a single Match object should result in an array, rather than a single (possibly undefined) Match.
>
> The specification was originally this way but was later changed to the current definition. I think people found the idea of ? producing a single match object confusing, so for consistency we ended up with all quantifiers producing arrays of match objects.

That's my recollection too. And I certainly agree with the decision, even though I proposed it the other way originally.

Damian
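A rough Python sketch of the reworded clause may help anyone implementing it (for instance in a Python rules module). It is only an illustration under stated assumptions: the tuple-based pattern model and the names `analyse` and `is_array` are hypothetical, and this is not PGE code or Synopsis text. It handles one scope at a time; a capturing subpattern would be passed in separately as its own scope.

```python
from collections import Counter

# Hypothetical pattern model for ONE scope (not any real library's API):
#   ("rule", name, quantified)    a subrule call such as <foo> or <foo>*
#   ("group", items, quantified)  non-capturing brackets such as [ ... ]*
#   ("alt", [branch, ...])        an alternation; each branch is a list of items
# Literal text is irrelevant to the rule and is left out of the model.

def analyse(items, quantified=False):
    """Return (counts, arrays) for a list of items.

    counts: the most times each subrule can appear along a single branch path
    arrays: subrules that are quantified, directly or via an enclosing group
    """
    counts, arrays = Counter(), set()
    for item in items:
        kind = item[0]
        if kind == "rule":
            _, name, quant = item
            counts[name] += 1
            if quant or quantified:
                arrays.add(name)
        elif kind == "group":
            _, inner, quant = item
            inner_counts, inner_arrays = analyse(inner, quantified or quant)
            counts.update(inner_counts)      # Counter.update adds the counts
            arrays |= inner_arrays
        elif kind == "alt":
            best = Counter()
            for branch in item[1]:
                branch_counts, branch_arrays = analyse(branch, quantified)
                arrays |= branch_arrays      # quantified in any branch counts
                for name, n in branch_counts.items():
                    best[name] = max(best[name], n)
            counts.update(best)
    return counts, arrays

def is_array(items, name):
    """True if $<name> would hold an array of Match objects in this scope."""
    counts, arrays = analyse(items)
    return name in arrays or counts[name] >= 2

foo = ("rule", "foo", False)
abc = [foo, foo]                                   # / abc <foo> def <foo> /
ghi = [("alt", [[foo], [foo]])]                    # / ghi <foo> | jkl <foo> /
baz = [("alt", [[foo], [("rule", "foo", True)]])]  # m/ bar <foo> | baz <foo>* /
print(is_array(abc, "foo"), is_array(ghi, "foo"), is_array(baz, "foo"))
```

Taking the per-branch maximum rather than the sum is what makes the "ghi" example a single Match while the "abc" example is an array.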
Re: some newbie questions about synopsis 5
Patrick R. Michaud wrote:

>> In the following,
>>
>>     / (a) [ (b) (c) | $5 := (d) $0 := (e) ] (f) /
>>
>> does the first alias have any effect on where the f's will go (probably not)?
>
> I'll defer to @Larry on this one, but my initial impression is that the (f) capture would go into $6.

I think that sequences should behave exactly as single branch alternations (only that there is no such thing, although we can write [<foo>|<fail>]). So I would rather opt for $1.

>> - Which rules apply to repeated captures with the same alias? For example, the second array aliasing example
>>
>>     m:w/ Mr?s? @<names> := <ident> W\. @<names> := <ident>
>>        | Mr?s? @<names> := <ident>
>>        /;
>>
>> seems to suggest that by using $<names>, the lower branch would have resulted in a single Match object instead of an array (like the array we would have gotten if we hadn't used the aliases in the first place). Is this right?
>
> Yes, that's correct.

But wouldn't it be nice if the same rules applied to aliases and subrule invocations, that is, recursion put aside, to think of / <foo> / simply as a shorter way to say / $<foo> := ([definition of foo]:) /?

And I've got two more somewhat related questions:

The synopsis says:

    * If a subrule appears two (or more) times in the same lexical scope (i.e. twice within the same subpattern and alternation), or if the subrule is quantified anywhere within the entire rule, then its corresponding hash entry is always assigned a reference to an array of Match objects, rather than a single Match object.

Maybe you're not the right person to ask, but is there a particular reason for the "entire rule" bit?

    / (<foo>|<null>) <foo> (<foo>) /

Here we get three Matches $0<foo> (possibly undefined), $<foo>, and $1<foo>. At least, I think so.

    / (<foo>?) <foo> (<foo>) /

Now, we suddenly get three more or less unrelated arrays with lengths 0..1, 1, and 1. Of course, I admit this example is a bit artificial.

Furthermore, I think "within the same subpattern and alternation" is not quite correct, at least it wouldn't apply to something like

    / (<foo> [ <foo> | ... ]) /

unless we consider the (...) sequence as a kind of single branch alternation. And why are alternation branches considered to be lexical scopes, anyway? Just because of subpattern numbering?

My second question is why adding a ? or ?? to an unquantified subrule which would otherwise result in a single Match object should result in an array, rather than a single (possibly undefined) Match. That is, why doesn't <foo>? behave like [<foo>|<null>] instead? This would save us the trouble of creating all these tiny arrays, or having to write [...|<null>] all the time. Or maybe one could define one's own quantifiers?
Re: some newbie questions about synopsis 5
On Fri, Feb 17, 2006 at 02:33:12PM +0100, H. Stelling wrote:
> Patrick R. Michaud wrote:
>
>>> In the following,
>>>
>>>     / (a) [ (b) (c) | $5 := (d) $0 := (e) ] (f) /
>>>
>>> does the first alias have any effect on where the f's will go (probably not)?
>>
>> I'll defer to @Larry on this one, but my initial impression is that the (f) capture would go into $6.
>
> I think that sequences should behave exactly as single branch alternations (only that there is no such thing, although we can write [<foo>|<fail>]). So I would rather opt for $1.

The current implementation is that a capturing subpattern is indexed based on the largest index in all of the alternation branches. I'm not sure it makes sense to base it on aliases of the last alternation branch. Here are some examples we can chew on:

    / (a) [ (b) (c) | (d) ] (f) /          # (f) is $3 or $2? (currently $3)
    / (a) [ (b) (c) | $1 := (d) ] (f) /    # (f) is $3 or $2?

Since the second example is essentially saying the same as the first, the (f) capture ought to go to the same place in each case. If we say that the existence of the $1 causes the (f) to go into $2, it also becomes the case that $2 is an array of match objects, which isn't technically problematic but might be a bit surprising for many.

Some other examples to consider:

    / (a) [ (b) (c) | $0 := (d) ] (f) /        # (f) is $3 or $1?
    / (a) [ (b) (c) | $0 := (d) (e) ] (f) /    # (f) is $3 or $2?

At any rate, I find that having a subpattern capture base its index on the highest index of all of the previous alternation branches is easy to understand and works well in practice. It can also be easily changed with another alias if needed.

> But wouldn't it be nice if the same rules applied to aliases and subrule invocations, that is, recursion put aside, to think of / <foo> / simply as a shorter way to say / $<foo> := ([definition of foo]:) /?

First, is that colon following [definition of foo] intentional or a typo? Currently we can backtrack into subrules -- there's no cut assumed after them.

But secondly, I'm not sure we can casually toss recursion aside when thinking about this, since it's really a driving force behind having named subrules. :-) There's also a difference in that subrules can take arguments, as in <foo('args')>, or can come from another grammar, as in <Rule::foo>, which seems to argue that <foo> is really something other than an alias shorthand.

> The synopsis says:
>
>     * If a subrule appears two (or more) times in the same lexical scope (i.e. twice within the same subpattern and alternation), or if the subrule is quantified anywhere within the entire rule, then its corresponding hash entry is always assigned a reference to an array of Match objects, rather than a single Match object.
>
> Maybe you're not the right person to ask, but is there a particular reason for the "entire rule" bit?
>
>     / (<foo>|<null>) <foo> (<foo>) /
>
> Here we get three Matches $0<foo> (possibly undefined), $<foo>, and $1<foo>. At least, I think so.
>
>     / (<foo>?) <foo> (<foo>) /
>
> Now, we suddenly get three more or less unrelated arrays with lengths 0..1, 1, and 1. Of course, I admit this example is a bit artificial.

Oh, I hadn't caught that particular clause (or hadn't read it as you just did). PGE certainly doesn't implement things that way. I think the "entire rule" clause was intended to cover cases like

    / [ <foo> ]* /

where <foo> is indirectly quantified and therefore is an array of match objects. We should probably reword it, or get a clarification of what is intended. (Damian, @Larry: can you confirm or clarify this for us?)

> Furthermore, I think "within the same subpattern and alternation" is not quite correct, at least it wouldn't apply to something like
>
>     / (<foo> [ <foo> | ... ]) /
>
> unless we consider the (...) sequence as a kind of single branch alternation. And why are alternation branches considered to be lexical scopes, anyway?

In the example you give, $0<foo> is indeed an array of match objects. The "same alternation" in this case is the subpattern... compare to

    / (<foo> [ <foo> | ... ]) | <foo> /

$0<foo> is an array, $<foo> is a single match object. Alternation branches don't create new lexical scopes, they just affect quantification and subpattern numbering. In both of the following examples

    / abc <foo> def <foo> /
    / ghi <foo> | jkl <foo> /

each <foo> has the same lexical scope ($<foo>), but in the "abc" example $<foo> is an array of match objects, while in the "ghi" example $<foo> is a single match object.

> My second question is why adding a ? or ?? to an unquantified subrule which would otherwise result in a single Match object should result in an array, rather than a single (possibly undefined) Match.

The specification was originally this way but was later changed to the current definition. I think people found the idea of ? producing a single match object confusing, so for consistency we ended up with all quantifiers producing arrays of match objects. (Note also that even if ? produced
Re: some newbie questions about synopsis 5
On Fri, Feb 17, 2006 at 08:32:18AM -0600, Patrick R. Michaud wrote:
: > The synopsis says:
: >
: >   * If a subrule appears two (or more) times in the same lexical scope
: >     (i.e. twice within the same subpattern and alternation), or if the
: >     subrule is quantified anywhere within the entire rule, then its
: >     corresponding hash entry is always assigned a reference to an array
: >     of Match objects, rather than a single Match object.
: >
: > Maybe you're not the right person to ask, but is there a particular
: > reason for the "entire rule" bit?
: >
: >     / (<foo>|<null>) <foo> (<foo>) /
: >
: > Here we get three Matches $0<foo> (possibly undefined), $<foo>, and
: > $1<foo>. At least, I think so.
: >
: >     / (<foo>?) <foo> (<foo>) /
: >
: > Now, we suddenly get three more or less unrelated arrays with lengths
: > 0..1, 1, and 1. Of course, I admit this example is a bit artificial.
:
: Oh, I hadn't caught that particular clause (or hadn't read it as
: you just did). PGE certainly doesn't implement things that way.
: I think the "entire rule" clause was intended to cover cases like
:
:     / [ <foo> ]* /
:
: where <foo> is indirectly quantified and therefore is an array of
: match objects. We should probably reword it, or get a clarification
: of what is intended. (Damian, @Larry: can you confirm or clarify
: this for us?)

I believe that was the intent, but I'll defer to Damian on the wordsmithing because I'm a bit out of sorts at the moment and it'd probably come out all sideways.

Larry
some newbie questions about synopsis 5
Hello,

I stumbled upon Perl 6 a couple of weeks ago and I'm really looking forward to seeing the finished product. Currently, I'm trying to implement a Perl-like rules module for Python, and I've got some questions which I think aren't covered in the Synopsis or anywhere else I looked, mostly concerning captures and aliases:

- Capture numbering:

      /(a) [ (b) (c) (d) | (e) (f) ] (g)/

  capture.t suggests something like

       $0    $1  $2  $3    $1  $2    $4

  but I'm only guessing about the bit. In the following,

      / (a) [ (b) (c) | $5 := (d) $0 := (e) ] (f) /

  does the first alias have any effect on where the f's will go (probably not)?

- Which rules apply to repeated captures with the same alias? For example, the second array aliasing example

      m:w/ Mr?s? @<names> := <ident> W\. @<names> := <ident>
         | Mr?s? @<names> := <ident>
         /;

  seems to suggest that by using $<names>, the lower branch would have resulted in a single Match object instead of an array (like the array we would have gotten if we hadn't used the aliases in the first place). Is this right? And could the same effect have been achieved by something like

      / $<names> := <ident>**{1} /

  ?

- More array aliasing: is

      / mv @files := [...]* /

  just (slightly) shorter for

      / mv [$files := [...]]* /

  ? Likewise, could

      / @pairs := ( (\w+) \: (\N+) )+ /

  have also been written

      / [ $pairs := (\w+) \: $pairs := (\N+) ]+ /

  ?

- Array and hash aliasing of quantified subpatterns or subrules: what happens to the named captures?

      / @foo := ( ... $bar := (...) ... )* /

  And if the subpattern or subrule ends with an alternation, can the number of array elements to be appended (or hashed) vary depending on which branch is taken?

- Which of the following constructs could possibly be ok (I hope, none)?

      / $foo := ... $foo := ... /
      / $foo := ... %foo := ... /
      / $foo := ... | %foo := ... /
      / $foo := $foo := ... /

- Do aliases bind right-to-left, as do assignments?

      / $2 := $5 := ... /    # next should be $3, not $6

- Which kind of escape sequences are allowed (or required) in enumerated character classes?

Thanks in advance for any answers!
Re: some newbie questions about synopsis 5
On Wed, Feb 15, 2006 at 10:09:05AM +0100, H. Stelling wrote:
> - Capture numbering:
>
>       /(a) [ (b) (c) (d) | (e) (f) ] (g)/
>
>   capture.t suggests something like
>
>        $0    $1  $2  $3    $1  $2    $4
>
>   but I'm only guessing about the bit.

Yes.

> In the following,
>
>     / (a) [ (b) (c) | $5 := (d) $0 := (e) ] (f) /
>
> does the first alias have any effect on where the f's will go (probably not)?

I'll defer to @Larry on this one, but my initial impression is that the (f) capture would go into $6.

> - Which rules apply to repeated captures with the same alias? For example, the second array aliasing example
>
>       m:w/ Mr?s? @<names> := <ident> W\. @<names> := <ident>
>          | Mr?s? @<names> := <ident>
>          /;
>
>   seems to suggest that by using $<names>, the lower branch would have resulted in a single Match object instead of an array (like the array we would have gotten if we hadn't used the aliases in the first place). Is this right?

Yes, that's correct.

> And could the same effect have been achieved by something like
>
>     / $<names> := <ident>**{1} /
>
> ?

Yes, a quantified capturing subrule or subpattern results in an array of Match objects (even if the quantification is 1).

> - More array aliasing: is
>
>       / mv @files := [...]* /
>
>   just (slightly) shorter for
>
>       / mv [$files := [...]]* /
>
>   ?

I think so.

> Likewise, could
>
>     / @pairs := ( (\w+) \: (\N+) )+ /
>
> have also been written
>
>     / [ $pairs := (\w+) \: $pairs := (\N+) ]+ /
>
> ?

Seems like it would work.

> - Array and hash aliasing of quantified subpatterns or subrules: what happens to the named captures?
>
>       / @foo := ( ... $bar := (...) ... )* /

Presuming you meant $<bar> there instead of $bar, I have no idea what would happen. (With $bar it's an external alias and would capture an array of matches into the scope in which the rule was declared.)

> And if the subpattern or subrule ends with an alternation, can the number of array elements to be appended (or hashed) vary depending on which branch is taken?

Again I have to refer this to @Larry, but my initial impression is yes, it would vary.

> - Which of the following constructs could possibly be ok (I hope, none)?
>
>       / $foo := ... $foo := ... /

I think this one is okay. $foo is an array of Match objects, and each Match is likely repeated within the array.

>       / $foo := ... %foo := ... /

I hope this is not okay. It's certainly not going to be okay anytime soon in the PGE implementation of Perl 6 rules. :-)

>       / $foo := ... | %foo := ... /

Since the two aliases are in separate alternation branches, I think this is okay. The argument would be similar to

    / $foo := ... | @foo := ... /

in which $foo is either a single Match object or an array of Match objects depending on the branch matched.

>       / $foo := $foo := ... /

While my instinctual reaction is to say that this ought to be okay, upon thinking about it a bit more I think I'd prefer to say that it's not. At least initially, if nothing else. In particular, I wonder about something like

    / @foo := $bar := [...]+ /

If we say that an alias always requires a subpattern or subrule (and not another alias), then we avoid a lot of ambiguity, and the above could be written as

    / @foo := [ $bar := [...]+ ] /
    / @foo := [ $bar := [...] ]+ /

depending on what is desired.

> - Do aliases bind right-to-left, as do assignments?
>
>       / $2 := $5 := ... /    # next should be $3, not $6

Assuming we allow chained aliases such as this (see above note), I'd still argue for $6 instead of $3.

> - Which kind of escape sequences are allowed (or required) in enumerated character classes?

AFAIK, this hasn't been completely decided or specified yet.

Pm
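For anyone following along in Python, the numbering discussed in this reply can be modelled in a few lines. This is only a toy sketch under assumptions: the tuple-based pattern model and the function `number_captures` are hypothetical names, the rule that a numbered alias restarts the counter at alias + 1 is taken from the questioner's reading, and the "highest index of any branch, plus one" rule after an alternation reflects the initial impression given above, not a settled decision.

```python
# Hypothetical pattern model (not PGE or any real library's API):
#   ("cap", name, alias)     a capturing group; alias is an int such as 5
#                            for "$5 := (...)", or None when there is no alias
#   ("alt", [branch, ...])   a non-capturing alternation of patterns

def number_captures(pattern, start=0, out=None):
    """Assign an index to every capture; return (name -> index, highest index)."""
    if out is None:
        out = {}
    next_index = start
    highest = start - 1            # nothing numbered in this sequence yet
    for item in pattern:
        if item[0] == "cap":
            _, name, alias = item
            index = next_index if alias is None else alias
            out[name] = index
            next_index = index + 1            # an alias restarts the counter
            highest = max(highest, index)
        else:
            # Every branch starts from the same counter value.
            branch_highs = [number_captures(branch, next_index, out)[1]
                            for branch in item[1]]
            top = max(branch_highs)
            highest = max(highest, top)
            next_index = top + 1              # continue after the highest index
    return out, highest

def cap(name, alias=None):
    return ("cap", name, alias)

# /(a) [ (b) (c) (d) | (e) (f) ] (g)/
plain = [cap("a"), ("alt", [[cap("b"), cap("c"), cap("d")],
                            [cap("e"), cap("f")]]), cap("g")]
# / (a) [ (b) (c) | $5 := (d) $0 := (e) ] (f) /
aliased = [cap("a"), ("alt", [[cap("b"), cap("c")],
                              [cap("d", 5), cap("e", 0)]]), cap("f")]

print(number_captures(plain)[0])
print(number_captures(aliased)[0])
```

Under these assumptions the first pattern numbers (g) as $4 and the second numbers (f) as $6, matching the answers given above; basing the index on the aliases of the last branch instead would need a different update of `next_index` after the alternation.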