Re: S05 question
On Thu, Dec 09, 2004 at 11:18:34AM -0800, Larry Wall wrote:
> On Wed, Dec 08, 2004 at 08:24:20PM -0800, Ashley Winters wrote:
> : I'm still going to prefer using :=, simply as a good programming
> : practice. My mind sees a big difference between building a parse-tree
> : object and just grepping for some word I want in a string. Within a
> : rule{} block, there is no place except the rule object to keep your
> : data (hypothetically -- haha), so it makes sense to have everything
> : capture unless otherwise specified. There's no such limitation in a
> : regular code block, so I don't see the need.
>
> Since regex results are lexically scoped in Perl 6, in a regular code block we can do static analysis and determine whether there's any possibility that $foo is referenced at all, and optimize it away in many cases, if it turns out to be high overhead. But as Patrick points out, so far capture seems pretty cheap.

It might turn out to be worth optimizing only when ALL of the capture blocks are unused. The savings from avoiding the setup costs, together with the incremental costs (too small to be a bother by themselves), might be significant when taken together.
Re: S05 question
On Thu, Dec 09, 2004 at 10:52:54AM +, Matthew Walton wrote: Of course, it then begs the question about word ws $foo ws number if we're thinking of parallels with qw//-like constructs, which I certainly am. I'm not quite sure what that would do, as it collides slightly with the existing rule match syntax (which I quite like), and thus it may already have a meaning. This already has a meaning, it calls the word assertion with the (rule) expression /ws $foo ws number/ as an argument. At least it's that way unless/until Larry changes (changed?) it. Pm
Re: S05 question
On Wed, Dec 08, 2004 at 08:24:20PM -0800, Ashley Winters wrote:
> I was working on the (possibly misguided) assumption that there's a cost to capturing, and that perhaps aggressive capturing isn't worth having on in a one-liner. Some deep part of my mind remembers $` being bad, I think. If there's no consequence to having capture being on, then ignoring it is fine. I don't have a problem with that. As I said before, ?foo reads fine to me.

At least in the current implementation of PGE there's not a big cost to capturing of any sort. Each capture is held as a pair of (start,end) offsets into the target string, so there's no string copying or other overhead until the captured item is actually referred to. It's even easy to determine $` as being the start of the string up to the beginning offset of the $0 capture. (Yes, the Perl 5 docs indicate there's a cost to $` that's incurred for all regexps in a program once $` is used, but I don't think that will translate over to PGE.)

That might change, of course, especially as we add the ability to modify the target string in the middle of the match. But even then we may be able to keep the offset pairs as a useful optimization.

At the moment the bigger cost is calling the subrule itself -- and even here it's basically the equivalent of a method or subroutine call (actually, coroutine calls), since a called rule maintains its own match state just like any other match.

Pm
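The offset-pair idea Patrick describes can be sketched in a few lines of Python. This is only an illustration of the general technique, not PGE's actual data structure; the `Capture` class and `prematch` helper are hypothetical names invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capture:
    """A capture stored only as (start, end) offsets into the target string."""
    target: str
    start: int
    end: int

    def text(self) -> str:
        # The substring is built only when somebody actually asks for it.
        return self.target[self.start:self.end]


def prematch(whole: Capture) -> str:
    # Rough analogue of $`: everything before the start offset of the match.
    return whole.target[:whole.start]


target = "key = value"
whole = Capture(target, 6, 11)  # pretend the overall match covered "value"
print(whole.text())             # substring materialized on demand
print(repr(prematch(whole)))    # derived from the offsets alone
```

Recording a capture costs two integers, and the prematch falls out of the start offset without any extra bookkeeping, which is the point made about $` above.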
Re: S05 question
On Wed, Dec 08, 2004 at 08:24:20PM -0800, Ashley Winters wrote: : I'm still going to prefer using :=, simply as a good programming : practice. My mind sees a big difference between building a parse-tree : object and just grepping for some word I want in a string. Within a : rule{} block, there is no place except the rule object to keep your : data (hypothetically -- haha), so it makes sense to have everything : capture unless otherwise specified. There's no such limitation in a : regular code block, so I don't see the need. Since regex results are lexically scoped in Perl 6, in a regular code block we can do static analysis and determine whether there's any possibility that $foo is referenced at all, and optimize it away in many cases, if it turns out to be high overhead. But as Patrick points out, so far capture seems pretty cheap. Larry
Re: S05 question
On Tue, Dec 07, 2004 at 10:36:53PM -0800, Larry Wall wrote:
: But somehow I expect that when someone writes (foo) they probably
: usually meant («foo»).

If we're going to stick with the notion that foo captures and something else doesn't, I'm beginning to think that the other thing isn't «foo» for a couple of reasons. First, if other languages are going to borrow this notation, they're probably not going to buy into the French quotes. Second, I can think of several other possible uses for the French quotes to cure perceived ills such as the (...) vs {...} confusion. Third, it now bothers me to have a ! without a ?.

So what if «foo» is instead written ?foo, meaning you only want to evaluate its success. (Unlike !foo, it's not zero-width, but that's just how success/failure works.) So we'd get things like

    / $bar := [ (?ident) = (\N+) ]* /

And people would have to get used to seeing ? as non-capturing assertions:

    ?before ...
    ?after ...
    ?ws
    ?sp
    ?null

This has a rather Ruby-esque "I am a boolean" feeling to it. I think I like it. It's pretty easy to type, at least on my keyboard.

Now suppose that we extend that "I am a boolean" feeling to

    ?{ code }

which might take the place of the confusing (...), and make consistent the notion that we always use {...} to invoke real code.

: : Or is it that hypotheticals only bind to things captured by parens?
: : If so, it might need clarification (or perhaps I'm overlooking the part
: : that makes it clear).
:
: No, I think you just found a blind spot in the design.

I think I'm leaning toward the idea that anything in angles that begins alpha is a capture to just the alpha part, so the ? prefix is merely a no-op that happens to make the assertion not start with an alpha. Interestingly, that gives these implicit bindings:

    after ...     $after     $`
    before ...    $before    $'

Though that's an argument for changing them to pre ... and post ..., I suppose, since if users are going to refer to $after in their main program, it doesn't look like a declarative assertion anymore.

Another problem we've run into is naming if there are multiple assertions of the same name. If the capture name is just the alpha part of the assertion, then we could allow an optional number, and still recognize it as a ws:

    ws1 ws2 ws3

Except I can well imagine people wanting numbered rules. Drat. Could force people to say ws_1 if they want that, I suppose. Or we could use some standard delim for that:

    ws-1 ws-2 ws-3

which is vaguely reminiscent of our version syntax. Indeed, if we had quantifications, you might well want to have wildcards ws-* and let the name be filled in rather than autogenerating a list. But maybe we just stick with lists in that case.

For captures of non-alpha assertions, we could say that ? is the same as true (just as with regular operators), and so true-3 +alpha-[aeiou] would capture to $true-3. (And one could always do an explicit binding for a different name.) Actually, I think people would find $match-3 more meaningful than $true-3.

I'm still thinking about what «...» might mean, if anything. Bonus points for interpolative and/or word-splitty.

Anyway, that's where I am this week/day/hour/minute/second.

Larry
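For readers more familiar with conventional regex engines, the three behaviours being separated here (capturing, matching without capturing, and zero-width negation) have rough counterparts in Python's `re` module. The sketch below is only an analogy in a different language and says nothing about how the proposed ?foo and !foo forms would be implemented.

```python
import re

text = "foobar"

# Capturing group: consumes "foo" and records it.
capturing = re.match(r"(foo)bar", text)

# Non-capturing group: consumes "foo" but records nothing,
# loosely comparable to the proposed ?foo form.
non_capturing = re.match(r"(?:foo)bar", text)

# Negative lookahead: zero-width, consumes no text at all,
# loosely comparable to !foo.
negative = re.match(r"(?!bar)foo", text)

print(capturing.groups())      # one recorded group
print(non_capturing.groups())  # no recorded groups
print(non_capturing.end())     # still advanced past "foobar"
print(negative.end())          # the lookahead itself consumed nothing
```

The non-capturing match still advances through the text, which mirrors the remark that ?foo, unlike !foo, is not zero-width.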
Re: S05 question
Larry Wall wrote:
> Another problem we've run into is naming if there are multiple assertions of the same name. If the capture name is just the alpha part of the assertion, then we could allow an optional number, and still recognize it as a ws:
>
>     ws1 ws2 ws3
>
> Except I can well imagine people wanting numbered rules. Drat. Could force people to say ws_1 if they want that, I suppose. Or we could use some standard delim for that:
>
>     ws-1 ws-2 ws-3
>
> which is vaguely reminiscent of our version syntax. Indeed, if we had quantifications, you might well want to have wildcards ws-* and let the name be filled in rather than autogenerating a list. But maybe we just stick with lists in that case.
>
> For captures of non-alpha assertions, we could say that ? is the same as true (just as with regular operators), and so true-3 +alpha-[aeiou] would capture to $true-3. (And one could always do an explicit binding for a different name.) Actually, I think people would find $match-3 more meaningful than $true-3.

PHP's use of $array[] as push might work for this:

    true[] +alpha-[aeiou]

or

    @true +alpha-[aeiou]

or

    true=1.. +alpha-[aeiou]

or

    true@ +alpha-[aeiou]

I like the idea of being able to "continue" versus "chunk" patterns. How do you say "This is a continuation of the other thing" versus "This is a separate thing"?

=Austin
Re: S05 question
On Wed, Dec 08, 2004 at 08:19:17AM -0800, Larry Wall wrote:
> And people would have to get used to seeing ? as non-capturing assertions:
>
>     ?before ...
>     ?after ...
>     ?ws
>     ?sp
>     ?null
>
> This has a rather Ruby-esque "I am a boolean" feeling to it. I think I like it. It's pretty easy to type, at least on my keyboard.

FWIW, for some reason in rule contexts I tend to conflate "I am a boolean" feelings with "zero-width assertion", so that each of those looks vaguely to me as though I'm testing a zero-width proposition and not consuming any text. And I still tend to think of '?' in its "zero or one matches" or "minimal match" connotations. Oh well, I suppose I could get used to that.

> Now suppose that we extend that "I am a boolean" feeling to
>
>     ?{ code }
>
> which might take the place of the confusing (...), and make consistent the notion that we always use {...} to invoke real code.

Hmm, this is nice, however.

> Another problem we've run into is naming if there are multiple assertions of the same name. If the capture name is just the alpha part of the assertion, then we could allow an optional number, and still recognize it as a ws:
>
>     ws1 ws2 ws3
>
> Except I can well imagine people wanting numbered rules. Drat. Could force people to say ws_1 if they want that, I suppose.

I had been thinking that

    /ws foo ws bar/

would simply cause $ws to be a list of captured elements, similar to what might happen for $1 in

    / [ (.*?) , ]* /

If someone really needs the contents of the first and second ws, they could do

    (ws) foo (ws)

and get them as $1 and $2. But, seeing this tells me that perhaps (rule) should be used for capturing rules, analogous to the capturing parens, and leave rule to be the non-capturing version. But maybe that's anti-Huffman overall. Maybe the parens could also help for disambiguating

    (ws) foo (ws)

so that we end up with $/ws[1], $/ws[2], etc. But then we might have to always subscript our named captures, which is icky, or maybe we'd only make $/ws act like a list when there's more than one capturing (ws) in the rule.

I dunno. I kinda like (rule) for capturing, but maybe it just doesn't work.

Pm
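The "repeated name becomes a list" behaviour Patrick has in mind can be approximated in Python, with the caveat that Python's `re` keeps only the last value of a repeated group, so the sketch gathers values across successive matches instead. The helper name `collect_captures` and the sample pattern are made up for this illustration.

```python
import re
from collections import defaultdict


def collect_captures(pattern, text):
    """Gather every named capture across all matches into per-name lists."""
    captures = defaultdict(list)
    for m in re.finditer(pattern, text):
        for name, value in m.groupdict().items():
            if value is not None:  # skip alternatives that did not take part
                captures[name].append(value)
    return dict(captures)


# Two whitespace runs and three words, each collected under its own name.
result = collect_captures(r"(?P<ws>\s+)|(?P<word>\w+)", "foo  bar baz")
print(result["ws"])    # every whitespace run, in order
print(result["word"])  # every word, in order
```

Indexing into `result["ws"]` then plays the role of the subscripted form discussed above, at the cost of always having a list even when only one capture was made.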
Re: S05 question
Larry Wall writes:
> If we're going to stick with the notion that foo captures and something else doesn't, I'm beginning to think that the other thing isn't «foo» for a couple of reasons.

I just sat down to say the exact same thing. I'm glad you beat me to it.

> And people would have to get used to seeing ? as non-capturing assertions:
>
>     ?before ...
>     ?after ...
>     ?ws
>     ?sp
>     ?null
>
> This has a rather Ruby-esque "I am a boolean" feeling to it. I think I like it. It's pretty easy to type, at least on my keyboard.

Yeah, I like it pretty well too. Better than the French quotes for sure.

> Now suppose that we extend that "I am a boolean" feeling to
>
>     ?{ code }
>
> which might take the place of the confusing (...), and make consistent the notion that we always use {...} to invoke real code.

Hmm... I'm just so attached to (...). I find it quite beautiful. It also somehow communicates the feeling "you shouldn't be putting side-effects here".

> I think I'm leaning toward the idea that anything in angles that begins alpha is a capture to just the alpha part, so the ? prefix is merely a no-op that happens to make the assertion not start with an alpha. Interestingly, that gives these implicit bindings:
>
>     after ...     $after     $`
>     before ...    $before    $'

I don't quite follow. Wouldn't that mean that these guys would get clobbered if you used lookaheads or lookbehinds in your rules?

> Or we could use some standard delim for that:
>
>     ws-1 ws-2 ws-3
>
> which is vaguely reminiscent of our version syntax. Indeed, if we had quantifications, you might well want to have wildcards ws-* and let the name be filled in rather than autogenerating a list. But maybe we just stick with lists in that case.

I can imagine this being a lot cleaner if the thing after the dash can be any sort of identifier:

    ws-indent if ?ws condition ws-comment

On the other hand, it could be misleading, since the standard naming of BNF uses dashes instead of underscores. I don't think it should be a big problem though.

> I'm still thinking about what «...» might mean, if anything. Bonus points for interpolative and/or word-splitty.

Yeah... umm... nope. I got nothin.

Luke
Re: S05 question
On Wed, 8 Dec 2004 08:19:17 -0800, Larry Wall wrote:
>     / $bar := [ (?ident) = (\N+) ]* /

You know, to be honest I don't know that I want rules in one-liners to capture by default. I certainly want them to capture in rules, though.

> And people would have to get used to seeing ? as non-capturing assertions:
>
>     ?before ...
>     ?after ...
>     ?ws
>     ?sp
>     ?null
>
> This has a rather Ruby-esque "I am a boolean" feeling to it. I think I like it. It's pretty easy to type, at least on my keyboard.

I like it. It reads to me as "if before ...", "if null". Sounds good.

> I think I'm leaning toward the idea that anything in angles that begins alpha is a capture to just the alpha part, so the ? prefix is merely a no-op that happens to make the assertion not start with an alpha. Interestingly, that gives these implicit bindings:
>
>     after ...     $after     $`
>     before ...    $before    $'

Again, I don't see the utility of that in a one-liner. In a grammar, you would create a real rule which would assert after ... and capture the result in a reasonable name.

> Anyway, that's where I am this week/day/hour/minute/second.

I'm thinking capturing rules should be default in rules, where they're downright useful. Your hour/minute/second comment brings up parsing ISO time:

    grammar ISO8601::DateTime {
        rule year { \d<4> }
        rule month { \d<2> }
        rule day { \d<2> }
        rule hour { \d<2> }
        rule minute { \d<2> }
        rule second { \d<2> }
        rule fraction { \d+ }

        rule date { <year> -? <month> -? <day> }
        rule time { <hour> \:? <minute> \:? <second> [\. <fraction>]? }
        rule datetime { <date> T <time> }
    }

For a grammar, that works perfectly! In a one-liner, I'd rather just use:

    $datetime ~~ /$year := (\d+) -? $month := (\d+) -? ./

and specify the vars I want to save directly in my own scope.

Ashley Winters
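As a point of comparison, the same ISO 8601 shape can be written with named groups in Python's `re` module, where every named piece is captured by default much as Ashley wants for grammars. This is a hedged sketch in a different language, assuming four-digit years and two-digit fields as in the grammar above; `DATETIME` and `parse_datetime` are names chosen for the example.

```python
import re

# One named group per rule in the grammar; separators are optional,
# and the fractional second is an optional trailing part.
DATETIME = re.compile(
    r"(?P<year>\d{4})-?(?P<month>\d{2})-?(?P<day>\d{2})"
    r"T"
    r"(?P<hour>\d{2}):?(?P<minute>\d{2}):?(?P<second>\d{2})"
    r"(?:\.(?P<fraction>\d+))?"
)


def parse_datetime(text):
    """Return a dict of the named captures, or None if the text does not match."""
    m = DATETIME.fullmatch(text)
    return m.groupdict() if m else None


parsed = parse_datetime("2004-12-08T08:19:17")
print(parsed["year"], parsed["hour"], parsed["fraction"])
```

A caller who only wants the year and month can ignore the other keys, which is the "just ignore the result" option raised in the replies.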
Re: S05 question
Ashley Winters writes:
> I'm thinking capturing rules should be default in rules, where they're downright useful. Your hour/minute/second comment brings up parsing ISO time:
>
>     grammar ISO8601::DateTime {
>         rule year { \d<4> }
>         rule month { \d<2> }
>         rule day { \d<2> }
>         rule hour { \d<2> }
>         rule minute { \d<2> }
>         rule second { \d<2> }
>         rule fraction { \d+ }
>
>         rule date { <year> -? <month> -? <day> }
>         rule time { <hour> \:? <minute> \:? <second> [\. <fraction>]? }
>         rule datetime { <date> T <time> }
>     }
>
> For a grammar, that works perfectly!

Yep.

> In a one-liner, I'd rather just use:
>
>     $datetime ~~ /$year := (\d+) -? $month := (\d+) -? ./

Then go ahead and use that. If you're going to use subrules, you can either use the ?subrule form or just the regular old subrule form and ignore the result. There's nothing forcing you to pay attention to those. The number variables only get incremented when you use parentheses. I'd suspect that the return value of a rule only accounts for parenthesized captures as well.

Or are you asking something different than that?

Luke
Re: S05 question
On Wed, Dec 08, 2004 at 11:09:30AM -0700, Patrick R. Michaud wrote:
: On Wed, Dec 08, 2004 at 08:19:17AM -0800, Larry Wall wrote:
: > And people would have to get used to seeing ? as non-capturing assertions:
: >     ?before ...
: >     ?after ...
: >     ?ws
: >     ?sp
: >     ?null
: > This has a rather Ruby-esque "I am a boolean" feeling to it. I think
: > I like it. It's pretty easy to type, at least on my keyboard.
:
: FWIW, for some reason in rule contexts I tend to conflate
: "I am a boolean" feelings with "zero-width assertion", so that each
: of those looks vaguely to me as though I'm testing a zero-width
: proposition and not consuming any text. And I still tend to think of
: '?' in its "zero or one matches" or "minimal match" connotations.
: Oh well, I suppose I could get used to that.

Yes, there are those interferences, which was one of the reasons for removing ? the last time we had it in that position (albeit on the captures rather than the non-captures). I think we'll have to let it set a while to see how it feels in this role. For the purpose of being a non-alpha no-op, any other non-alpha character would do as well, so maybe the "I am a boolean" feeling is not that useful.

: > Now suppose that we extend that "I am a boolean" feeling to
: >     ?{ code }
: > which might take the place of the confusing (...), and make consistent
: > the notion that we always use {...} to invoke real code.
:
: Hmm, this is nice, however.

In some ways, and not so nice in others, as Luke pointed out.

: > Another problem we've run into is naming if there are multiple assertions
: > of the same name. If the capture name is just the alpha part of the
: > assertion, then we could allow an optional number, and still recognize
: > it as a ws:
: >     ws1 ws2 ws3
: > Except I can well imagine people wanting numbered rules. Drat. Could
: > force people to say ws_1 if they want that, I suppose.
:
: I had been thinking that
:
:     /ws foo ws bar/
:
: would simply cause $ws to be a list of captured elements, similar to
: what might happen for $1 in
:
:     / [ (.*?) , ]* /

That's what happens by default whenever there is a name conflict. This would just be a way of giving a rule a long name as well as a short one, much like abscomplex is the long name of abs when dispatched on a complex number, whereas abs is just the set of all abs() multis, if there is such a beastie.

: If someone really needs the contents of the first and second ws, they
: could do
:
:     (ws) foo (ws)
:
: and get them as $1 and $2. But, seeing this tells me that perhaps
: (rule) should be used for capturing rules, analogous to the
: capturing parens, and leave rule to be the non-capturing version.
: But maybe that's anti-Huffman overall. Maybe the parens could also
: help for disambiguating
:
:     (ws) foo (ws)
:
: so that we end up with $/ws[1], $/ws[2], etc. But then we might
: have to always subscript our named captures, which is icky, or maybe
: we'd only make $/ws act like a list when there's more than one
: capturing (ws) in the rule.
:
: I dunno. I kinda like (rule) for capturing, but maybe it just
: doesn't work.

I thought about that a long time, which was part of the reason I also thought about freeing up (...). But it just seems a little icky to mix together the named captures and numbered captures visually if not semantically. It starts not being at all clear which parentheses count and which ones don't. Which is perhaps another reason for changing current (...) to ?{...}.

We could, I suppose, use a subscript inside:

    ws[0] foo ws[1]
    ws«first» foo ws«second»

but then you'd reference it as

    $ws[0]
    $wsfirst

which is a gratuitous difference, and suffers the same problem as the parentheses in confusing real arrays/hashes with sorta fake ones. So I think we'll stick with the hyphen names for now, which have the benefit of looking the same and not sending us to bracket heaven.

    ws-1 foo ws-2
    ws-first foo ws-second

    $ws-1
    $ws-first

Larry
Re: S05 question
On Tue, Dec 07, 2004 at 12:11:18PM -0700, Patrick R. Michaud wrote: : I'm reviewing the updated S05 (2 Dec 2004) and ran across this : in the Hypothetical Variables section: : : # Pairs of repeated captures can be bound to hashes: : : / %options := [ (ident) = (\N+) ]* / : : Actually, I see three captures there, so should this instead read...? : : / %options := [ («ident») = (\N+) ]* / Probably--that was the intent. Or maybe that style of capture ignores rule captures. (If we do allow rule captures, we have to worry about setting up a hash that is indexed by object rather than by string, which seems not terribly useful for grammar reductions.) Or maybe if there are more than two captures in the brackets, the first becomes the key and the rest of them become a list value, in which case ident's $/ could be the first element and \N+ the second. But somehow I expect that when someone writes (foo) they probably usually meant («foo»). On the other hand, it's an interesting idiom if you really mean that you want the key to be the text value of the match, and the value to be the object value of the match. : Or is it that hypotheticals only bind to things captured by parens? : If so, it might need clarification (or perhaps I'm overlooking the part : that makes it clear). No, I think you just found a blind spot in the design. : A similar question arises a bit later, with : : And this puts a list of lists: : : / $bar := [ (ident) = (\N+) ]* / : : Is the ident capture part of the list of lists that goes into $bar? Wasn't intended to be. I think I just missed changing it to «ident» in the GBS. Larry
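The two bindings under discussion (repeated key/value capture pairs bound to a hash, or to a list of lists) can be mimicked in Python as follows. This is a rough analogy only: the `PAIR` pattern stands in for the ident and \N+ pieces, and the function names are invented for the illustration rather than taken from S05.

```python
import re

# Hypothetical stand-in for "ident = \N+": an identifier, an equals sign,
# then everything up to the end of the line.
PAIR = re.compile(r"(?P<key>[A-Za-z_]\w*)\s*=\s*(?P<value>[^\n]+)")


def bind_options(text):
    """Analogue of binding repeated capture pairs to a hash."""
    return {m["key"]: m["value"] for m in PAIR.finditer(text)}


def bind_list_of_lists(text):
    """Analogue of binding the same captures to a list of lists."""
    return [[m["key"], m["value"]] for m in PAIR.finditer(text)]


config = "name = perl\nversion = 6\n"
print(bind_options(config))
print(bind_list_of_lists(config))
```

Here both captures are plain strings, which sidesteps the question raised above of whether a rule capture used as a hash key should be the matched text or the match object.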