Re: Suggested RegExp Improvements
On Tue, 2010-11-16 at 13:09 +0100, Erik Corry wrote:
> 2010/11/15 Marc Harter wav...@gmail.com:
>> On Mon, 2010-11-15 at 14:06 +0100, Erik Corry wrote:
>>> Your proposal seems to allow variable length lookbehind. This isn't allowed in perl as far as I know. I just tried the following:
>>>
>>>     perl -e 'foobarbaz =~ /a(?<=(ob|bab))/;'
>>>
>>> which gives an error on perl5. I think if we are going to allow variable length lookbehind we should first find out why they don't have it in perl. I think the implementation is a little tricky if you want to support the full regexp language in lookbehinds.
>>
>> This was not my intention. I am proposing zero-width lookbehind, which would not allow for the case you specified above. I will update the
>
> The issue is not with the number of characters consumed by the assertion. This is indeed zero. The issue is with the width of the text matched by the disjunction inside the brackets. This is not any disjunction, but rather a restricted part of the regexp language that can only match a particular number of characters.

Sorry about that. I understand you now.

> It seems the .NET regexp library is able to handle arbitrary content in a lookbehind. It is almost the only one.

Yes, it appears that way. I wonder how beneficial that really is? I believe keeping the same Disjunction we have for lookahead in ECMA-262 would make sense at this point in time, but I am open to pushback.

> See http://www.regular-expressions.info/lookaround.html#lookbehind for more details. We could add this feature to JS. As far as I can work out it presupposes the ability to reverse an arbitrary regexp and run it backwards (stepping back and backtracking forwards). I don't think we should add it accidentally though, and perhaps the proposer should be the first to implement it.

I can take a stab at writing a more detailed description of how to evaluate the Disjunction, as Lasse Reichstein has pointed out (http://www.mail-archive.com/es-discuss@mozilla.org/msg05218.html), but I wouldn't mind help if anyone else is interested, or pointers to implementation specs for Perl lookbehind. I haven't found any yet, just documentation.

>> proposal. It is my understanding that lookahead as implemented in ECMAScript also is zero-width and not variable. This is also how Perl has implemented lookbehind.
>>
>> http://perldoc.perl.org/perlre.html#Extended-Patterns
>>
>> Updated Proposal: https://docs.google.com/document/pub?id=1EUHvr1SC72g6OPo5fJjelVESpd4nI0D5NQpF3oUO5UM
>
> The issue is not that the regexp doesn't match in perl. The issue is that it is not compiled at all.
>
>>> Is there an example of a language that supports the full regexp power in lookbehinds so we can look at their experiences with implementing it?
>>
>> As far as I know Perl is the de facto standard.
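Erik's distinction between the characters an assertion consumes (always zero) and the width of the text its body matches can be seen with the lookahead ECMAScript already has. The following is a minimal sketch; the function name `lookaheadWidths` and the pattern are illustrative only and do not come from the thread.

```javascript
// Lookahead is zero-width: only "a" is consumed by each match,
// yet the alternatives inside the assertion match text of different widths.
function lookaheadWidths(input) {
  const re = /a(?=(rbaz|z))/g;
  const results = [];
  let m;
  while ((m = re.exec(input)) !== null) {
    // m[0] is the consumed text, m[1] is what the lookahead body matched.
    results.push({ index: m.index, consumed: m[0], peeked: m[1] });
  }
  return results;
}

console.log(lookaheadWidths("foobarbaz"));
```

Both matches consume a single "a", while the captured lookahead text is four characters long in one case and one character long in the other. This is the sense in which the assertion is zero-width but its Disjunction is not fixed-width.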
Re: Suggested RegExp Improvements
On Mon, 15 Nov 2010 16:23:13 +0100, Marc Harter wav...@gmail.com wrote:

[look-behind allowing variable length body]

> This was not my intention. I am proposing zero-width lookbehind, which would not allow for the case you specified above.

The grammar allows it. In ECMAScript it would be:

    "foobarbaz".match(/a(?<=(ob|bab)?)/)

which would match the first "a". Had it been written

    "foobarbaz".match(/a(?<=(ob|bab)?.)/)

> I will update the proposal. It is my understanding that lookahead as implemented in ECMAScript also is zero-width and not variable. This is also how Perl has implemented lookbehind.

The look-ahead in ECMAScript has a Disjunction as content, which basically means that it can contain *any* RegExp (including quantified statements and other lookaheads). This works fine because the semantics of the disjunction is the same as any other disjunction in a RegExp: it's matched forwards from a position in the input.

Your proposal also uses a Disjunction as body, but it's not specified how to evaluate that body so that it *ends* at the position of the assertion. Executing a RegExp backwards isn't trivial. Well, mostly it is, by symmetry, but it's not part of the spec.

The positive look-behind should probably be allowed to contain captures that are still participating after the assertion succeeds (mirroring the semantics of the positive look-ahead).

I believe PCRE allows variable length (but structurally simple) look-behinds, where the structure ensures that it doesn't have to do backtracking while checking them, even though Perl itself does not [1]. Whether that's a desired property or not is a different question (I would actually prefer a full backwards-executed regexp to an artificial restriction, but that's mainly ideology :).

/L

[1] http://www.regular-expressions.info/lookaround.html
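Lasse's remark about "structurally simple" look-behinds that need no backtracking can be illustrated with a deliberately restricted sketch. It assumes the look-behind body is a list of literal alternatives, each of known width. The helper name `literalLookbehind` is hypothetical, and this is not a description of how PCRE or Perl is implemented.

```javascript
// Hypothetical helper: a look-behind restricted to literal alternatives.
// Every alternative has a known width, so the check is a direct comparison
// of the text that ends at `pos`; no backtracking is involved.
function literalLookbehind(input, pos, alternatives) {
  return alternatives.some(function (alt) {
    const start = pos - alt.length;
    return start >= 0 && input.slice(start, pos) === alt;
  });
}

// "foob" ends with "ob", so the assertion holds at position 4.
console.log(literalLookbehind("foobarbaz", 4, ["ob", "bab"]));
// Neither "ob" nor "bab" ends at position 5.
console.log(literalLookbehind("foobarbaz", 5, ["ob", "bab"]));
```

The restriction is what makes this cheap: each alternative is tried exactly once at a position computed from its width. A full Disjunction (quantifiers, nested groups, backreferences) has no such fixed starting point, which is the case the thread is debating.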
Re: Suggested RegExp Improvements
[Unterminated statement detected, fixing ...]

On Tue, 16 Nov 2010 13:12:36 +0100, Lasse Reichstein reichsteinatw...@gmail.com wrote:
> On Mon, 15 Nov 2010 16:23:13 +0100, Marc Harter wav...@gmail.com wrote:
>
> [look-behind allowing variable length body]
>
>> This was not my intention. I am proposing zero-width lookbehind, which would not allow for the case you specified above.
>
> The grammar allows it. In ECMAScript it would be:
>
>     "foobarbaz".match(/a(?<=(ob|bab)?)/)
>
> which would match the first "a". Had it been written
>
>     "foobarbaz".match(/a(?<=(ob|bab)?.)/)

... then it would match "a" and capture "ob", assuming semantics symmetric to look-ahead.
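The two patterns from Lasse's message, written with the `(?<=` lookbehind syntax under discussion, can be tried directly in an engine that implements the lookbehind assertions later standardized in ECMAScript 2018. The sketch below assumes such an engine; the engines of 2010 would reject these patterns as syntax errors. The variable names are illustrative.

```javascript
// Requires an engine with ES2018 lookbehind support.

// Body is (ob|bab)? only: nothing ending at the assertion position matches
// "ob" or "bab", so the optional group matches empty and captures nothing.
const withoutDot = "foobarbaz".match(/a(?<=(ob|bab)?)/);

// Body is (ob|bab)?. : the dot accounts for the "a" just matched, and the
// group then matches the "ob" that immediately precedes it.
const withDot = "foobarbaz".match(/a(?<=(ob|bab)?.)/);

console.log(withoutDot.index, withoutDot[0], withoutDot[1]);
console.log(withDot.index, withDot[0], withDot[1]);
```

Both calls match the first "a" at index 4. Only the second one leaves a participating capture, "ob", which is the behavior the follow-up describes as symmetric to look-ahead.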
Re: Suggested RegExp Improvements
2010/11/16 Erik Corry erik.co...@gmail.com:
> It seems the .NET regexp library is able to handle arbitrary content in a lookbehind. It is almost the only one. See http://www.regular-expressions.info/lookaround.html#lookbehind for more details. We could add this feature to JS. As far as I can work out it presupposes the ability to reverse an arbitrary regexp and run it backwards (stepping back and backtracking forwards). I don't think we should add it accidentally though, and perhaps the proposer should be the first to implement it.

Don't you already have to do that to efficiently handle a regexp that ends at the end of the input (in JS, a non-multiline $, or \z in java.util.regex parlance)?

If you have the whole input string available in memory, and are trying to figure out whether a lookbehind (?<=x) matches at position p, can't you just test /(?:x)$/ against the prefix of the input of length p?
Re: Suggested RegExp Improvements
2010/11/16 Mike Samuel mikesam...@gmail.com:
> 2010/11/16 Erik Corry erik.co...@gmail.com:
>> We could add this feature to JS. As far as I can work out it presupposes the ability to reverse an arbitrary regexp and run it backwards (stepping back and backtracking forwards). I don't think we should add it accidentally though, and perhaps the proposer should be the first to implement it.
>
> Don't you already have to do that to efficiently handle a regexp that ends at the end of the input (in JS, a non-multiline $, or \z in java.util.regex parlance)?

V8 doesn't have a general form of that optimization. Do the others?

> If you have the whole input string available in memory, and are trying to figure out whether a lookbehind (?<=x) matches at position p, can't you just test /(?:x)$/ against the prefix of the input of length p?
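Mike Samuel's suggestion, testing /(?:x)$/ against the prefix of the input of length p, can be sketched with features that ECMAScript already had at the time. The helper name `lookbehindAt` and its parameters are hypothetical. It only answers whether the assertion holds and does not expose captures from the body.

```javascript
// Hypothetical helper: emulate the lookbehind (?<=x) at position p by
// anchoring x at the end of the prefix input.slice(0, p).
// `bodySource` is the regexp source text for x. No multiline flag is set,
// so $ matches only at the very end of the prefix.
function lookbehindAt(input, p, bodySource) {
  const anchored = new RegExp("(?:" + bodySource + ")$");
  return anchored.test(input.slice(0, p));
}

console.log(lookbehindAt("foobarbaz", 4, "ob|bab")); // prefix "foob" ends with "ob"
console.log(lookbehindAt("foobarbaz", 5, "ob|bab")); // prefix "fooba" ends with neither
console.log(lookbehindAt("foobarbaz", 7, "a\\w*"));  // variable-length body, prefix "foobarb"
```

The emulation is correct for existence checks, but an engine that searches forwards may try every start position in the prefix before finding one whose match reaches the end. That cost is presumably why the exchange turns on whether engines optimize patterns anchored at the end of the input.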
Re: Suggested RegExp Improvements
Your proposal seems to allow variable length lookbehind. This isn't allowed in perl as far as I know. I just tried the following:

    perl -e 'foobarbaz =~ /a(?<=(ob|bab))/;'

which gives an error on perl5. I think if we are going to allow variable length lookbehind we should first find out why they don't have it in perl. I think the implementation is a little tricky if you want to support the full regexp language in lookbehinds.

Is there an example of a language that supports the full regexp power in lookbehinds so we can look at their experiences with implementing it?

2010/11/15 Marc Harter wav...@gmail.com:
> Brendan et al.,
>
> I have created a proposal for look-behind provided at this link:
>
> https://docs.google.com/document/pub?id=1EUHvr1SC72g6OPo5fJjelVESpd4nI0D5NQpF3oUO5UM
>
> I hope it is a format that will be helpful for discussion with TC39. Admittedly, I have never written one of these before so am completely open to any feedback or ways to improve the document from yourself or anyone else on this list.
>
> Marc
>
> On Sat, 2010-11-13 at 09:32 -0600, Marc Harter wrote:
>> I would be game to write up a proposal for this. When would you need this by to discuss w/ TC39?
>>
>> Thanks for your consideration,
>> Marc
>>
>> On Nov 12, 2010, at 5:04 PM, Brendan Eich bren...@mozilla.com wrote:
>>> On Nov 12, 2010, at 2:52 PM, Marc Harter wrote:
>>>> After considering all the breadth this discussion could take maybe it would be wise to just focus on one issue at a time. For me, the biggest missing feature is lookbehind. It's common to most languages implementing the Perl-RegExp-syntax, and it is very useful when looking for patterns that follow or don't follow a particular pattern. I guess I'm confused why lookahead made it in but not lookbehind.
>>>
>>> This was 1998; the Netscape 4 work I did in '97 was based on Perl 4(!), but we proposed to ECMA TC39 TG1 (the JS group -- things were different then, including capitalization) something based on Perl 5. We didn't get everything, and we had to rationalize some obvious quirks.
>>>
>>> I don't remember lookbehind (which emerged in Perl 5.005 in July '98) being left out on purpose. Waldemar may recall more, I'd handed him the JS keys inside netscape.com to go do mozilla.org.
>>>
>>> If you are game to write a proposal or mini-spec (in the style of ES5 even), let me know. I'll chat with other TC39'ers next week about this.
>>>
>>> /be
>>>
>>>> What do people think about including this feature?
>>>>
>>>> Marc
>>>>
>>>> On Fri, 2010-11-12 at 16:20 -0600, Marc Harter wrote:
>>>>> I will start out with a disclaimer. I have not read both ECMAScript specifications for 3 and now 5, so I admit that I am not an expert in the spec itself, but as a user of JavaScript I would like to get some expert discussion over this topic as proposed enhancements to the RegExp engine for Harmony.
>>>>>
>>>>> I will start with a list of lacking features in JS as compared to Perl provided by (http://www.regular-expressions.info/javascript.html):
>>>>>
>>>>> * No \A or \Z anchors to match the start or end of the string. Use a caret or dollar instead.
>>>>> * Lookbehind is not supported at all. Lookahead is fully supported.
>>>>> * No atomic grouping or possessive quantifiers
>>>>> * No Unicode support, except for matching single characters with \u
>>>>> * No named capturing groups. Use numbered capturing groups instead.
>>>>> * No mode modifiers to set matching options within the regular expression.
>>>>> * No conditionals.
>>>>> * No regular expression comments. Describe your regular expression with JavaScript // comments instead, outside the regular expression string.
>>>>>
>>>>> I don't know if all of these need to be in the language but there have been some that I have personally wanted to use:
>>>>>
>>>>> * Lookbehind! ECMAScript fully supports lookahead, why not lookbehind? Seems like a big hole to me.
>>>>> * Named capturing groups and comments (e.g. http://xregexp.com/syntax/). Mostly I argue for this because it makes RegExp matches more self-documenting. Regular Expressions are already cryptic as it is.
>>>>>
>>>>> I do like some of the new flags proposed at http://xregexp.com/flags/, though I haven't used them personally; maybe that is also something for discussion.
>>>>>
>>>>> Marc Harter
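Two items on Marc's list, named capturing groups and regular expression comments, have the workarounds the quoted feature list itself mentions: numbered groups, and JavaScript comments placed outside the pattern. The sketch below combines them. The pattern, the `dateGroups` table, and the `matchNamed` helper are all hypothetical illustrations, not part of any proposal in the thread.

```javascript
// The pattern is assembled from string parts so that each part can carry an
// ordinary JavaScript comment, since the regexp itself cannot contain comments.
const datePattern = new RegExp(
  "(\\d{4})" +  // group 1: year
  "-(\\d{2})" + // group 2: month
  "-(\\d{2})"   // group 3: day
);

// Hypothetical name-to-index table standing in for named capturing groups.
const dateGroups = { year: 1, month: 2, day: 3 };

// Runs `re` on `input` and returns the captures keyed by name, or null.
function matchNamed(re, groups, input) {
  const m = re.exec(input);
  if (m === null) return null;
  const named = {};
  for (const name of Object.keys(groups)) {
    named[name] = m[groups[name]];
  }
  return named;
}

console.log(matchNamed(datePattern, dateGroups, "released 2010-11-12"));
```

The table has to be kept in sync with the pattern by hand, which is the maintenance burden that the self-documenting syntax Marc asks for would remove.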
Re: Suggested RegExp Improvements
On Mon, 2010-11-15 at 14:06 +0100, Erik Corry wrote: Your proposal seems to allow variable length lookbehind. This isn't allowed in perl as far as I know. I just tried the following: perl -e 'foobarbaz =~ /a(?=(ob|bab))/;' which gives an error on perl5. I think if we are going to allow variable length lookbehind we should first find out why they don't have it in perl. I think the implementation is a little tricky if you want to support the full regexp language in lookbehinds. This was not my intention. I am proposing zero-width lookbehind, which would not allow for the case you specified above. I will update the proposal. It is my understanding that lookahead as implemented in ECMAScript also is zero-width and not variable. This is also how Perl has implemented lookbehind. http://perldoc.perl.org/perlre.html#Extended-Patterns Updated Proposal: https://docs.google.com/document/pub?id=1EUHvr1SC72g6OPo5fJjelVESpd4nI0D5NQpF3oUO5UM Is there an example of a language that supports the full regexp power in lookbehinds so we can look at their experiences with implementing it? As far as I know Perl is the de facto standard. 2010/11/15 Marc Harter wav...@gmail.com: Brendan et al., I have created a proposal for look-behind provided at this link: https://docs.google.com/document/pub?id=1EUHvr1SC72g6OPo5fJjelVESpd4nI0D5NQpF3oUO5UM I hope it is a format that will be helpful for discussion with TC39. Admittedly, I have never written one of these before so am completely open to any feedback or ways to improve the document from yourself or anyone else on this list. Marc On Sat, 2010-11-13 at 09:32 -0600, Marc Harter wrote: I would be game to write up a proposal for this. When would you need this by to discuss w/ TC39? 
Thanks for your consideration, Marc

On Nov 12, 2010, at 5:04 PM, Brendan Eich bren...@mozilla.com wrote:

On Nov 12, 2010, at 2:52 PM, Marc Harter wrote:

After considering all the breadth this discussion could take, maybe it would be wise to just focus on one issue at a time. For me, the biggest missing feature is lookbehind. It's common to most languages implementing the Perl-RegExp-syntax, and it is very useful when looking for patterns that follow or don't follow a particular pattern. I guess I'm confused why lookahead made it in but not lookbehind.

This was 1998. Netscape 4 work I did in '97 was based on Perl 4(!), but we proposed to ECMA TC39 TG1 (the JS group -- things were different then, including capitalization) something based on Perl 5. We didn't get everything, and we had to rationalize some obvious quirks. I don't remember lookbehind (which emerged in Perl 5.005 in July '98) being left out on purpose. Waldemar may recall more; I'd handed him the JS keys inside netscape.com to go do mozilla.org. If you are game to write a proposal or mini-spec (in the style of ES5 even), let me know. I'll chat with other TC39'ers next week about this. /be

What do people think about including this feature? Marc

On Fri, 2010-11-12 at 16:20 -0600, Marc Harter wrote:

I will start out with a disclaimer. I have not read both ECMAScript specifications for 3 and now 5, so I admit that I am not an expert in the spec itself, but as a user of JavaScript, I would like to get some expert discussion over this topic as proposed enhancements to the RegExp engine for Harmony. I will start with a list of lacking features in JS as compared to Perl provided by (http://www.regular-expressions.info/javascript.html):

* No \A or \Z anchors to match the start or end of the string. Use a caret or dollar instead.
* Lookbehind is not supported at all. Lookahead is fully supported.
* No atomic grouping or possessive quantifiers
* No Unicode support, except for matching single characters with \u
* No named capturing groups. Use numbered capturing groups instead.
* No mode modifiers to set matching options within the regular expression.
* No conditionals.
* No regular expression comments. Describe your regular expression with JavaScript // comments instead, outside the regular expression string.

I don't know if all of these need to be in the language but there have been some that I have personally wanted to use:

* Lookbehind! ECMAScript fully supports lookahead, why not lookbehind? Seems like a big hole to me.
* Named capturing groups and comments (e.g. http://xregexp.com/syntax/). Mostly I argue for this because it makes RegExp matches more self-documenting. Regular Expressions are already cryptic as it is.

I do like some of the new flags proposed in (http://xregexp.com/flags/), though I personally haven't used them; maybe that is something also for discussion.

Marc Harter
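As a rough illustration of the fixed-width point raised in the exchange above: when every alternative in a lookbehind has the same length n, a matcher can step back n characters once and compare forwards, whereas alternatives of differing lengths (as in `ob|bab`) each need their own start position. The helper names below (`fixedWidth`, `lookbehindMatches`) are hypothetical and handle literal alternatives only, not the full regexp language; this is a sketch under those assumptions, not how Perl or any particular engine implements it.

```javascript
// Hypothetical helpers; literal alternatives only (no quantifiers, classes, etc.).

// Returns the common length of all alternatives, or null if the lengths
// differ (i.e. the lookbehind would be variable-width).
function fixedWidth(alternatives) {
  const widths = new Set(alternatives.map((alt) => alt.length));
  return widths.size === 1 ? alternatives[0].length : null;
}

// Checks whether one of the alternatives ends exactly at position `pos`.
function lookbehindMatches(input, pos, alternatives) {
  const width = fixedWidth(alternatives);
  if (width !== null) {
    // Fixed-width case: a single step back, then compare forwards.
    if (pos < width) return false;
    return alternatives.includes(input.slice(pos - width, pos));
  }
  // Variable-width case: each alternative needs its own start position.
  return alternatives.some(
    (alt) => pos >= alt.length && input.slice(pos - alt.length, pos) === alt
  );
}

console.log(fixedWidth(["ob", "ba"]));  // 2
console.log(fixedWidth(["ob", "bab"])); // null
console.log(lookbehindMatches("foobarbaz", 4, ["ob", "bab"])); // true ("ob" ends at index 4)
```

Once quantifiers and nested groups are allowed inside the lookbehind, the set of candidate start positions is no longer a short list of literal lengths, which is the part Erik describes as tricky.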
Re: Suggested RegExp Improvements
Brendan et al.,

I have created a proposal for look-behind provided at this link: https://docs.google.com/document/pub?id=1EUHvr1SC72g6OPo5fJjelVESpd4nI0D5NQpF3oUO5UM

I hope it is a format that will be helpful for discussion with TC39. Admittedly, I have never written one of these before so am completely open to any feedback or ways to improve the document from yourself or anyone else on this list.

Marc

___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
Re: Suggested RegExp Improvements
I would be game to write up a proposal for this. When would you need this by to discuss w/ TC39?

Thanks for your consideration,
Marc
Re: Suggested RegExp Improvements
After considering all the breadth this discussion could take, maybe it would be wise to just focus on one issue at a time. For me, the biggest missing feature is lookbehind. It's common to most languages implementing the Perl-RegExp-syntax, and it is very useful when looking for patterns that follow or don't follow a particular pattern. I guess I'm confused why lookahead made it in but not lookbehind.

What do people think about including this feature?

Marc
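For context on what the missing feature costs in practice, a common way to approximate lookbehind in JavaScript as it stands is to match the preceding text as part of the pattern and keep only a capture group. The sketch below is an assumed example (the `pricesAfterDollar` name and the sample string are made up for illustration). It is not a full substitute: the prefix is consumed by the match, so overlapping matches and reported match positions differ from what a true lookbehind would give.

```javascript
// Hypothetical example: extract numbers that follow a "$" without lookbehind.
// In Perl-style syntax the lookbehind form would be (?<=\$)\d+
function pricesAfterDollar(text) {
  var re = /\$(\d+)/g; // the "$" is matched as part of the pattern, then discarded
  var results = [];
  var m;
  while ((m = re.exec(text)) !== null) {
    results.push(m[1]); // keep only the captured digits
  }
  return results;
}

console.log(pricesAfterDollar("cost $15, not 20 or $7")); // logs the array ["15", "7"]
```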
Re: Suggested RegExp Improvements
On Nov 12, 2010, at 2:52 PM, Marc Harter wrote:

After considering all the breadth this discussion could take, maybe it would be wise to just focus on one issue at a time. For me, the biggest missing feature is lookbehind. It's common to most languages implementing the Perl-RegExp-syntax, and it is very useful when looking for patterns that follow or don't follow a particular pattern. I guess I'm confused why lookahead made it in but not lookbehind.

This was 1998. Netscape 4 work I did in '97 was based on Perl 4(!), but we proposed to ECMA TC39 TG1 (the JS group -- things were different then, including capitalization) something based on Perl 5. We didn't get everything, and we had to rationalize some obvious quirks. I don't remember lookbehind (which emerged in Perl 5.005 in July '98) being left out on purpose. Waldemar may recall more; I'd handed him the JS keys inside netscape.com to go do mozilla.org. If you are game to write a proposal or mini-spec (in the style of ES5 even), let me know. I'll chat with other TC39'ers next week about this.

/be

What do people think about including this feature? Marc
Re: Suggested RegExp Improvements
On 11/12/10 15:04, Brendan Eich wrote:

On Nov 12, 2010, at 2:52 PM, Marc Harter wrote:

After considering all the breadth this discussion could take, maybe it would be wise to just focus on one issue at a time. For me, the biggest missing feature is lookbehind. It's common to most languages implementing the Perl-RegExp-syntax, and it is very useful when looking for patterns that follow or don't follow a particular pattern. I guess I'm confused why lookahead made it in but not lookbehind.

This was 1998. Netscape 4 work I did in '97 was based on Perl 4(!), but we proposed to ECMA TC39 TG1 (the JS group -- things were different then, including capitalization) something based on Perl 5. We didn't get everything, and we had to rationalize some obvious quirks. I don't remember lookbehind (which emerged in Perl 5.005 in July '98) being left out on purpose. Waldemar may recall more; I'd handed him the JS keys inside netscape.com to go do mozilla.org. If you are game to write a proposal or mini-spec (in the style of ES5 even), let me know. I'll chat with other TC39'ers next week about this.

The ES3 spec was based on what was stable at the time. Perl had been experimenting with other constructs in regexps, but there was some churn there, and I didn't want to go for features that were still in flux.

Waldemar