Re: Q: Lonely surrogates and unicode regexps
On 28 Jan 2015, at 11:36, Marja Hölttä ma...@chromium.org wrote: TL;DR: /foo.bar/u.test(“foo\uD83Dbar”) == ? The ES6 unicode regexp spec is not very clear regarding what should happen if the regexp or the matched string contains lonely surrogates (a lead surrogate without a trail, or a trail without a lead). For example, for the . operator, the relevant parts of the spec speak about characters: https://people.mozilla.org/~jorendorff/es6-draft.html#sec-atom https://people.mozilla.org/~jorendorff/es6-draft.html#sec-runtime-semantics-charactersetmatcher-abstract-operation https://people.mozilla.org/~jorendorff/es6-draft.html#sec-runtime-semantics-canonicalize-abstract-operation E.g., “Let A be the set of all *characters* except LineTerminator.” “Let ch be the *character* Input[e].” But is a lonely surrogate a character? According to the Unicode standard, it’s not. If it's not, what will ch be if the input string contains a lonely surrogate in the relevant position? Q1: Are lonely surrogates allowed in /u regexps? E.g., /foo\uD83D/u; (note lonely lead surrogate), should this be allowed? Will it match a lead surrogate inside a surrogate pair? Suggestion: we shouldn't allow lonely surrogates in /u regexps. If users actually want to match lonely surrogates (e.g., to check for them or remove them) then they can use non-/u regexps. You’re proposing to define “characters” in terms of Unicode scalar values in the case `/u` is used. I could get behind that — it reinforces the idea that `/u` is like a strict mode for regular expressions. Playing devil’s advocate, the problem is that regular expressions and strings go hand in hand, and there is no guarantee that JavaScript strings only consist of valid code points. Making `.` not match lone surrogates breaks the developer expectation that `(.)` matches every “part” of the string. Having to avoid `/u` to prevent this seems like a potentially bad thing. 
The regexp syntax treats a lonely surrogate as a normal unicode escape, and the rules say e.g., The production RegExpUnicodeEscapeSequence :: u Hex4Digits evaluates as follows: Return the character whose code is the SV of Hex4Digits. - it's also unclear what this means if no valid character has this code. Q2: If the string contains a lonely surrogate, what should it match? Should it match .? Should it match [^a] ? (Or is it undefined behavior?) Test cases: /foo.bar/u.test(foo\uD83Dbar) == ? /foo.bar/u.test(foo\uDC00bar) == ? /foo[^a]bar/u.test(foo\uD83Dbar) == ? /foo[^a]bar/u.test(foo\uDC00bar) == ? /foo/u.test(bar\uD83Dbarfoo) == ? /foo/u.test(bar\uDC00barfoo) == ? /foo(.*)bar\1/u.test(foo\uD834bar\uD834\uDC00) == ? // Should the backreference be allowed to match the lead surrogate of a surrogate pair? /^(.+)\1$/u.test(\uDC00foobar\uD83D\uDC00foobar\uD83D) == ?? // Should we allow splitting the surrogate pair like this? Suggestion: a lonely surrogate should not be a character and it should not match . or [^a] etc. However, a lonely surrogate in the input string shouldn't prevent some other part of the string from matching. If a lonely surrogate is treated as a character, the matching rule for . gets complicated and difficult / slow to implement: . should not match individual surrogates inside a surrogate pair, but if it has to match a lonely surrogate, we'll end up needing lookahead and lookbehind logic to implement that behavior. For example, the current version of Mathias’s ES6 Unicode regular expression transpiler ( https://mothereff.in/regexpu ) converts /a.b/u into /a(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\u]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])b/ and afaics it’s not yet fully consistent wrt lonely surrogates, so, a consistent implementation is going to be more complex than this. This is indeed an incomplete solution. 
The lack of lookbehind support in ES makes this hard to transpile correctly. Ideas welcome! ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
Re: Q: Lonely surrogates and unicode regexps
On 1/28/2015 2:51 PM, André Bargull wrote: For a reference, here's how Java (tried w/ Oracle 1.8.0_31 and openjdk 1.7.0_65) Pattern.UNICODE_CHARACTER_CLASS works: foo\uD834bar and foo\uDC00bar match ^foo[^a]bar$ and ^foo.bar$, so, generally, lonely surrogates match /./. Backreferences are allowed to consume the leading surrogate of a valid surrogate pair: Ex1: foo\uD834bar\uD834\uDC00 matches foo(.+)bar\1 But surprisingly: Ex2: \uDC00foobar\uD834\uDC00foobar\uD834 doesn't match ^(.+)\1$ ... So Ex2 works as if the input string was converted to UTF-32 before matching, but Ex1 works as if it was def not. Idk what's the correct mental model where both Ex1 and Ex2 would make sense. java.util.regex.Pattern matches back references by comparing (Java) chars [1], but reads patterns as a sequence of code points [2]. That should help to explain the differences between ex1 and ex2. [1] http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l4890 [2] http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l1671 Err, the part about how patterns are read is not important here. What I should have written is that the input string is (also) read as a sequence of code points [3]. So in ex2 `\uD834\uDC00` is read as a single code point (and not split into \uD834 and \uDC00 during backtracking). [3] http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l3773 ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
Maximum String length
Typically, implementation-specific things aren't specified in the spec (like Math precision, etc.), although when something is implementation-specific it is usually noted explicitly as such ( https://people.mozilla.org/~jorendorff/es6-draft.html#sec-date.parse , https://people.mozilla.org/~jorendorff/es6-draft.html#sec-math.hypot , https://people.mozilla.org/~jorendorff/es6-draft.html#sec-ecmascript-language-types-number-type , https://people.mozilla.org/~jorendorff/es6-draft.html#sec-object.keys , etc.).

Strings are only defined in ES6 as a primitive value that is a finite ordered sequence of zero or more 16-bit unsigned integers ( https://people.mozilla.org/~jorendorff/es6-draft.html#sec-terms-and-definitions-string-value ) and are not noted as having any implementation-specific or implementation-dependent qualities.

To me, finite here means `Number.MAX_VALUE`, i.e., the highest number I can get before I reach Infinity. An alternative reading is any number greater than zero that's not Infinity, but at that point an implementation conforms if its max length is 1, which obviously would be silly.

However, Chrome 40 and Opera 26-27 have a limit of `0xFFFFFF0` (`2**28 - 2**4`), Firefox 35 and IE 9-11 all have a limit of `0xFFFFFFF` (`2**28 - 1`), and Safari 8 has `0x7FFFFFFF` (`2**31 - 1`). There are many more browsers I haven't tested, of course, but it'd be interesting to know how widely these numbers deviate.

1) Should an engine's max string length be exposed, like `Number.MAX_VALUE`, as `String.MAX_LENGTH`? This will help, for example, my `String#repeat` polyfill throw an earlier `RangeError` rather than having to try to build a string of that length.
2) Should the spec require a minimum maximum string length, or at least be more specific in how it defines finite?
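To make question 1 concrete, here is a minimal sketch of the early check a `String#repeat`-style helper could perform if an engine exposed its limit. `String.MAX_LENGTH` does not exist; `ASSUMED_MAX_LENGTH` and `repeatChecked` are hypothetical names, and the constant is set to the Chrome 40 figure quoted above purely for illustration.

```javascript
// Hypothetical stand-in for the proposed String.MAX_LENGTH (not a real API).
// The value is the Chrome 40 limit reported in this message.
const ASSUMED_MAX_LENGTH = 2 ** 28 - 2 ** 4;

// Simplified repeat: rejects an oversized result before building anything.
// (A real polyfill would also coerce `count` the way the spec describes.)
function repeatChecked(str, count, maxLength = ASSUMED_MAX_LENGTH) {
  if (!Number.isInteger(count) || count < 0) {
    throw new RangeError("Invalid count value");
  }
  if (str.length * count > maxLength) {
    throw new RangeError("Resulting string would exceed the maximum length");
  }
  let result = "";
  for (let i = 0; i < count; i++) {
    result += str;
  }
  return result;
}

console.log(repeatChecked("ab", 3)); // "ababab"
```

Note that such a check only rules out lengths the engine could never represent; it says nothing about whether memory is actually available for a shorter string.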
Re: Q: Lonely surrogates and unicode regexps
For a reference, here's how Java (tried w/ Oracle 1.8.0_31 and openjdk 1.7.0_65) Pattern.UNICODE_CHARACTER_CLASS works: foo\uD834bar and foo\uDC00bar match ^foo[^a]bar$ and ^foo.bar$, so, generally, lonely surrogates match /./. Backreferences are allowed to consume the leading surrogate of a valid surrogate pair: Ex1: foo\uD834bar\uD834\uDC00 matches foo(.+)bar\1 But surprisingly: Ex2: \uDC00foobar\uD834\uDC00foobar\uD834 doesn't match ^(.+)\1$ ... So Ex2 works as if the input string was converted to UTF-32 before matching, but Ex1 works as if it was def not. Idk what's the correct mental model where both Ex1 and Ex2 would make sense. ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
Re: Q: Lonely surrogates and unicode regexps
I think the cleanest mental model is where UTF-16 or UTF-8 strings are interpreted as if they were transformed into UTF-32. While that is generally feasible, it often represents a cost in performance which is not acceptable in practice. So you see various approaches that involve some deviation from that mental model. Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On Wed, Jan 28, 2015 at 2:15 PM, Marja Hölttä ma...@chromium.org wrote: For a reference, here's how Java (tried w/ Oracle 1.8.0_31 and openjdk 1.7.0_65) Pattern.UNICODE_CHARACTER_CLASS works: foo\uD834bar and foo\uDC00bar match ^foo[^a]bar$ and ^foo.bar$, so, generally, lonely surrogates match /./. Backreferences are allowed to consume the leading surrogate of a valid surrogate pair: Ex1: foo\uD834bar\uD834\uDC00 matches foo(.+)bar\1 But surprisingly: Ex2: \uDC00foobar\uD834\uDC00foobar\uD834 doesn't match ^(.+)\1$ ... So Ex2 works as if the input string was converted to UTF-32 before matching, but Ex1 works as if it was def not. Idk what's the correct mental model where both Ex1 and Ex2 would make sense. ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
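The "as if transformed into UTF-32" model can be sketched in JavaScript by listing the code points of the two example strings. This is only an illustration of the mental model, not of how Java or any JS engine implements matching; `toCodePoints` is a helper name introduced here.

```javascript
// View a JavaScript (UTF-16) string as a list of code points, the way a
// UTF-32 conversion would. A well-formed surrogate pair becomes one entry;
// a lonely surrogate is kept as its own entry.
function toCodePoints(str) {
  return Array.from(str, (ch) => ch.codePointAt(0));
}

const ex1 = toCodePoints("foo\uD834bar\uD834\uDC00");
const ex2 = toCodePoints("\uDC00foobar\uD834\uDC00foobar\uD834");

console.log(ex1.length); // 8: f, o, o, U+D834, b, a, r, U+1D000
console.log(ex2.length); // 15: an odd count, so no split into two equal halves
```

In the code-point view, Ex2 has 15 elements and therefore cannot be of the form (.+)\1, whereas the same string has 16 UTF-16 code units that do split into two identical halves. That is the difference between the two readings discussed above.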
Re: Maximum String length
On 28 January 2015 at 13:14, Claude Pache claude.pa...@gmail.com wrote:

To me, finite is just to be taken in the common mathematical sense of the term; in particular you could have theoretically a string of length 10^1. But yes, it would be reasonable to restrict oneself to strings of length at most 2^52, so that `string.length` could always return an exact answer.

To me it would be reasonable to restrict oneself to much shorter strings, since no existing machine has the memory to represent a string of length 2^52, nor will any in the foreseeable future. ;)

VMs can always run into out-of-memory conditions. In general, there is no way to predict those. Even strings shorter than the hard-coded length limit might cause you to go OOM. So providing reflection on a constant like that might do little but give a false sense of safety.

/Andreas
Re: Q: Lonely surrogates and unicode regexps
For a reference, here's how Java (tried w/ Oracle 1.8.0_31 and openjdk 1.7.0_65) Pattern.UNICODE_CHARACTER_CLASS works: foo\uD834bar and foo\uDC00bar match ^foo[^a]bar$ and ^foo.bar$, so, generally, lonely surrogates match /./. Backreferences are allowed to consume the leading surrogate of a valid surrogate pair: Ex1: foo\uD834bar\uD834\uDC00 matches foo(.+)bar\1 But surprisingly: Ex2: \uDC00foobar\uD834\uDC00foobar\uD834 doesn't match ^(.+)\1$ ... So Ex2 works as if the input string was converted to UTF-32 before matching, but Ex1 works as if it was def not. Idk what's the correct mental model where both Ex1 and Ex2 would make sense. java.util.regex.Pattern matches back references by comparing (Java) chars [1], but reads patterns as a sequence of code points [2]. That should help to explain the differences between ex1 and ex2. [1] http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l4890 [2] http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l1671 ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
Re: Q: Lonely surrogates and unicode regexps
Based on Ex1, looks like the input string is not read as a sequence of code points when we try to find a match for \1. So it's mostly read as a sequence of code points except when it's not. :/ On Wed, Jan 28, 2015 at 3:11 PM, André Bargull andre.barg...@udo.edu wrote: On 1/28/2015 2:51 PM, André Bargull wrote: For a reference, here's how Java (tried w/ Oracle 1.8.0_31 and openjdk 1.7.0_65) Pattern.UNICODE_CHARACTER_CLASS works: foo\uD834bar and foo\uDC00bar match ^foo[^a]bar$ and ^foo.bar$, so, generally, lonely surrogates match /./. Backreferences are allowed to consume the leading surrogate of a valid surrogate pair: Ex1: foo\uD834bar\uD834\uDC00 matches foo(.+)bar\1 But surprisingly: Ex2: \uDC00foobar\uD834\uDC00foobar\uD834 doesn't match ^(.+)\1$ ... So Ex2 works as if the input string was converted to UTF-32 before matching, but Ex1 works as if it was def not. Idk what's the correct mental model where both Ex1 and Ex2 would make sense. java.util.regex.Pattern matches back references by comparing (Java) chars [1], but reads patterns as a sequence of code points [2]. That should help to explain the differences between ex1 and ex2. [1] http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/ c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l4890 [2] http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/ c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l1671 Err, the part about how patterns are read is not important here. What I should have written is that the input string is (also) read as a sequence of code points [3]. So in ex2 `\uD834\uDC00` is read as a single code point (and not split into \uD834 and \uDC00 during backtracking). [3] http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/ c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l3773 ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
Q: Lonely surrogates and unicode regexps
Hello es-discuss,

TL;DR: /foo.bar/u.test("foo\uD83Dbar") == ?

The ES6 unicode regexp spec is not very clear regarding what should happen if the regexp or the matched string contains lonely surrogates (a lead surrogate without a trail, or a trail without a lead). For example, for the . operator, the relevant parts of the spec speak about characters:

https://people.mozilla.org/~jorendorff/es6-draft.html#sec-atom
https://people.mozilla.org/~jorendorff/es6-draft.html#sec-runtime-semantics-charactersetmatcher-abstract-operation
https://people.mozilla.org/~jorendorff/es6-draft.html#sec-runtime-semantics-canonicalize-abstract-operation

E.g., “Let A be the set of all *characters* except LineTerminator.” “Let ch be the *character* Input[e].”

But is a lonely surrogate a character? According to the Unicode standard, it’s not. If it's not, what will ch be if the input string contains a lonely surrogate in the relevant position?

Q1: Are lonely surrogates allowed in /u regexps?

E.g., /foo\uD83D/u; (note lonely lead surrogate), should this be allowed? Will it match a lead surrogate inside a surrogate pair?

Suggestion: we shouldn't allow lonely surrogates in /u regexps. If users actually want to match lonely surrogates (e.g., to check for them or remove them) then they can use non-/u regexps.

The regexp syntax treats a lonely surrogate as a normal unicode escape, and the rules say e.g., "The production RegExpUnicodeEscapeSequence :: u Hex4Digits evaluates as follows: Return the character whose code is the SV of Hex4Digits." It's also unclear what this means if no valid character has this code.

Q2: If the string contains a lonely surrogate, what should it match? Should it match .? Should it match [^a] ? (Or is it undefined behavior?)

Test cases:

/foo.bar/u.test("foo\uD83Dbar") == ?
/foo.bar/u.test("foo\uDC00bar") == ?
/foo[^a]bar/u.test("foo\uD83Dbar") == ?
/foo[^a]bar/u.test("foo\uDC00bar") == ?
/foo/u.test("bar\uD83Dbarfoo") == ?
/foo/u.test("bar\uDC00barfoo") == ?
/foo(.*)bar\1/u.test("foo\uD834bar\uD834\uDC00") == ? // Should the backreference be allowed to match the lead surrogate of a surrogate pair?
/^(.+)\1$/u.test("\uDC00foobar\uD83D\uDC00foobar\uD83D") == ? // Should we allow splitting the surrogate pair like this?

Suggestion: a lonely surrogate should not be a character and it should not match . or [^a] etc. However, a lonely surrogate in the input string shouldn't prevent some other part of the string from matching.

If a lonely surrogate is treated as a character, the matching rule for . gets complicated and difficult / slow to implement: . should not match individual surrogates inside a surrogate pair, but if it has to match a lonely surrogate, we'll end up needing lookahead and lookbehind logic to implement that behavior.

For example, the current version of Mathias’s ES6 Unicode regular expression transpiler ( https://mothereff.in/regexpu ) converts /a.b/u into

/a(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\u]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])b/

and afaics it’s not yet fully consistent wrt lonely surrogates, so a consistent implementation is going to be more complex than this.

If we convert the string into UTF-32 before matching, then the "lonely surrogate is a character" behavior gets easier to implement, but we wouldn't want to be forced to do that. The intention behind the ES6 spec seems to be that strings can / should still be stored as UTF-16. Converting strings to UTF-32 before matching with /u regexps would require an additional pass over the string which we'd want to avoid, and converting only when strictly needed for the "lonely surrogate is a character" implementation adds complexity. E.g., with some regexps we don't need to scan the whole input string to find a match, and also most input strings, even for /u regexps, probably won't contain surrogates (to find that out we'd also need to scan the whole string, or some logic to fall back to UTF-32 matching when we see a surrogate).

BR, Marja
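As an illustration of the suggestion that code wanting to check for lonely surrogates can use a non-/u regexp, here is a minimal sketch. The pattern reuses the two lonely-surrogate alternatives from the regexpu output quoted in this message; `LONE_SURROGATE` and `hasLoneSurrogate` are names introduced here, and the sketch only detects lonely surrogates (removing them would need more care, since the second alternative also consumes the preceding code unit).

```javascript
// Non-/u pattern, so it works on UTF-16 code units:
//   - a lead surrogate not followed by a trail surrogate, or
//   - a trail surrogate at the start of the string or after a non-lead unit.
const LONE_SURROGATE =
  /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]/;

function hasLoneSurrogate(str) {
  return LONE_SURROGATE.test(str);
}

console.log(hasLoneSurrogate("foo\uD83Dbar"));       // true  (lonely lead)
console.log(hasLoneSurrogate("foo\uDC00bar"));       // true  (lonely trail)
console.log(hasLoneSurrogate("foo\uD83D\uDC00bar")); // false (well-formed pair)
```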
Re: Q: Lonely surrogates and unicode regexps
On Wed, Jan 28, 2015 at 11:36 AM, Marja Hölttä marja at chromium.org wrote:

The ES6 unicode regexp spec is not very clear regarding what should happen if the regexp or the matched string contains lonely surrogates (a lead surrogate without a trail, or a trail without a lead). For example, for the . operator, the relevant parts of the spec speak about characters:

Just a bit of terminology. The term character is overloaded, so Unicode provides the unambiguous term code point. For example, U+0378 is not (currently) an encoded character according to Unicode, but it would certainly be a terrible idea to disregard it, or not match it. It is a reserved code point that may be assigned as an encoded character in the future. So both U+D83D and U+0378 are not characters. If an ES spec uses the term character instead of code point, then at some point in the text it needs to disambiguate what is meant.

character is defined in 21.2.2 Pattern Semantics [1]:

In the context of describing the behaviour of a BMP pattern “character” means a single 16-bit Unicode BMP code point. In the context of describing the behaviour of a Unicode pattern “character” means a UTF-16 encoded code point.

[1] https://people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern-semantics
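The two meanings of “character” in that definition can be observed on a well-formed surrogate pair. A small sketch, assuming an engine that implements the ES2015 `u` flag:

```javascript
// One supplementary code point, stored as two UTF-16 code units.
const pair = "\uD83D\uDE00";

// BMP pattern: "character" is a single 16-bit unit, so . takes one unit.
const bmpMatch = pair.match(/^./)[0];

// Unicode pattern: "character" is a UTF-16 encoded code point, so . takes both units.
const unicodeMatch = pair.match(/^./u)[0];

console.log(bmpMatch.length);     // 1
console.log(unicodeMatch.length); // 2
```

This only covers the well-formed case; what a Unicode pattern should do with a lonely surrogate is exactly the question raised in this thread.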
Re: Q: Lonely surrogates and unicode regexps
Good, that sounds right.

Mark
https://google.com/+MarkDavis
*— Il meglio è l’inimico del bene —*

On Wed, Jan 28, 2015 at 12:57 PM, André Bargull andre.barg...@udo.edu wrote:

On Wed, Jan 28, 2015 at 11:36 AM, Marja Hölttä marja at chromium.org wrote:

The ES6 unicode regexp spec is not very clear regarding what should happen if the regexp or the matched string contains lonely surrogates (a lead surrogate without a trail, or a trail without a lead). For example, for the . operator, the relevant parts of the spec speak about characters:

Just a bit of terminology. The term character is overloaded, so Unicode provides the unambiguous term code point. For example, U+0378 is not (currently) an encoded character according to Unicode, but it would certainly be a terrible idea to disregard it, or not match it. It is a reserved code point that may be assigned as an encoded character in the future. So both U+D83D and U+0378 are not characters. If an ES spec uses the term character instead of code point, then at some point in the text it needs to disambiguate what is meant.

character is defined in 21.2.2 Pattern Semantics [1]:

In the context of describing the behaviour of a BMP pattern “character” means a single 16-bit Unicode BMP code point. In the context of describing the behaviour of a Unicode pattern “character” means a UTF-16 encoded code point.

[1] https://people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern-semantics
Re: Q: Lonely surrogates and unicode regexps
On Wed, Jan 28, 2015 at 11:45 AM, Mathias Bynens math...@qiwi.be wrote: On 28 Jan 2015, at 11:36, Marja Hölttä ma...@chromium.org wrote: For example, the current version of Mathias’s ES6 Unicode regular expression transpiler ( https://mothereff.in/regexpu ) converts /a.b/u into /a(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\u]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])b/ and afaics it’s not yet fully consistent wrt lonely surrogates, so, a consistent implementation is going to be more complex than this. This is indeed an incomplete solution. The lack of lookbehind support in ES makes this hard to transpile correctly. Ideas welcome! I don't think your transpiler can work without lookbehind. If you could guarantee that none of your transpiled regexp matches a substring that ends in the middle of a pair, then I think you could get it right without lookbehind, but consider: TxL-TxLT.test(/(...)-\1./); Where L stands for a lead surrogate, and T stands for a trailing surrogate. There's no way to stop the backreference from swallowing the last L, and without lookbehind there is no way to stop the . from matching the final T. A second issue is having a match that starts in the middle of a pair. You could test for this after the matching if JS gave you the index of the match in the string, but I don't think it does. Ignoring the start-of-match-in-the-middle-of-a-pair issue, and the backreferences case, I think you can do without the backreference. Assuming the lonely-surrogates-are-a-character scenario, the period (.) transpiles to (ignore spaces added for readability): (?: \L(?!\T) | \L\T | \T | [^\L\T\N]) where \L means leading surrogates, \T means trailing surrogates, \N means all newlines. Whatever comes before the . is not allowed to match a half As an optimization, .x can transpile to (?: \L\T | . )x where the x stands in for any literal characters. 
For a JS engine implementor, like Marja, it is of course possible to add 1-character negative lookbehind (\b already has elements of this). Then your in-engine transpiler turns . into

(?: \L(?!\T) | \L\T | (?<!\L)\T | [^\L\T\N])

which is going to be truly horrible in terms of code size and performance. It's not like the period operator is a rare thing in a regexp, and other common things like [^a-z] and [^\d] will expand into similar horrors.

On the other hand, in the lonely-surrogates-match-nothing scenario, the . transpiles to

(?: \L\T | [^\L\T\N] )

which is quite a lot nicer and faster. In this scenario, .x expands to

(?: \L\T | [^\T\L\N] )x

which still has no lookaheads and lookbehinds.
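The lonely-surrogates-match-nothing expansion can be tried out directly as a non-/u regexp. This is a sketch of that one scenario only, not of regexpu or of any engine; the constants `L`, `T` and `N` follow the \L, \T and \N shorthand used above, and the choice of line terminators for `N` is an assumption.

```javascript
// Character-class bodies for the shorthand used in this message.
const L = "\\uD800-\\uDBFF";          // lead surrogates
const T = "\\uDC00-\\uDFFF";          // trail surrogates
const N = "\\n\\r\\u2028\\u2029";     // line terminators (assumed set)

// "." in the lonely-surrogates-match-nothing scenario:
// a full surrogate pair, or any single unit that is not a surrogate or newline.
const DOT = `(?:[${L}][${T}]|[^${L}${T}${N}])`;

// Hand-transpiled equivalent of /a.b/u under that scenario.
const aDotB = new RegExp(`a${DOT}b`);

console.log(aDotB.test("axb"));             // true
console.log(aDotB.test("a\uD83D\uDE00b"));  // true  (pair matched as a whole)
console.log(aDotB.test("a\uD83Db"));        // false (lonely lead matches nothing)
```

No lookahead or lookbehind is needed here because neither alternative can stop or start in the middle of a pair.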
Re: Maximum String length
On 28 Jan 2015, at 09:58, Jordan Harband ljh...@gmail.com wrote:

Typically, implementation-specific things aren't specified in the spec (like Math precision, etc.), although when something is implementation-specific it is usually noted explicitly as such ( https://people.mozilla.org/~jorendorff/es6-draft.html#sec-date.parse , https://people.mozilla.org/~jorendorff/es6-draft.html#sec-math.hypot , https://people.mozilla.org/~jorendorff/es6-draft.html#sec-ecmascript-language-types-number-type , https://people.mozilla.org/~jorendorff/es6-draft.html#sec-object.keys , etc.).

Strings are only defined in ES6 as a primitive value that is a finite ordered sequence of zero or more 16-bit unsigned integers ( https://people.mozilla.org/~jorendorff/es6-draft.html#sec-terms-and-definitions-string-value ) and are not noted as having any implementation-specific or implementation-dependent qualities.

To me, finite here means `Number.MAX_VALUE`, i.e., the highest number I can get before I reach Infinity. An alternative reading is any number greater than zero that's not Infinity, but at that point an implementation conforms if its max length is 1, which obviously would be silly.

To me, finite is just to be taken in the common mathematical sense of the term; in particular you could theoretically have a string of length 10^1. But yes, it would be reasonable to restrict oneself to strings of length at most 2^52, so that `string.length` could always return an exact answer.

—Claude

However, Chrome 40 and Opera 26-27 have a limit of `0xFFFFFF0` (`2**28 - 2**4`), Firefox 35 and IE 9-11 all have a limit of `0xFFFFFFF` (`2**28 - 1`), and Safari 8 has `0x7FFFFFFF` (`2**31 - 1`). There are many more browsers I haven't tested, of course, but it'd be interesting to know how widely these numbers deviate.

1) Should an engine's max string length be exposed, like `Number.MAX_VALUE`, as `String.MAX_LENGTH`? This will help, for example, my `String#repeat` polyfill throw an earlier `RangeError` rather than having to try to build a string of that length.
2) Should the spec require a minimum maximum string length, or at least be more specific in how it defines finite?
Re: Q: Lonely surrogates and unicode regexps
On Wed, Jan 28, 2015 at 11:36 AM, Marja Hölttä ma...@chromium.org wrote:

The ES6 unicode regexp spec is not very clear regarding what should happen if the regexp or the matched string contains lonely surrogates (a lead surrogate without a trail, or a trail without a lead). For example, for the . operator, the relevant parts of the spec speak about characters:

Just a bit of terminology. The term character is overloaded, so Unicode provides the unambiguous term code point. For example, U+0378 is not (currently) an encoded character according to Unicode, but it would certainly be a terrible idea to disregard it, or not match it. It is a reserved code point that may be assigned as an encoded character in the future. So both U+D83D and U+0378 are not characters. If an ES spec uses the term character instead of code point, then at some point in the text it needs to disambiguate what is meant.

As to how this should be handled in regular expressions, I'd suggest looking at Java's approach.

Mark
https://google.com/+MarkDavis
*— Il meglio è l’inimico del bene —*
Re: Q: Lonely surrogates and unicode regexps
Some interesting questions here. 1 - What is a character? Is it a Unicode Code Point? 2 - Should we be able to match all possible JS Strings? 3 - Should we be able to match all possible Unicode Strings? 4 - What do we do if there is a character in a String we cannot match? 5 - Do unmatchable characters match . ? 6 - Are subsections of unmatchable strings matchable if they contain only matchable characters? It is important to remember in these discussions that the Unicode specification allows strings which contain unmatched surrogates. Do we want regular expressions that can't match some Unicode strings? Do we extend the regexp syntax to have a symbol which matches an unmatched surrogate? How about reserved code points? What happens when they become assigned? On 28 January 2015 at 05:36, Marja Hölttä ma...@chromium.org wrote: Hello es-discuss, TL;DR: /foo.bar/u.test(“foo\uD83Dbar”) == ? The ES6 unicode regexp spec is not very clear regarding what should happen if the regexp or the matched string contains lonely surrogates (a lead surrogate without a trail, or a trail without a lead). For example, for the . operator, the relevant parts of the spec speak about characters: https://people.mozilla.org/~jorendorff/es6-draft.html#sec-atom https://people.mozilla.org/~jorendorff/es6-draft.html#sec-runtime-semantics-charactersetmatcher-abstract-operation https://people.mozilla.org/~jorendorff/es6-draft.html#sec-runtime-semantics-canonicalize-abstract-operation E.g., “Let A be the set of all *characters* except LineTerminator.” “Let ch be the *character* Input[e].” But is a lonely surrogate a character? According to the Unicode standard, it’s not. If it's not, what will ch be if the input string contains a lonely surrogate in the relevant position? Q1: Are lonely surrogates allowed in /u regexps? E.g., /foo\uD83D/u; (note lonely lead surrogate), should this be allowed? Will it match a lead surrogate inside a surrogate pair? 
Suggestion: we shouldn't allow lonely surrogates in /u regexps. If users actually want to match lonely surrogates (e.g., to check for them or remove them) then they can use non-/u regexps. The regexp syntax treats a lonely surrogate as a normal unicode escape, and the rules say e.g., The production RegExpUnicodeEscapeSequence :: u Hex4Digits evaluates as follows: Return the character whose code is the SV of Hex4Digits. - it's also unclear what this means if no valid character has this code. Q2: If the string contains a lonely surrogate, what should it match? Should it match .? Should it match [^a] ? (Or is it undefined behavior?) Test cases: /foo.bar/u.test(foo\uD83Dbar) == ? /foo.bar/u.test(foo\uDC00bar) == ? /foo[^a]bar/u.test(foo\uD83Dbar) == ? /foo[^a]bar/u.test(foo\uDC00bar) == ? /foo/u.test(bar\uD83Dbarfoo) == ? /foo/u.test(bar\uDC00barfoo) == ? /foo(.*)bar\1/u.test(foo\uD834bar\uD834\uDC00) == ? // Should the backreference be allowed to match the lead surrogate of a surrogate pair? /^(.+)\1$/u.test(\uDC00foobar\uD83D\uDC00foobar\uD83D) == ?? // Should we allow splitting the surrogate pair like this? Suggestion: a lonely surrogate should not be a character and it should not match . or [^a] etc. However, a lonely surrogate in the input string shouldn't prevent some other part of the string from matching. If a lonely surrogate is treated as a character, the matching rule for . gets complicated and difficult / slow to implement: . should not match individual surrogates inside a surrogate pair, but if it has to match a lonely surrogate, we'll end up needing lookahead and lookbehind logic to implement that behavior. 
For example, the current version of Mathias’s ES6 Unicode regular expression transpiler ( https://mothereff.in/regexpu ) converts /a.b/u into /a(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\u]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])b/ and afaics it’s not yet fully consistent wrt lonely surrogates, so, a consistent implementation is going to be more complex than this. If we convert the string into UC-32 before matching, then the lonely surrogate is a character behavior gets easier to implement, but we wouldn't want to be forced to do that. The intention behind the ES6 spec seems to be that strings can / should still be stored as UC-16. Converting strings to UC-32 before matching with /u regexps would require an additional pass over the string which we'd want to avoid, and converting only when strictly needed for the lonely surrogate is a character implementation adds complexity. E.g., with some regexps we don't need to scan the whole input string to find a match, and also most input strings, even for /u regexps, probably won't contain surrogates (to find that out we'd also need to scan the whole string, or some logic to fall back to UC-32 matching when we see a surrogate). BR, Marja
RE: Maximum String length
From: es-discuss [mailto:es-discuss-boun...@mozilla.org] On Behalf Of Jordan Harband Strings can't possibly have a length larger than Number.MAX_SAFE_INTEGER - otherwise, you'd be able to have a string whose length is not a number representable in JavaScript. So? That's a bit inconvenient, but no reason to argue that such a string can't exist. ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
Re: @@toStringTag spoofing for null and undefined
On Jan 28, 2015, at 4:40 PM, John-David Dalton john.david.dal...@gmail.com wrote: Kind of a bummer. The isTypedArray example from https://esdiscuss.org/topic/tostringtag-spoofing-for-null-and-undefined#content-59 is incorrect. Is there an updated reference somewhere? The toStringTag result is handy because it allows checking against several tags at once without having to invoke multiple functions each with their own try-catch and all that perf baggage.

How is it incorrect? Are you referring to the fact that both typed arrays and DataView objects have a [[ViewedArrayBuffer]] internal slot? If so, I think this is a specification bug that I should fix.

Allen
Re: @@toStringTag spoofing for null and undefined
Primary issue is in isTypedArray(a): Uin32Array.prototype.buffer.call(a); Besides the typos, accessing .buffer throws in at least Chrome and Firefox. Then .buffer is an object so if it doesn't throw there's no .call to execute.

-JDD
Re: Maximum String length
Strings can't possibly have a length larger than Number.MAX_SAFE_INTEGER - otherwise, you'd be able to have a string whose length is not a number representable in JavaScript. So, at the least, I think it would make sense to define a maximum string length as Number.MAX_SAFE_INTEGER, even if that provides no guarantees that strings of that length will work (ie, OOM errors etc are fine), whether it's exposed on String or not. It might also be nice if the spec included a non-normative note that suggested a lower bound for a maximum string length (where strings are guaranteed to work), so that at least there's a guideline. Thoughts?
Re: @@toStringTag spoofing for null and undefined
To summarize the discussion at today's TC39 meeting: Given that the style of checks that Allen proposed ( https://esdiscuss.org/topic/tostringtag-spoofing-for-null-and-undefined#content-59 ) (using non-side-effecty non-generic methods that rely on internal slots, in a try/catch) is indeed reliable in ES3, and will continue to be reliable in ES6, any security-conscious code should update itself to use these kinds of checks rather than an Object.prototype.toString.call check. v8 (and any other implementations that are working on @@toStringTag) will leave Symbol.toStringTag behind a flag for a full two months, to give the relevant code time to release updates. In addition, anybody who modifies a builtin so that, say, a Boolean reports itself as a Number, surely intends the effects of this change, and so there is no concern about them. In accordance with this, step 17b of https://people.mozilla.org/~jorendorff/es6-draft.html#sec-object.prototype.tostring will be removed - if a developer wants to make a non-builtin value masquerade as a builtin, they similarly are intending those effects. I've updated and/or released the following npm packages to remain resilient with respect to this change in case anyone wants some specific examples of how to implement this: - https://www.npmjs.com/package/is-equal - https://www.npmjs.com/package/is-date-object - https://www.npmjs.com/package/is-number-object - https://www.npmjs.com/package/is-regex - https://www.npmjs.com/package/is-symbol In addition, I've closed and added similar comments to the spec bug I originally filed: https://bugs.ecmascript.org/show_bug.cgi?id=3506 Thanks, everyone, for your thoughts and time! - Jordan On Sat, Jan 24, 2015 at 2:59 PM, Mark Miller erig...@gmail.com wrote: Put better, the spec requires that Object.freeze(Object.prototype) works. On Sat, Jan 24, 2015 at 2:57 PM, Mark Miller erig...@gmail.com wrote: On Sat, Jan 24, 2015 at 2:42 PM, Isiah Meadows impinb...@gmail.com wrote: From: Mark S. 
Miller erig...@google.com To: Gary Guo nbdd0...@hotmail.com Cc: es-discuss@mozilla.org es-discuss@mozilla.org Date: Sat, 24 Jan 2015 07:11:35 -0800 Subject: Re: @@toStringTag spoofing for null and undefined Of course it can, by tamper proofing (essentially, freezing) Object.prototype. None of these protections are relevant anyway in an environment in which the primordials are not locked down. Yeah, pretty much. That proverbial inch was given a long time ago. And the proverbial mile taken. And I highly doubt the spec is going to require `Object.freeze(Object.prototype)`, Of course not. The key is the spec allows it. SES makes use of that. since that prohibits future polyfills and prolyfills of the Object prototype. Also, you could always straight up overwrite it, but that's even harder to protect against. (And how many cases do you know of literally overwriting built-in prototypes?) Or, to throw out an analog to Java, it is perfectly possible to call or even override a private method through reflection. JavaScript simply has more accessible reflection, more often useful since it's a more dynamic prototype-based OO language as opposed to a stricter class-based language. On Sat, Jan 24, 2015 at 6:11 AM, Gary Guo nbdd0...@hotmail.com wrote: Now I have a tendency to support the suggestion that cuts the anti-spoofing part. If a coder wants to make an object and pretend it's a built-in, let it be. The anti-spoofing algorithm could not prevent this case:

```
Object.prototype.toString = function () {
  return '[object I_Can_Be_Anything]';
};
```

Or this:

```js
function handler() {
  throw new Error('No prototype for you!');
}
Object.defineProperty(Object, 'prototype', {
  get: handler,
  set: handler,
  enumerable: true
});
```

Methinks this isn't going to get fixed.
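The try/catch style of check summarized at the top of this message can be sketched roughly as follows. This is only an illustration, not the code of the npm packages listed above; the name `isDateObject` is hypothetical, and the spoofing demonstration assumes an engine that implements `Symbol.toStringTag` without the step 17b protection.

```javascript
// Hypothetical helper: a brand check for Date objects that does not rely on
// Object.prototype.toString. Date.prototype.getDay is non-generic: it throws
// a TypeError unless its receiver is a genuine Date object.
function isDateObject(value) {
  if (typeof value !== 'object' || value === null) {
    return false;
  }
  try {
    Date.prototype.getDay.call(value);
    return true;
  } catch (e) {
    return false;
  }
}

// An ordinary object that masquerades as a Date via @@toStringTag.
var fake = {};
fake[Symbol.toStringTag] = 'Date';

console.log(Object.prototype.toString.call(fake)); // the tag check is fooled
console.log(isDateObject(fake));                   // the slot-based check is not
console.log(isDateObject(new Date()));
```

The trade-off raised later in the thread still applies: each kind of value needs its own probe and its own try/catch, whereas a tag string can be compared against several candidates at once.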
Re: @@toStringTag spoofing for null and undefined
Kind of a bummer. The isTypedArray example from https://esdiscuss.org/topic/tostringtag-spoofing-for-null-and-undefined#content-59 is incorrect. Is there an updated reference somewhere? The toStringTag result is handy because it allows checking against several tags at once without having to invoke multiple functions each with their own try-catch and all that perf baggage.

- JDD
Re: Maximum String length
I suppose we could change the spec, but https://people.mozilla.org/~jorendorff/es6-draft.html#sec-ecmascript-language-types-string-type requires that "The length of a String is the number of elements (i.e., 16-bit values) within it." - if the number can't be represented, then it seems that requirement can't be satisfied. I'm sure one can come up with a counterintuitive reading of the spec, but is that a realistic interpretation of it?
Re: @@toStringTag spoofing for null and undefined
On Jan 28, 2015, at 5:03 PM, John-David Dalton john.david.dal...@gmail.com wrote: Primary issue is in isTypedArray(a): Uin32Array.prototype.buffer.call(a); Besides the typos, accessing .buffer throws in at least Chrome and Firefox. Then .buffer is an object so if it doesn't throw there's no .call to execute.

The ES6 definition of %TypedArray%.prototype.buffer:

%TypedArray%.prototype.buffer is an accessor property whose set accessor function is undefined. Its get accessor function performs the following steps:
1. Let O be the this value.
2. If Type(O) is not Object, throw a TypeError exception.
3. If O does not have a [[ViewedArrayBuffer]] internal slot throw a TypeError exception.
4. Let buffer be the value of O’s [[ViewedArrayBuffer]] internal slot.
5. Return buffer.

ES6 expects buffer to be implemented as an accessor property. That means that the probe in my test should be:

Object.getOwnPropertyDescriptor(Uint32Array.prototype.__proto__, 'buffer').get.call(a);

Allen
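Put together, a check built on that accessor might look like the sketch below. It assumes an engine where `buffer` really is an accessor on %TypedArray%.prototype, as the ES6 draft expects; the variable names are illustrative, and the function is not the original example from the linked thread.

```javascript
// %TypedArray%.prototype is the shared prototype of every typed array
// prototype, so it can be reached from any concrete one.
var typedArrayProto = Object.getPrototypeOf(Uint32Array.prototype);
var bufferGetter = Object.getOwnPropertyDescriptor(typedArrayProto, 'buffer').get;

function isTypedArray(a) {
  try {
    // The getter throws a TypeError for non-objects and for objects that
    // lack the required internal slot.
    bufferGetter.call(a);
    return true;
  } catch (e) {
    return false;
  }
}

// An object that only claims to be a typed array through its tag.
var impostor = {};
impostor[Symbol.toStringTag] = 'Uint32Array';

console.log(isTypedArray(new Uint32Array(4)));  // genuine typed array
console.log(isTypedArray(new Float64Array(1))); // genuine typed array
console.log(isTypedArray([1, 2, 3]));           // plain array
console.log(isTypedArray(impostor));            // tag-only impostor
```

Whether a DataView passes this probe depends on how the [[ViewedArrayBuffer]] question raised earlier in the thread is resolved, so the sketch makes no claim about that case.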
Re: Maximum String length
On Wed, Jan 28, 2015 at 5:44 AM, Andreas Rossberg rossb...@google.com wrote: On 28 January 2015 at 13:14, Claude Pache claude.pa...@gmail.com wrote: To me, finite is just to be taken in the common mathematical sense of the term; in particular you could have theoretically a string of length 10^1. But yes, it would be reasonable to restrict oneself to strings of length at most 2^52, so that `string.length` could always return an exact answer.

To me it would be reasonable to restrict oneself to much shorter strings, since no existing machine has the memory to represent a string of length 2^52, nor will any in the foreseeable future. ;)

That's just four petabytes. If present trends...

VMs can always run into out-of-memory conditions. In general, there is no way to predict those. Even strings shorter than the hard-coded length limit might cause you to go OOM. So providing reflection on a constant like that might do little but give a false sense of safety.

Yes, OOM is always possible earlier and we can't set any limits on that. I agree that we shouldn't provide anything like String.MAX_LENGTH. But I also don't see how we could pleasantly support strings above 2^53. Array indexes are limited to 2^31 or so, and many integer operations truncate to that (,|), and strings support [] indexing, so it may make sense to agree on one of those as an *upper bound* -- you may not support strings longer than that.

-- Cheers, --MarkM
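The numeric limits mentioned in this exchange can be checked directly. The following is a small sketch of the arithmetic only; it says nothing about what string lengths any engine actually supports. Note that the exact cap on array lengths is 2^32 - 1, which refines the "2^31 or so" above.

```javascript
// Above 2^53, Number values can no longer represent every integer exactly.
var twoTo53 = Math.pow(2, 53);
console.log(Number.MAX_SAFE_INTEGER === twoTo53 - 1); // true
console.log(twoTo53 + 1 === twoTo53);                 // true: 2^53 + 1 rounds to 2^53

// Bitwise operators coerce their operands to signed 32-bit integers.
console.log(Math.pow(2, 31) | 0);       // -2147483648
console.log((Math.pow(2, 32) + 5) | 0); // 5

// Array lengths must be below 2^32; anything larger is rejected outright.
var rejected = false;
try {
  new Array(Math.pow(2, 32));
} catch (e) {
  rejected = e instanceof RangeError;
}
console.log(rejected); // true
```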
Re: Q: Lonely surrogates and unicode regexps
On 1/28/2015 3:36 PM, Marja Hölttä wrote: Based on Ex1, looks like the input string is not read as a sequence of code points when we try to find a match for \1. So it's mostly read as a sequence of code points except when it's not. :/

Yep, back references are matched as a sequence of code units. The first link I've posted points to the relevant method in java.util.regex.Pattern. I've got no idea why it's implemented that way; for example, when you enable case-insensitive matching, back references are no longer matched as a sequence of code units:

---
int[] flags = { 0, Pattern.CASE_INSENSITIVE, Pattern.UNICODE_CASE, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE };
// Prints true, false, true, false
Arrays.stream(flags).mapToObj(f -> Pattern.compile("foo(.+)bar\\1", f))
      .map(p -> p.matcher("foo\uD834bar\uD834\uDC00").find())
      .forEach(System.out::println);
---

On Wed, Jan 28, 2015 at 3:11 PM, André Bargull andre.barg...@udo.edu wrote: On 1/28/2015 2:51 PM, André Bargull wrote:

For a reference, here's how Java (tried w/ Oracle 1.8.0_31 and openjdk 1.7.0_65) Pattern.UNICODE_CHARACTER_CLASS works: foo\uD834bar and foo\uDC00bar match ^foo[^a]bar$ and ^foo.bar$, so, generally, lonely surrogates match /./. Backreferences are allowed to consume the leading surrogate of a valid surrogate pair:

Ex1: foo\uD834bar\uD834\uDC00 matches foo(.+)bar\1

But surprisingly:

Ex2: \uDC00foobar\uD834\uDC00foobar\uD834 doesn't match ^(.+)\1$

... So Ex2 works as if the input string was converted to UTF-32 before matching, but Ex1 works as if it was def not. Idk what's the correct mental model where both Ex1 and Ex2 would make sense.

java.util.regex.Pattern matches back references by comparing (Java) chars [1], but reads patterns as a sequence of code points [2]. That should help to explain the differences between ex1 and ex2.

[1] http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l4890
[2] http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l1671

Err, the part about how patterns are read is not important here. What I should have written is that the input string is (also) read as a sequence of code points [3]. So in ex2 `\uD834\uDC00` is read as a single code point (and not split into \uD834 and \uDC00 during backtracking).

[3] http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l3773
Re: Maximum String length
On 29 Jan 2015, at 01:49, Jordan Harband ljh...@gmail.com wrote: I suppose we could change the spec, but https://people.mozilla.org/~jorendorff/es6-draft.html#sec-ecmascript-language-types-string-type requires that "The length of a String is the number of elements (i.e., 16-bit values) within it." - if the number can't be represented, then it seems that requirement can't be satisfied. I'm sure one can come up with a counterintuitive reading of the spec, but is that a realistic interpretation of it?

It's not a requirement, it's a definition. But more to the point, the length of a String is simply a nonnegative integer, not a Number value representing such an integer. Not to be confused with the value of the length property of that String, which is necessarily a Number value.

- Claude
Re: @@toStringTag spoofing for null and undefined
At the moment that throws too. Anyways it's something to hammer on a bit. Maybe Jordan can kick it around too.

Thanks, -JDD
Re: Q: Lonely surrogates and unicode regexps
On Jan 28, 2015, at 5:26 AM, Mark Davis ☕️ m...@macchiato.com wrote: I think the cleanest mental model is where UTF-16 or UTF-8 strings are interpreted as if they were transformed into UTF-32.

This is exactly the approach used in the ES6 spec (except that it doesn’t deal with UTF-8).

While that is generally feasible, it often represents a cost in performance which is not acceptable in practice. So you see various approaches that involve some deviation from that mental model.

While ES6 uses this approach in its specification, implementations are free to use any implementation technique that produces the same result.

Allen
Re: Q: Lonely surrogates and unicode regexps
Cool, thanks for the clarifications! To make sure: as per the intended semantics, we never allow splitting a valid surrogate pair (= matching only one of the surrogates but not the other), and thus we'll differ from the Java implementation here: /foo(.+)bar\1/u.test("foo\uD834bar\uD834\uDC00"); we say false, Java says true. (In addition, /^(.+)\1$/u.test("\uDC00foobar\uD834\uDC00foobar\uD834") == false.)
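The two cases above can be tried directly in an engine that implements the /u flag. The sketch below also runs the same patterns without /u for contrast; those non-/u results are not stated in the thread but follow from matching on 16-bit code units.

```javascript
var ex1Input = 'foo\uD834bar\uD834\uDC00';
var ex2Input = '\uDC00foobar\uD834\uDC00foobar\uD834';

// With /u the input is a sequence of code points, so the backreference
// cannot match only the lead half of the pair \uD834\uDC00.
var ex1Unicode = /foo(.+)bar\1/u.test(ex1Input);
var ex2Unicode = /^(.+)\1$/u.test(ex2Input);

// Without /u the input is a sequence of code units, and both patterns can
// match by splitting the surrogate pair.
var ex1Units = /foo(.+)bar\1/.test(ex1Input);
var ex2Units = /^(.+)\1$/.test(ex2Input);

console.log(ex1Unicode, ex2Unicode); // false false
console.log(ex1Units, ex2Units);     // true true
```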
Re: Q: Lonely surrogates and unicode regexps
On Jan 28, 2015, at 2:36 AM, Marja Hölttä ma...@chromium.org wrote: Hello es-discuss, TL;DR: /foo.bar/u.test("foo\uD83Dbar") == ? The ES6 unicode regexp spec is not very clear regarding what should happen if the regexp or the matched string contains lonely surrogates (a lead surrogate without a trail, or a trail without a lead). For example, for the . operator, the relevant parts of the spec speak about characters:

TL;DR: in a unicode regexp lonely surrogates are considered to be a single “character”.

As André has already covered, “character” has a very specific meaning within the context of the ES6 RegExp specification, given in the second paragraph of http://people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern-semantics . The specification uses the same set of algorithms to describe both BMP (i.e., 16-bit elements) and unicode (i.e., 32-bit elements) patterns and matching semantics. “Character” is used in those algorithms to refer to a single element of the mode that is currently in operation.

I think the ambiguity you find is in step 2.1 of http://people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern :

2. Return an internal closure that takes two arguments, a String str and an integer index, and performs the following:
1. If Unicode is true, let Input be a List consisting of the sequence of code points of str interpreted as a UTF-16 encoded Unicode string. Otherwise, let Input be a List consisting of the sequence of code units that are the elements of str. Input will be used throughout the algorithms in 21.2.2. Each element of Input is considered to be a character.

Apparently I don’t have an adequate definition of “interpreted as a UTF-16 encoded Unicode string”. If you submit a bug to bugs.ecmascript.org I will provide one in the next spec revision.

The intended semantics is that, in ascending string index order:
- Each valid UTF-16 surrogate pair is interpreted as a single code point that is the UTF-16 encoded value.
- Each “lonely” surrogate is interpreted as a single code point that is the surrogate value.
- Every other 16-bit code unit is interpreted as a single code point.

Allen
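The intended decoding described in Allen's reply can be sketched as a small function. The name `toCodePoints` is hypothetical, and the code illustrates the rule itself rather than how any engine implements /u matching.

```javascript
// Decode a JS string (a sequence of 16-bit code units) into code points:
// a valid surrogate pair becomes one code point, and any other code unit,
// including a lonely surrogate, becomes a code point of the same value.
function toCodePoints(str) {
  var out = [];
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < str.length) {
      var next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        out.push((unit - 0xD800) * 0x400 + (next - 0xDC00) + 0x10000);
        i++; // the trail surrogate has been consumed
        continue;
      }
    }
    out.push(unit);
  }
  return out;
}

function hex(cp) { return cp.toString(16); }

console.log(toCodePoints('a\uD834\uDC00b').map(hex)); // pair decoded as 1d000
console.log(toCodePoints('foo\uD834bar').map(hex));   // lone lead kept as d834
console.log(toCodePoints('\uDC00\uD834').map(hex));   // trail then lead: no pair
```

The built-in string iterator applies the same rule, so `Array.from(str)` yields one element per code point, with lonely surrogates kept as single elements.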
RE: Figuring out the behavior of WindowProxy in the face of non-configurable properties
From: Mark S. Miller [mailto:erig...@google.com]

> On Tue, Jan 27, 2015 at 5:53 PM, Boris Zbarsky bzbar...@mit.edu wrote:
>
>> I'd like to understand better the suggestion here, because I'm not sure I'm entirely following it. Specifically, I'd like to understand it in terms of the internal methods defined by https://github.com/domenic/window-proxy-spec. Presumably you're proposing that we keep all of that as-is except for [[DefineOwnProperty]], right?
>>
>> For [[DefineOwnProperty]], are we basically talking about changing step 1 to:
>>
>> 1) If the [[Configurable]] field of Desc is present and Desc.[[Configurable]] is false, then throw a TypeError exception.
>>
>> while keeping everything else as-is,
>
> Exactly correct. What I didn't realize until reading your reply is that this is all that's necessary -- that it successfully covers all the cases I was thinking about without any further case division.

I'm having a bit of trouble understanding how this maps to the solution described in your previous message, Mark. Your "What I didn't realize until reading your reply is that this is all that's necessary" indicates I'm probably just missing something, so help appreciated.

My question is, what happens if Desc.[[Configurable]] is not present, and P does not already exist on W? By my reading, we then fall through to calling the [[DefineOwnProperty]] internal method of W with arguments P and Desc. Assuming W's [[DefineOwnProperty]] is that of an ordinary object, I believe that takes us through OrdinaryDefineOwnProperty(W, P, Desc). Since P does not exist on W, and W is extensible, that takes us to ValidateAndApplyPropertyDescriptor(O, P, true, Desc, undefined). Then according to step 2.c, "If the value of an attribute field of Desc is absent, the attribute of the newly created property is set to its default value." The default value is false, right? So won't this try to define a non-configurable property on W?

I would have thought the modification needed to be more like:

[[DefineOwnProperty]] (P, Desc)

1. If Desc.[[Configurable]] is not present, set Desc.[[Configurable]] to true.
2. If Desc.[[Configurable]] is false, then throw a TypeError exception.
3. Return the result of calling the [[DefineOwnProperty]] internal method of W with arguments P and Desc.

(Here I have inserted step 1, but steps 2 and 3 are unchanged from the previous incarnation.)

___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss
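The defaulting question raised in this message can be sketched in plain JavaScript. This is only a rough model, not the draft spec text: `w` is a hypothetical plain object standing in for the Window W, and `defineWithDefault` is an illustrative name for the three-step variant proposed above.

```javascript
// Hypothetical stand-in for the underlying Window W (a plain, extensible object).
const w = {};

// On a newly created property, an absent `configurable` field defaults to false.
Object.defineProperty(w, "a", { value: 1 });
const defaulted = Object.getOwnPropertyDescriptor(w, "a").configurable; // false

// Illustrative model of the three-step variant (an assumption, not a real API).
function defineWithDefault(target, p, desc) {
  const d = Object.assign({}, desc);
  if (!("configurable" in d)) d.configurable = true; // step 1: default to true
  if (d.configurable === false) {                     // step 2: refuse explicit false
    throw new TypeError("cannot define non-configurable property " + String(p));
  }
  return Reflect.defineProperty(target, p, d);        // step 3: forward to W
}

defineWithDefault(w, "b", { value: 2 });
const forced = Object.getOwnPropertyDescriptor(w, "b").configurable; // true
```

Under this variant the property on W itself ends up configurable, whereas forwarding the descriptor untouched leaves it non-configurable on W.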
Re: Q: Lonely surrogates and unicode regexps
On Jan 28, 2015, at 4:54 AM, Wes Garland w...@page.ca wrote:

> Some interesting questions here.

These aren't discussion points. These are all things that must have answers that are directly derivable from the ES6 spec. If, after developing an adequate understanding of that part of the specification, you can't find the answer to these questions, then there is probably something that needs to be clarified in the spec.

> 1 - What is a character? Is it a Unicode Code Point?

Defined in paragraph 2 of http://people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern-semantics

> 2 - Should we be able to match all possible JS Strings?

Yes, there is nothing in the algorithms that restricts JS String values.

> 3 - Should we be able to match all possible Unicode Strings?

Yes, subject to what you mean by "Unicode Strings", as within JS Strings supplemental code points must be UTF-16 encoded.

> 4 - What do we do if there is a character in a String we cannot match?

RegExp.exec returns null if a string cannot be matched by a pattern.

> 5 - Do unmatchable characters match . ?

There is no concept in the specification of an "unmatchable" character.

> 6 - Are subsections of unmatchable strings matchable if they contain only matchable characters?

There is no concept in the specification of an "unmatchable" character.

> It is important to remember in these discussions that the Unicode specification allows strings which contain unmatched surrogates.

And ES6 //u patterns can match them.

> Do we want regular expressions that can't match some Unicode strings?

No, the ES6 specification can match all possible strings.

> Do we extend the regexp syntax to have a symbol which matches an unmatched surrogate?

We already have it: \u{D83D}

> How about reserved code points? What happens when they become assigned?

Other than the initial decoding of valid surrogate pairs into 32-bit code points, the ES6 //u RegExp spec applies no semantics to any code points in the string that is being matched.

Allen
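The answers above can be checked with a few small patterns. This is a minimal sketch; the variable names are illustrative, and U+1F600 is just an arbitrary supplementary code point chosen for the example.

```javascript
const loneLead = "\uD83D";   // lead surrogate with no trail
const pair = "\uD83D\uDE00"; // one supplementary code point, U+1F600

// A lone surrogate is a character of its own under /u.
const dotMatchesLone = /^.$/u.test(loneLead);           // true
const escapeMatchesLone = /^\u{D83D}$/u.test(loneLead); // true

// A valid surrogate pair is decoded into a single character.
const dotMatchesPair = /^.$/u.test(pair);               // true

// Under /u the lead surrogate escape does not match half of a valid pair.
const escapeSplitsPair = /\u{D83D}/u.test(pair);        // false

// Without /u, matching works on code units, so the pair can be split.
const legacySplitsPair = /\uD83D/.test(pair);           // true
```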
Re: Q: Lonely surrogates and unicode regexps
> Cool, thanks for clarifications! To make sure, as per the intended semantics, we never allow splitting a valid surrogate pair (= matching only one of the surrogates but not the other), and thus we'll differ from the Java implementation here: /foo(.+)bar\1/u.test("foo\uD834bar\uD834\uDC00"); we say false, Java says true.

Correct, the captures List entry is [\uD834], so when performing 21.2.2.9 AtomEscape, \uD834 is matched against \uD834\uDC00 in step 8, which results in a failure state.

> (In addition, /^(.+)\1$/u.test("\uDC00foobar\uD834\uDC00foobar\uD834") == false.)

Yes, this expression also returns false.
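The two test cases can be run side by side with and without the `u` flag. This is a small sketch with illustrative variable names; the string quotes are restored on the assumption that the archive dropped them.

```javascript
const s1 = "foo\uD834bar\uD834\uDC00";
const s2 = "\uDC00foobar\uD834\uDC00foobar\uD834";

// With /u the backreference \1 holds the lone \uD834 and may not match
// the lead half of the valid pair \uD834\uDC00.
const unicode1 = /foo(.+)bar\1/u.test(s1); // false
const legacy1 = /foo(.+)bar\1/.test(s1);   // true: code units, the pair is split

// With /u the capture cannot end between \uD834 and \uDC00.
const unicode2 = /^(.+)\1$/u.test(s2);     // false
const legacy2 = /^(.+)\1$/.test(s2);       // true: two identical 8-unit halves
```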
Re: Figuring out the behavior of WindowProxy in the face of non-configurable properties
On Wed, Jan 28, 2015 at 8:51 AM, Domenic Denicola d...@domenic.me wrote:

> From: Mark S. Miller [mailto:erig...@google.com]
>
>> On Tue, Jan 27, 2015 at 5:53 PM, Boris Zbarsky bzbar...@mit.edu wrote:
>>
>>> I'd like to understand better the suggestion here, because I'm not sure I'm entirely following it. Specifically, I'd like to understand it in terms of the internal methods defined by https://github.com/domenic/window-proxy-spec. Presumably you're proposing that we keep all of that as-is except for [[DefineOwnProperty]], right?
>>>
>>> For [[DefineOwnProperty]], are we basically talking about changing step 1 to:
>>>
>>> 1) If the [[Configurable]] field of Desc is present and Desc.[[Configurable]] is false, then throw a TypeError exception.
>>>
>>> while keeping everything else as-is,
>>
>> Exactly correct. What I didn't realize until reading your reply is that this is all that's necessary -- that it successfully covers all the cases I was thinking about without any further case division.
>
> I'm having a bit of trouble understanding how this maps to the solution described in your previous message, Mark. Your "What I didn't realize until reading your reply is that this is all that's necessary" indicates I'm probably just missing something, so help appreciated.
>
> My question is, what happens if Desc.[[Configurable]] is not present, and P does not already exist on W? By my reading, we then fall through to calling the [[DefineOwnProperty]] internal method of W with arguments P and Desc. Assuming W's [[DefineOwnProperty]] is that of an ordinary object, I believe that takes us through OrdinaryDefineOwnProperty(W, P, Desc). Since P does not exist on W, and W is extensible, that takes us to ValidateAndApplyPropertyDescriptor(O, P, true, Desc, undefined). Then according to step 2.c, "If the value of an attribute field of Desc is absent, the attribute of the newly created property is set to its default value." The default value is false, right? So won't this try to define a non-configurable property on W?

In this situation, it will try and succeed. This more closely obeys the intent in the original code (e.g., the comment in the jQuery code), since it creates a non-configurable property on the *Window* W. It does not violate any invariant, since all that's observable on the *WindowProxy* (given the rest of your draft spec, which remains unchanged) is a configurable property of the same name.

> I would have thought the modification needed to be more like:
>
> [[DefineOwnProperty]] (P, Desc)
>
> 1. If Desc.[[Configurable]] is not present, set Desc.[[Configurable]] to true.
> 2. If Desc.[[Configurable]] is false, then throw a TypeError exception.
> 3. Return the result of calling the [[DefineOwnProperty]] internal method of W with arguments P and Desc.
>
> (Here I have inserted step 1, but steps 2 and 3 are unchanged from the previous incarnation.)

--
Cheers,
--MarkM
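The split between what is stored on the Window and what is observable through the WindowProxy can be modelled roughly as follows. `makeWindowProxyModel` and its two methods are hypothetical names; the assumption that the WindowProxy always reports `configurable: true` is taken from the description above, not from the draft spec text.

```javascript
// Illustrative model only: `innerWindow` stands in for the Window W.
function makeWindowProxyModel(innerWindow) {
  return {
    // Reject only an explicit configurable:false, otherwise forward to W.
    defineOwnProperty(p, desc) {
      if ("configurable" in desc && desc.configurable === false) {
        throw new TypeError("cannot define non-configurable property " + String(p));
      }
      return Reflect.defineProperty(innerWindow, p, desc);
    },
    // Assumption: the proxy never reports a property as non-configurable.
    getOwnProperty(p) {
      const d = Reflect.getOwnPropertyDescriptor(innerWindow, p);
      return d === undefined ? undefined : Object.assign({}, d, { configurable: true });
    },
  };
}

const inner = {};
const model = makeWindowProxyModel(inner);
model.defineOwnProperty("x", { value: 1 }); // configurable absent: forwarded as-is

const onWindow = Object.getOwnPropertyDescriptor(inner, "x").configurable; // false
const onProxy = model.getOwnProperty("x").configurable;                    // true
```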
Re: Figuring out the behavior of WindowProxy in the face of non-configurable properties
On Wed, Jan 28, 2015 at 11:08 AM, Domenic Denicola d...@domenic.me wrote:

> From: Mark S. Miller [mailto:erig...@google.com]
>
>> In this situation, it will try and succeed. This more closely obeys the intent in the original code (e.g., the comment in the jQuery code), since it creates a non-configurable property on the *Window* W. It does not violate any invariant, since all that's observable on the *WindowProxy* (given the rest of your draft spec, which remains unchanged) is a configurable property of the same name.
>
> Ah, I see! So then another non-intuitive (but invariant-preserving) consequence would be:
>
> ```js
> Object.defineProperty(window, "prop", { value: "foo" });
>
> var propDesc = Object.getOwnPropertyDescriptor(window, "prop");
>
> if (propDesc.configurable) {
>   Object.defineProperty(window, "prop", { value: "bar" });
>
>   // this will fail, even though the property is supposedly configurable,
>   // since when it forwards from the WindowProxy `window` to the underlying
>   // Window object, the Window's [[DefineOwnProperty]] fails.
> }
> ```
>
> Am I getting this right?

Exactly, yes. And again, if window is an ES6 proxy rather than a WindowProxy, it could also cause this behavior, so it doesn't create any situation which is not otherwise possible.

The key points are:

1) The throw does (arguably) better obey the code's intent, since the property mostly acts like a non-configurable property until the window is navigated.

2) If a window navigation happens between your first step and your second, the second step may well succeed, which is what we (arguably) want, but which would have been prohibited if propDesc.configurable evaluated to true.

--
Cheers,
--MarkM
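The remark that an ES6 proxy could produce the same behavior can be sketched with a `Proxy` over an empty shadow target that forwards to a swappable inner object. Everything here is a hypothetical model: `currentWindow`, `navigate`, and `windowProxy` are illustrative names, and real navigation involves far more than replacing one object.

```javascript
let currentWindow = {};   // stand-in for the current Window
const shadow = {};        // empty proxy target, so proxy invariants stay satisfiable

const windowProxy = new Proxy(shadow, {
  defineProperty(_target, p, desc) {
    // Refuse an explicit configurable:false; returning false makes
    // Object.defineProperty throw a TypeError.
    if (desc.configurable === false) return false;
    return Reflect.defineProperty(currentWindow, p, desc);
  },
  getOwnPropertyDescriptor(_target, p) {
    const d = Reflect.getOwnPropertyDescriptor(currentWindow, p);
    if (d === undefined) return undefined;
    d.configurable = true; // always report the property as configurable
    return d;
  },
});

// Hypothetical navigation: the proxy now forwards to a fresh Window.
function navigate() {
  currentWindow = {};
}

const first = Reflect.defineProperty(windowProxy, "prop", { value: "foo" });
const reported = Object.getOwnPropertyDescriptor(windowProxy, "prop").configurable;
const second = Reflect.defineProperty(windowProxy, "prop", { value: "bar" });
navigate();
const afterNavigation = Reflect.defineProperty(windowProxy, "prop", { value: "bar" });
```

Here `first` is true, `reported` is true, `second` is false because the inner object holds a non-configurable, non-writable property, and `afterNavigation` is true because the fresh inner object has no such property.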
RE: Figuring out the behavior of WindowProxy in the face of non-configurable properties
From: Mark S. Miller [mailto:erig...@google.com]

> In this situation, it will try and succeed. This more closely obeys the intent in the original code (e.g., the comment in the jQuery code), since it creates a non-configurable property on the *Window* W. It does not violate any invariant, since all that's observable on the *WindowProxy* (given the rest of your draft spec, which remains unchanged) is a configurable property of the same name.

Ah, I see! So then another non-intuitive (but invariant-preserving) consequence would be:

```js
Object.defineProperty(window, "prop", { value: "foo" });

var propDesc = Object.getOwnPropertyDescriptor(window, "prop");

if (propDesc.configurable) {
  Object.defineProperty(window, "prop", { value: "bar" });

  // this will fail, even though the property is supposedly configurable,
  // since when it forwards from the WindowProxy `window` to the underlying
  // Window object, the Window's [[DefineOwnProperty]] fails.
}
```

Am I getting this right?
Re: Figuring out the behavior of WindowProxy in the face of non-configurable properties
Mark S. Miller wrote:

> Exactly correct. What I didn't realize until reading your reply is that this is all that's necessary -- that it successfully covers all the cases I was thinking about without any further case division.
>
> Here's another option, not clearly better or worse:
>
> [[DefineOwnProperty]] (P, Desc)
>
> 1. Let R be the result of calling the [[DefineOwnProperty]] internal method of /W/ with arguments /P/ and /Desc/.
> 2. If /Desc/.[[Configurable]] is present and *false*, then throw a *TypeError* exception.
> 3. Return R.
>
> This is exactly like your solution, but with the order of the two steps switched. Perhaps the next breakage we see will tell us which of these to choose. If both are web compatible, then we need only pick which one we like better.

I like the shorter one (filling in from cited text below, here it is in full):

[[DefineOwnProperty]] (P, Desc)

1. If /Desc/.[[Configurable]] is present and /Desc/.[[Configurable]] is *false*, then throw a *TypeError* exception.
2. Return the result of calling the [[DefineOwnProperty]] internal method of /W/ with arguments /P/ and /Desc/.

Besides being shorter, this doesn't call through to [[DOP]], which could have effects, and only then maybe-throw.

/be

>> as opposed to the behavior I'd understood we were aiming for, which was:
>>
>> 1) If the [[Configurable]] field of Desc is not present or Desc.[[Configurable]] is false, then throw a TypeError exception.
>>
>> ? If so, that's certainly a change that is much more likely to be web-compatible...
>
> Good! It certainly takes care of the one concrete breakage we know about so far.
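The observable difference between the two orderings can be sketched as below. `throwsFirst`, `forwardsFirst`, and `attempt` are illustrative names, and a plain object stands in for W; this is a model of the two step lists, not an implementation of WindowProxy.

```javascript
// Shorter variant: check first, then forward to W.
function throwsFirst(w, p, desc) {
  if ("configurable" in desc && desc.configurable === false) {
    throw new TypeError("cannot define non-configurable property " + String(p));
  }
  return Reflect.defineProperty(w, p, desc);
}

// Alternative: forward to W first, then maybe throw.
function forwardsFirst(w, p, desc) {
  const r = Reflect.defineProperty(w, p, desc);
  if ("configurable" in desc && desc.configurable === false) {
    throw new TypeError("cannot define non-configurable property " + String(p));
  }
  return r;
}

// Run one variant against a fresh stand-in for W and record what happened.
function attempt(define) {
  const w = {};
  let threw = false;
  try {
    define(w, "x", { value: 1, configurable: false });
  } catch (e) {
    threw = e instanceof TypeError;
  }
  return { threw, defined: Object.prototype.hasOwnProperty.call(w, "x") };
}

const checkedFirst = attempt(throwsFirst);     // { threw: true, defined: false }
const forwardedFirst = attempt(forwardsFirst); // { threw: true, defined: true }
```

Both variants throw, but only the forward-first ordering leaves the property defined on W, which is the kind of effect the shorter variant avoids.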