Re: Q: Lonely surrogates and unicode regexps

2015-01-31 Thread Erik Corry
I think it's problematic that this is being standardized without a single
implementation.

On Wed, Jan 28, 2015 at 11:57 AM, André Bargull andre.barg...@udo.edu
wrote:

  On Wed, Jan 28, 2015 at 11:36 AM, Marja Hölttä marja at chromium.org 
 https://mail.mozilla.org/listinfo/es-discuss wrote:

 * The ES6 unicode regexp spec is not very clear regarding what should happen
 ** if the regexp or the matched string contains lonely surrogates (a lead
 ** surrogate without a trail, or a trail without a lead). For example, for 
 the
 ** . operator, the relevant parts of the spec speak about characters:
 *
 ​Just a bit of terminology.

 The term character is overloaded, so Unicode provides the unambiguous
 term code point. For example, U+0378​ is not (currently) an encoded
 character according to Unicode, but it would certainly be a terrible idea
 to disregard it, or not match it. It is a reserved code point that may be
 assigned as an encoded character in the future. So both U+D83D and U+0378
 are not characters.

 If a ES spec uses the term character instead of code point, then at
 some point in the text it needs to disambiguate what is meant.


 character is defined in 21.2.2 Pattern Semantics [1]:

 In the context of describing the behaviour of a BMP pattern “character”
 means a single 16-bit Unicode BMP code point. In the context of describing
 the behaviour of a Unicode pattern “character” means a UTF-16 encoded code
 point.



 [1]
 https://people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern-semantics

 ___
 es-discuss mailing list
 es-discuss@mozilla.org
 https://mail.mozilla.org/listinfo/es-discuss


___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Q: Lonely surrogates and unicode regexps

2015-01-31 Thread Norbert Lindenberg

 On Jan 28, 2015, at 8:11 , Allen Wirfs-Brock al...@wirfs-brock.com wrote:
 
 
 On Jan 28, 2015, at 2:36 AM, Marja Hölttä ma...@chromium.org wrote:
 
 Hello es-discuss,
 
 TL;DR: /foo.bar/u.test(“foo\uD83Dbar”) == ?
 
 The ES6 unicode regexp spec is not very clear regarding what should happen 
 if the regexp or the matched string contains lonely surrogates (a lead 
 surrogate without a trail, or a trail without a lead). For example, for the 
 . operator, the relevant parts of the spec speak about characters:
 
 TL;DR: in a unicode regexp lonely surrogates are considered to be a single 
 “character”. 
 
 As André has already covered “character” has a very specific meaning within 
 the context of the ES6 RegExp specification in the second paragraph of  
 http://people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern-semantics . 
 The specification uses the same set of algorithms to describe both BCP (i.e., 
 16-bit elements) and unicode (i.e., 32-bit elements) patterns and matching 
 semantics.  “Character” is used in those algorithm to refer to a single 
 element of the mode that is currently operating within.
 
 I think the ambiguity you find is in step 2.1 of 
 http://people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern :
 
 2.  Return an internal closure that takes two arguments, a String str and an 
 integer index, and performs the following:
 1. If Unicode is true, let Input be a List consisting of the sequence of code 
 points of str interpreted as a UTF-16 encoded Unicode string. Otherwise, let 
 Input be a List consisting of the sequence of code units that are the 
 elements of str. Input will be used throughout the algorithms in 21.2.2. Each 
 element of Input is considered to be a character. 
 
 Apparently I don’t have an adequate definition of “interpreted as a UTF-16 
 encoded Unicode string”. If you submit a bug to bugs.emncascript.org) I will 
 provided one in the next spec. revisions.  The intended semantics is that:
In ascending string index order:
   Each valid UTF-16 surrogate pair is interpreted as a signal code point 
 that is the UTF-16 encoded value
 Each “lonely” surrogate is interpreted as  single code point that is 
 the surrogate value
 Every other 16-bit code unit is interpreted as a single code point.

That definition is in section 6.1.4:
http://people.mozilla.org/~jorendorff/es6-draft.html#sec-ecmascript-language-types-string-type

A cross-reference would be useful.

Norbert
___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Q: Lonely surrogates and unicode regexps

2015-01-31 Thread Norbert Lindenberg

 On Jan 28, 2015, at 8:30 , Allen Wirfs-Brock al...@wirfs-brock.com wrote:
 
 
 On Jan 28, 2015, at 4:54 AM, Wes Garland w...@page.ca wrote:
 

 Do we extend the regexp syntax to have a symbol which matches an unmatched 
 surrogate?
 we already have it: \u{D83D}

Or to match any unpaired surrogate: /[\u{D800}-\u{DFFF}]/u

   How about reserved code points?  What happens when they become assigned?
 Other than the initial decoding of valid surrogate pairs into 32-bit code 
 points, the ES6 //u RegExp spec. applies no semantics to any code points in 
 the string that is being matched.

There are a few places where RegExp applies Unicode semantics:

– //ui uses Unicode case folding to compare case-insensitively. If the 
comparison involves code points that are unassigned in the Unicode version 
assumed by an ECMAScript implementation and in a later version get assigned to 
characters that are case-variants of each other, then the RegExp behavior can 
change. See section 21.2.2.8.2.

– RegExp knows a few character classes: \d, \D, \s, \S, \w, \W. \d, \D, \w, \W 
are defined by character lists that cannot change, but \s and therefore \S 
could change if Unicode assigns new characters with the category “Separator, 
space”. See section 21.2.2.12.

But in general //u is defined based on code points and doesn’t care whether 
code points are assigned or reserved.

Norbert

___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Q: Lonely surrogates and unicode regexps

2015-01-31 Thread Allen Wirfs-Brock

On Jan 31, 2015, at 12:01 AM, Norbert Lindenberg wrote:
 
 That definition is in section 6.1.4:
 http://people.mozilla.org/~jorendorff/es6-draft.html#sec-ecmascript-language-types-string-type
 
 A cross-reference would be useful.
 
 Norbert
 

done

allen

___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Q: Lonely surrogates and unicode regexps

2015-01-28 Thread Mathias Bynens

 On 28 Jan 2015, at 11:36, Marja Hölttä ma...@chromium.org wrote:
 
 TL;DR: /foo.bar/u.test(“foo\uD83Dbar”) == ?
 
 The ES6 unicode regexp spec is not very clear regarding what should happen if 
 the regexp or the matched string contains lonely surrogates (a lead surrogate 
 without a trail, or a trail without a lead). For example, for the . operator, 
 the relevant parts of the spec speak about characters:
 
 https://people.mozilla.org/~jorendorff/es6-draft.html#sec-atom
 https://people.mozilla.org/~jorendorff/es6-draft.html#sec-runtime-semantics-charactersetmatcher-abstract-operation
 https://people.mozilla.org/~jorendorff/es6-draft.html#sec-runtime-semantics-canonicalize-abstract-operation
 
 E.g.,
 “Let A be the set of all *characters* except LineTerminator.”
 “Let ch be the *character* Input[e].”
 
 But is a lonely surrogate a character? According to the Unicode standard, 
 it’s not. If it's not, what will ch be if the input string contains a lonely 
 surrogate in the relevant position?
 
 Q1: Are lonely surrogates allowed in /u regexps?
 
 E.g., /foo\uD83D/u; (note lonely lead surrogate), should this be allowed? 
 Will it match a lead surrogate inside a surrogate pair?
 
 Suggestion: we shouldn't allow lonely surrogates in /u regexps.
 
 If users actually want to match lonely surrogates (e.g., to check for them or 
 remove them) then they can use non-/u regexps.

You’re proposing to define “characters” in terms of Unicode scalar values in 
the case `/u` is used. I could get behind that — it reinforces the idea that 
`/u` is like a strict mode for regular expressions.

Playing devil’s advocate, the problem is that regular expressions and strings 
go hand in hand, and there is no guarantee that JavaScript strings only consist 
of valid code points. Making `.` not match lone surrogates breaks the developer 
expectation that `(.)` matches every “part” of the string. Having to avoid `/u` 
to prevent this seems like a potentially bad thing.

 The regexp syntax treats a lonely surrogate as a normal unicode escape, and 
 the rules say e.g., The production RegExpUnicodeEscapeSequence :: u 
 Hex4Digits evaluates as follows: Return the character whose code is the SV of 
 Hex4Digits. - it's also unclear what this means if no valid character has 
 this code.
 
 Q2: If the string contains a lonely surrogate, what should it match? Should 
 it match .? Should it match [^a] ? (Or is it undefined behavior?)
 
 Test cases:
 /foo.bar/u.test(foo\uD83Dbar) == ?
 /foo.bar/u.test(foo\uDC00bar) == ?
 /foo[^a]bar/u.test(foo\uD83Dbar) == ?
 /foo[^a]bar/u.test(foo\uDC00bar) == ?
 /foo/u.test(bar\uD83Dbarfoo) == ?
 /foo/u.test(bar\uDC00barfoo) == ?
 /foo(.*)bar\1/u.test(foo\uD834bar\uD834\uDC00) == ? // Should the 
 backreference be allowed to match the lead surrogate of a surrogate pair?
 /^(.+)\1$/u.test(\uDC00foobar\uD83D\uDC00foobar\uD83D) == ?? // Should we 
 allow splitting the surrogate pair like this?
 
 Suggestion: a lonely surrogate should not be a character and it should not 
 match . or [^a] etc. However, a lonely surrogate in the input string 
 shouldn't prevent some other part of the string from matching.
 
 If a lonely surrogate is treated as a character, the matching rule for . gets 
 complicated and difficult / slow to implement: . should not match individual 
 surrogates inside a surrogate pair, but if it has to match a lonely 
 surrogate, we'll end up needing lookahead and lookbehind logic to implement 
 that behavior.
 
 For example, the current version of Mathias’s ES6 Unicode regular expression 
 transpiler ( https://mothereff.in/regexpu ) converts /a.b/u into 
 /a(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\u]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])b/
  and afaics it’s not yet fully consistent wrt lonely surrogates, so, a 
 consistent implementation is going to be more complex than this.

This is indeed an incomplete solution. The lack of lookbehind support in ES 
makes this hard to transpile correctly. Ideas welcome!

___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Q: Lonely surrogates and unicode regexps

2015-01-28 Thread André Bargull

On 1/28/2015 2:51 PM, André Bargull wrote:

For a reference, here's how Java (tried w/ Oracle 1.8.0_31 and openjdk
1.7.0_65) Pattern.UNICODE_CHARACTER_CLASS works:

foo\uD834bar and foo\uDC00bar match ^foo[^a]bar$ and ^foo.bar$, so,
generally, lonely surrogates match /./.

Backreferences are allowed to consume the leading surrogate of a valid
surrogate pair:

Ex1: foo\uD834bar\uD834\uDC00 matches foo(.+)bar\1

But surprisingly:

Ex2: \uDC00foobar\uD834\uDC00foobar\uD834 doesn't match ^(.+)\1$

... So Ex2 works as if the input string was converted to UTF-32 before
matching, but Ex1 works as if it was def not. Idk what's the correct mental
model where both Ex1 and Ex2 would make sense.


java.util.regex.Pattern matches back references by comparing (Java) chars [1], but reads patterns 
as a sequence of code points [2]. That should help to explain the differences between ex1 and ex2.


[1] 
http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l4890
[2] 
http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l1671


Err, the part about how patterns are read is not important here. What I should have written is that 
the input string is (also) read as a sequence of code points [3]. So in ex2 `\uD834\uDC00` is read 
as a single code point (and not split into \uD834 and \uDC00 during backtracking).


[3] 
http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l3773

___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Q: Lonely surrogates and unicode regexps

2015-01-28 Thread Marja Hölttä
For a reference, here's how Java (tried w/ Oracle 1.8.0_31 and openjdk
1.7.0_65) Pattern.UNICODE_CHARACTER_CLASS works:

foo\uD834bar and foo\uDC00bar match ^foo[^a]bar$ and ^foo.bar$, so,
generally, lonely surrogates match /./.

Backreferences are allowed to consume the leading surrogate of a valid
surrogate pair:

Ex1: foo\uD834bar\uD834\uDC00 matches foo(.+)bar\1

But surprisingly:

Ex2: \uDC00foobar\uD834\uDC00foobar\uD834 doesn't match ^(.+)\1$

... So Ex2 works as if the input string was converted to UTF-32 before
matching, but Ex1 works as if it was def not. Idk what's the correct mental
model where both Ex1 and Ex2 would make sense.
___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Q: Lonely surrogates and unicode regexps

2015-01-28 Thread Mark Davis ☕️
I think the cleanest mental model is where UTF-16 or UTF-8 strings are
interpreted as if they were transformed into UTF-32.

While that is generally feasible, it often represents a cost in performance
which is not acceptable in practice. So you see various approaches that
involve some deviation from that mental model.


Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*

On Wed, Jan 28, 2015 at 2:15 PM, Marja Hölttä ma...@chromium.org wrote:

 For a reference, here's how Java (tried w/ Oracle 1.8.0_31 and openjdk
 1.7.0_65) Pattern.UNICODE_CHARACTER_CLASS works:

 foo\uD834bar and foo\uDC00bar match ^foo[^a]bar$ and ^foo.bar$, so,
 generally, lonely surrogates match /./.

 Backreferences are allowed to consume the leading surrogate of a valid
 surrogate pair:

 Ex1: foo\uD834bar\uD834\uDC00 matches foo(.+)bar\1

 But surprisingly:

 Ex2: \uDC00foobar\uD834\uDC00foobar\uD834 doesn't match ^(.+)\1$

 ... So Ex2 works as if the input string was converted to UTF-32 before
 matching, but Ex1 works as if it was def not. Idk what's the correct mental
 model where both Ex1 and Ex2 would make sense.


___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Q: Lonely surrogates and unicode regexps

2015-01-28 Thread André Bargull

For a reference, here's how Java (tried w/ Oracle 1.8.0_31 and openjdk
1.7.0_65) Pattern.UNICODE_CHARACTER_CLASS works:

foo\uD834bar and foo\uDC00bar match ^foo[^a]bar$ and ^foo.bar$, so,
generally, lonely surrogates match /./.

Backreferences are allowed to consume the leading surrogate of a valid
surrogate pair:

Ex1: foo\uD834bar\uD834\uDC00 matches foo(.+)bar\1

But surprisingly:

Ex2: \uDC00foobar\uD834\uDC00foobar\uD834 doesn't match ^(.+)\1$

... So Ex2 works as if the input string was converted to UTF-32 before
matching, but Ex1 works as if it was def not. Idk what's the correct mental
model where both Ex1 and Ex2 would make sense.


java.util.regex.Pattern matches back references by comparing (Java) chars [1], but reads patterns as 
a sequence of code points [2]. That should help to explain the differences between ex1 and ex2.


[1] 
http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l4890
[2] 
http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l1671

___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Q: Lonely surrogates and unicode regexps

2015-01-28 Thread Marja Hölttä
Based on Ex1, looks like the input string is not read as a sequence of code
points when we try to find a match for \1. So it's mostly read as a
sequence of code points except when it's not. :/

On Wed, Jan 28, 2015 at 3:11 PM, André Bargull andre.barg...@udo.edu
wrote:

 On 1/28/2015 2:51 PM, André Bargull wrote:

 For a reference, here's how Java (tried w/ Oracle 1.8.0_31 and openjdk
 1.7.0_65) Pattern.UNICODE_CHARACTER_CLASS works:

 foo\uD834bar and foo\uDC00bar match ^foo[^a]bar$ and ^foo.bar$, so,
 generally, lonely surrogates match /./.

 Backreferences are allowed to consume the leading surrogate of a valid
 surrogate pair:

 Ex1: foo\uD834bar\uD834\uDC00 matches foo(.+)bar\1

 But surprisingly:

 Ex2: \uDC00foobar\uD834\uDC00foobar\uD834 doesn't match ^(.+)\1$

 ... So Ex2 works as if the input string was converted to UTF-32 before
 matching, but Ex1 works as if it was def not. Idk what's the correct
 mental
 model where both Ex1 and Ex2 would make sense.


 java.util.regex.Pattern matches back references by comparing (Java) chars
 [1], but reads patterns as a sequence of code points [2]. That should help
 to explain the differences between ex1 and ex2.

 [1] http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/
 c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l4890
 [2] http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/
 c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l1671


 Err, the part about how patterns are read is not important here. What I
 should have written is that the input string is (also) read as a sequence
 of code points [3]. So in ex2 `\uD834\uDC00` is read as a single code point
 (and not split into \uD834 and \uDC00 during backtracking).

 [3] http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/
 c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l3773

___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Q: Lonely surrogates and unicode regexps

2015-01-28 Thread Marja Hölttä
Hello es-discuss,

TL;DR: /foo.bar/u.test(“foo\uD83Dbar”) == ?

The ES6 unicode regexp spec is not very clear regarding what should happen
if the regexp or the matched string contains lonely surrogates (a lead
surrogate without a trail, or a trail without a lead). For example, for the
. operator, the relevant parts of the spec speak about characters:

https://people.mozilla.org/~jorendorff/es6-draft.html#sec-atom
https://people.mozilla.org/~jorendorff/es6-draft.html#sec-runtime-semantics-charactersetmatcher-abstract-operation
https://people.mozilla.org/~jorendorff/es6-draft.html#sec-runtime-semantics-canonicalize-abstract-operation

E.g.,
“Let A be the set of all *characters* except LineTerminator.”
“Let ch be the *character* Input[e].”

But is a lonely surrogate a character? According to the Unicode standard,
it’s not. If it's not, what will ch be if the input string contains a
lonely surrogate in the relevant position?

Q1: Are lonely surrogates allowed in /u regexps?

E.g., /foo\uD83D/u; (note lonely lead surrogate), should this be allowed?
Will it match a lead surrogate inside a surrogate pair?

Suggestion: we shouldn't allow lonely surrogates in /u regexps.

If users actually want to match lonely surrogates (e.g., to check for them
or remove them) then they can use non-/u regexps.

The regexp syntax treats a lonely surrogate as a normal unicode escape, and
the rules say e.g., The production RegExpUnicodeEscapeSequence :: u
Hex4Digits evaluates as follows: Return the character whose code is the SV
of Hex4Digits. - it's also unclear what this means if no valid character
has this code.

Q2: If the string contains a lonely surrogate, what should it match? Should
it match .? Should it match [^a] ? (Or is it undefined behavior?)

Test cases:
/foo.bar/u.test(foo\uD83Dbar) == ?
/foo.bar/u.test(foo\uDC00bar) == ?
/foo[^a]bar/u.test(foo\uD83Dbar) == ?
/foo[^a]bar/u.test(foo\uDC00bar) == ?
/foo/u.test(bar\uD83Dbarfoo) == ?
/foo/u.test(bar\uDC00barfoo) == ?
/foo(.*)bar\1/u.test(foo\uD834bar\uD834\uDC00) == ? // Should the
backreference be allowed to match the lead surrogate of a surrogate pair?
/^(.+)\1$/u.test(\uDC00foobar\uD83D\uDC00foobar\uD83D) == ?? // Should we
allow splitting the surrogate pair like this?

Suggestion: a lonely surrogate should not be a character and it should not
match . or [^a] etc. However, a lonely surrogate in the input string
shouldn't prevent some other part of the string from matching.

If a lonely surrogate is treated as a character, the matching rule for .
gets complicated and difficult / slow to implement: . should not match
individual surrogates inside a surrogate pair, but if it has to match a
lonely surrogate, we'll end up needing lookahead and lookbehind logic to
implement that behavior.

For example, the current version of Mathias’s ES6 Unicode regular
expression transpiler ( https://mothereff.in/regexpu ) converts /a.b/u into
/a(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\u]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])b/
and afaics it’s not yet fully consistent wrt lonely surrogates, so, a
consistent implementation is going to be more complex than this.

If we convert the string into UC-32 before matching, then the lonely
surrogate is a character behavior gets easier to implement, but we
wouldn't want to be forced to do that. The intention behind the ES6 spec
seems to be that strings can / should still be stored as UC-16. Converting
strings to UC-32 before matching with /u regexps would require an
additional pass over the string which we'd want to avoid, and converting
only when strictly needed for the lonely surrogate is a character
implementation adds complexity. E.g., with some regexps we don't need to
scan the whole input string to find a match, and also most input strings,
even for /u regexps, probably won't contain surrogates (to find that out
we'd also need to scan the whole string, or some logic to fall back to
UC-32 matching when we see a surrogate).

BR,
Marja
___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Q: Lonely surrogates and unicode regexps

2015-01-28 Thread André Bargull

On Wed, Jan 28, 2015 at 11:36 AM, Marja Hölttä marja at chromium.org  
https://mail.mozilla.org/listinfo/es-discuss wrote:

/  The ES6 unicode regexp spec is not very clear regarding what should happen
//  if the regexp or the matched string contains lonely surrogates (a lead
//  surrogate without a trail, or a trail without a lead). For example, for the
//  . operator, the relevant parts of the spec speak about characters:
//
/
​Just a bit of terminology.

The term character is overloaded, so Unicode provides the unambiguous
term code point. For example, U+0378​ is not (currently) an encoded
character according to Unicode, but it would certainly be a terrible idea
to disregard it, or not match it. It is a reserved code point that may be
assigned as an encoded character in the future. So both U+D83D and U+0378
are not characters.

If a ES spec uses the term character instead of code point, then at
some point in the text it needs to disambiguate what is meant.


character is defined in 21.2.2 Pattern Semantics [1]:

In the context of describing the behaviour of a BMP pattern “character” means a single 16-bit 
Unicode BMP code point. In the context of describing the behaviour of a Unicode pattern 
“character” means a UTF-16 encoded code point.



[1] https://people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern-semantics
___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Q: Lonely surrogates and unicode regexps

2015-01-28 Thread Mark Davis ☕️
Good, that sounds right.


Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*

On Wed, Jan 28, 2015 at 12:57 PM, André Bargull andre.barg...@udo.edu
wrote:

  On Wed, Jan 28, 2015 at 11:36 AM, Marja Hölttä marja at chromium.org 
 https://mail.mozilla.org/listinfo/es-discuss wrote:

 * The ES6 unicode regexp spec is not very clear regarding what should happen
 ** if the regexp or the matched string contains lonely surrogates (a lead
 ** surrogate without a trail, or a trail without a lead). For example, for 
 the
 ** . operator, the relevant parts of the spec speak about characters:
 *
 ​Just a bit of terminology.

 The term character is overloaded, so Unicode provides the unambiguous
 term code point. For example, U+0378​ is not (currently) an encoded
 character according to Unicode, but it would certainly be a terrible idea
 to disregard it, or not match it. It is a reserved code point that may be
 assigned as an encoded character in the future. So both U+D83D and U+0378
 are not characters.

 If a ES spec uses the term character instead of code point, then at
 some point in the text it needs to disambiguate what is meant.


 character is defined in 21.2.2 Pattern Semantics [1]:

 In the context of describing the behaviour of a BMP pattern “character”
 means a single 16-bit Unicode BMP code point. In the context of describing
 the behaviour of a Unicode pattern “character” means a UTF-16 encoded code
 point.



 [1]
 https://people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern-semantics

___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Q: Lonely surrogates and unicode regexps

2015-01-28 Thread Erik Corry
On Wed, Jan 28, 2015 at 11:45 AM, Mathias Bynens math...@qiwi.be wrote:


  On 28 Jan 2015, at 11:36, Marja Hölttä ma...@chromium.org wrote:
 
  For example, the current version of Mathias’s ES6 Unicode regular
 expression transpiler ( https://mothereff.in/regexpu ) converts /a.b/u
 into
 /a(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\u]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])b/
 and afaics it’s not yet fully consistent wrt lonely surrogates, so, a
 consistent implementation is going to be more complex than this.

 This is indeed an incomplete solution. The lack of lookbehind support in
 ES makes this hard to transpile correctly. Ideas welcome!


I don't think your transpiler can work without lookbehind.  If you could
guarantee that none of your transpiled regexp matches a substring that ends
in the middle of a pair, then I think you could get it right without
lookbehind, but consider:

TxL-TxLT.test(/(...)-\1./);

Where L stands for a lead surrogate, and T stands for a trailing
surrogate.  There's no way to stop the backreference from swallowing the
last L, and without lookbehind there is no way to stop the . from matching
the final T.  A second issue is having a match that starts in the middle of
a pair. You could test for this after the matching if JS gave you the index
of the match in the string, but I don't think it does.

Ignoring the start-of-match-in-the-middle-of-a-pair issue, and the
backreferences case, I think you can do without the backreference.
Assuming the lonely-surrogates-are-a-character scenario, the period (.)
transpiles to (ignore spaces added for readability):

(?:  \L(?!\T)  | \L\T  |  \T  |  [^\L\T\N])

where \L means leading surrogates, \T means trailing surrogates, \N means
all newlines.  Whatever comes before the . is not allowed to match a half

As an optimization, .x can transpile to (?: \L\T | . )x where the x stands
in for any literal characters.

For a JS engine implementor, like Marja, it is of course possible to add
1-character negative lookbehind (\b already has elements of this).  Then
your in-engine transpiler turns . into

(?:  \L(?!\T)  | \L\T  |  (?!\L)\T  |  [^\L\T\N])

Which is going to be truly horrible in terms of code size and performance.
It's not like the period operator is a rare thing in a regexp, and other
common things like [^a-z] and [^\d] will expand into similar horrors.

On the other hand, in the lonely-surrogates-match-nothing scenario, the .
transpiles to

(?: \l\t  |  [^\l\t\n] )

which is quite a lot nicer and faster.  In this scenario, .x expands to (?:
\L\T | [^\T\L\N )  which still has no lookaheads and lookbehinds.
___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Q: Lonely surrogates and unicode regexps

2015-01-28 Thread Mark Davis ☕️
On Wed, Jan 28, 2015 at 11:36 AM, Marja Hölttä ma...@chromium.org wrote:

 The ES6 unicode regexp spec is not very clear regarding what should happen
 if the regexp or the matched string contains lonely surrogates (a lead
 surrogate without a trail, or a trail without a lead). For example, for the
 . operator, the relevant parts of the spec speak about characters:


​Just a bit of terminology.

The term character is overloaded, so Unicode provides the unambiguous
term code point. For example, U+0378​ is not (currently) an encoded
character according to Unicode, but it would certainly be a terrible idea
to disregard it, or not match it. It is a reserved code point that may be
assigned as an encoded character in the future. So both U+D83D and U+0378
are not characters.

If a ES spec uses the term character instead of code point, then at
some point in the text it needs to disambiguate what is meant.

As to how this should be handled in regex expressions, I'd suggest looking
at Java's approach.

Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*
___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Q: Lonely surrogates and unicode regexps

2015-01-28 Thread Wes Garland
Some interesting questions here.

1 - What is a character? Is it a Unicode Code Point?
2 - Should we be able to match all possible JS Strings?
3 - Should we be able to match all possible Unicode Strings?
4 - What do we do if there is a character in a String we cannot match?
5 - Do unmatchable characters match . ?
6 - Are subsections of unmatchable strings matchable if they contain only
matchable characters?

It is important to remember in these discussions that the Unicode
specification allows strings which contain unmatched surrogates. Do we want
regular expressions that can't match some Unicode strings? Do we extend the
regexp syntax to have a symbol which matches an unmatched surrogate?  How
about reserved code points?  What happens when they become assigned?


On 28 January 2015 at 05:36, Marja Hölttä ma...@chromium.org wrote:

 Hello es-discuss,

 TL;DR: /foo.bar/u.test(“foo\uD83Dbar”) == ?

 The ES6 unicode regexp spec is not very clear regarding what should happen
 if the regexp or the matched string contains lonely surrogates (a lead
 surrogate without a trail, or a trail without a lead). For example, for the
 . operator, the relevant parts of the spec speak about characters:

 https://people.mozilla.org/~jorendorff/es6-draft.html#sec-atom

 https://people.mozilla.org/~jorendorff/es6-draft.html#sec-runtime-semantics-charactersetmatcher-abstract-operation

 https://people.mozilla.org/~jorendorff/es6-draft.html#sec-runtime-semantics-canonicalize-abstract-operation

 E.g.,
 “Let A be the set of all *characters* except LineTerminator.”
 “Let ch be the *character* Input[e].”

 But is a lonely surrogate a character? According to the Unicode standard,
 it’s not. If it's not, what will ch be if the input string contains a
 lonely surrogate in the relevant position?

 Q1: Are lonely surrogates allowed in /u regexps?

 E.g., /foo\uD83D/u; (note lonely lead surrogate), should this be allowed?
 Will it match a lead surrogate inside a surrogate pair?

 Suggestion: we shouldn't allow lonely surrogates in /u regexps.

 If users actually want to match lonely surrogates (e.g., to check for them
 or remove them) then they can use non-/u regexps.

 The regexp syntax treats a lonely surrogate as a normal unicode escape,
 and the rules say e.g., The production RegExpUnicodeEscapeSequence :: u
 Hex4Digits evaluates as follows: Return the character whose code is the SV
 of Hex4Digits. - it's also unclear what this means if no valid character
 has this code.

 Q2: If the string contains a lonely surrogate, what should it match?
 Should it match .? Should it match [^a] ? (Or is it undefined behavior?)

 Test cases:
 /foo.bar/u.test(foo\uD83Dbar) == ?
 /foo.bar/u.test(foo\uDC00bar) == ?
 /foo[^a]bar/u.test(foo\uD83Dbar) == ?
 /foo[^a]bar/u.test(foo\uDC00bar) == ?
 /foo/u.test(bar\uD83Dbarfoo) == ?
 /foo/u.test(bar\uDC00barfoo) == ?
 /foo(.*)bar\1/u.test(foo\uD834bar\uD834\uDC00) == ? // Should the
 backreference be allowed to match the lead surrogate of a surrogate pair?
 /^(.+)\1$/u.test(\uDC00foobar\uD83D\uDC00foobar\uD83D) == ?? // Should
 we allow splitting the surrogate pair like this?

 Suggestion: a lonely surrogate should not be a character and it should not
 match . or [^a] etc. However, a lonely surrogate in the input string
 shouldn't prevent some other part of the string from matching.

 If a lonely surrogate is treated as a character, the matching rule for .
 gets complicated and difficult / slow to implement: . should not match
 individual surrogates inside a surrogate pair, but if it has to match a
 lonely surrogate, we'll end up needing lookahead and lookbehind logic to
 implement that behavior.

 For example, the current version of Mathias’s ES6 Unicode regular
 expression transpiler ( https://mothereff.in/regexpu ) converts /a.b/u
 into
 /a(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\u]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])b/
 and afaics it’s not yet fully consistent wrt lonely surrogates, so, a
 consistent implementation is going to be more complex than this.

 If we convert the string into UC-32 before matching, then the lonely
 surrogate is a character behavior gets easier to implement, but we
 wouldn't want to be forced to do that. The intention behind the ES6 spec
 seems to be that strings can / should still be stored as UC-16. Converting
 strings to UC-32 before matching with /u regexps would require an
 additional pass over the string which we'd want to avoid, and converting
 only when strictly needed for the lonely surrogate is a character
 implementation adds complexity. E.g., with some regexps we don't need to
 scan the whole input string to find a match, and also most input strings,
 even for /u regexps, probably won't contain surrogates (to find that out
 we'd also need to scan the whole string, or some logic to fall back to
 UC-32 matching when we see a surrogate).

 BR,
 Marja


 

Re: Q: Lonely surrogates and unicode regexps

2015-01-28 Thread André Bargull

On 1/28/2015 3:36 PM, Marja Hölttä wrote:

Based on Ex1, looks like the input string is not read as a sequence of code 
points when we try to
find a match for \1. So it's mostly read as a sequence of code points except 
when it's not. :/


Yep, back references are matched as a sequence of code units. The first link I've posted points to 
the relevant method in java.util.regex.Pattern. I've got no idea why it's implemented that way, for 
example when you enable case-insensitive matching, back references are no longer matched as a 
sequence of code units:


---
int[] flags = { 0, Pattern.CASE_INSENSITIVE, Pattern.UNICODE_CASE,
Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE };

// Prints true, false, true, false
Arrays.stream(flags).mapToObj(f - Pattern.compile(foo(.+)bar\\1, f))
.map(p - p.matcher(foo\uD834bar\uD834\uDC00).find())
.forEach(System.out::println);
---





On Wed, Jan 28, 2015 at 3:11 PM, André Bargull andre.barg...@udo.edu
mailto:andre.barg...@udo.edu wrote:

On 1/28/2015 2:51 PM, André Bargull wrote:

For a reference, here's how Java (tried w/ Oracle 1.8.0_31 and 
openjdk
1.7.0_65) Pattern.UNICODE_CHARACTER___CLASS works:

foo\uD834bar and foo\uDC00bar match ^foo[^a]bar$ and ^foo.bar$, so,
generally, lonely surrogates match /./.

Backreferences are allowed to consume the leading surrogate of a 
valid
surrogate pair:

Ex1: foo\uD834bar\uD834\uDC00 matches foo(.+)bar\1

But surprisingly:

Ex2: \uDC00foobar\uD834\__uDC00foobar\uD834 doesn't match ^(.+)\1$

... So Ex2 works as if the input string was converted to UTF-32 
before
matching, but Ex1 works as if it was def not. Idk what's the 
correct mental
model where both Ex1 and Ex2 would make sense.


java.util.regex.Pattern matches back references by comparing (Java) 
chars [1], but reads
patterns as a sequence of code points [2]. That should help to explain 
the differences
between ex1 and ex2.

[1]

http://hg.openjdk.java.net/__jdk8u/jdk8u/jdk/file/__c46daef6edb5/src/share/__classes/java/util/regex/__Pattern.java#l4890

http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l4890
[2]

http://hg.openjdk.java.net/__jdk8u/jdk8u/jdk/file/__c46daef6edb5/src/share/__classes/java/util/regex/__Pattern.java#l1671

http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l1671


Err, the part about how patterns are read is not important here. What I 
should have written is
that the input string is (also) read as a sequence of code points [3]. So 
in ex2 `\uD834\uDC00`
is read as a single code point (and not split into \uD834 and \uDC00 during 
backtracking).

[3]

http://hg.openjdk.java.net/__jdk8u/jdk8u/jdk/file/__c46daef6edb5/src/share/__classes/java/util/regex/__Pattern.java#l3773

http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l3773



___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Q: Lonely surrogates and unicode regexps

2015-01-28 Thread Allen Wirfs-Brock

On Jan 28, 2015, at 5:26 AM, Mark Davis ☕️ m...@macchiato.com wrote:

 I think the cleanest mental model is where UTF-16 or UTF-8 strings are 
 interpreted as if they were transformed into UTF-32.

This is exactly the approach used  in the ES6 spec (except that it doesn’t deal 
with UTF-8)

 
 While that is generally feasible, it often represents a cost in performance 
 which is not acceptable in practice. So you see various approaches that 
 involve some deviation from that mental model.

While ES6 uses this approach in its specification, implementations are free to 
use any implementation technique that produces the same result.

Allen


___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Q: Lonely surrogates and unicode regexps

2015-01-28 Thread Marja Hölttä
Cool, thanks for clarifications!

To make sure, as per the intended semantics, we never allow splitting a
valid surrogate pair (= matching only one of the surrogates but not the
other), and thus we'll differ from the Java implementation here:

/foo(.+)bar\1/u.test(foo\uD834bar\uD834\uDC00); we say false, Java says
true.

(In addition, /^(.+)\1$/u.test(\uDC00foobar\uD834\uDC00foobar\uD834) ==
false.)
___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Q: Lonely surrogates and unicode regexps

2015-01-28 Thread Allen Wirfs-Brock

On Jan 28, 2015, at 2:36 AM, Marja Hölttä ma...@chromium.org wrote:

 Hello es-discuss,
 
 TL;DR: /foo.bar/u.test(“foo\uD83Dbar”) == ?
 
 The ES6 unicode regexp spec is not very clear regarding what should happen if 
 the regexp or the matched string contains lonely surrogates (a lead surrogate 
 without a trail, or a trail without a lead). For example, for the . operator, 
 the relevant parts of the spec speak about characters:

TL;DR: in a unicode regexp lonely surrogates are considered to be a single 
“character”. 

As André has already covered “character” has a very specific meaning within the 
context of the ES6 RegExp specification in the second paragraph of  
http://people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern-semantics . 
The specification uses the same set of algorithms to describe both BCP (i.e., 
16-bit elements) and unicode (i.e., 32-bit elements) patterns and matching 
semantics.  “Character” is used in those algorithm to refer to a single element 
of the mode that is currently operating within.

I think the ambiguity you find is in step 2.1 of 
http://people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern :

2.  Return an internal closure that takes two arguments, a String str and an 
integer index, and performs the following:
1. If Unicode is true, let Input be a List consisting of the sequence of code 
points of str interpreted as a UTF-16 encoded Unicode string. Otherwise, let 
Input be a List consisting of the sequence of code units that are the elements 
of str. Input will be used throughout the algorithms in 21.2.2. Each element of 
Input is considered to be a character. 

Apparently I don’t have an adequate definition of “interpreted as a UTF-16 
encoded Unicode string”. If you submit a bug to bugs.emncascript.org) I will 
provided one in the next spec. revisions.  The intended semantics is that:
   In ascending string index order:
Each valid UTF-16 surrogate pair is interpreted as a signal code point 
that is the UTF-16 encoded value
Each “lonely” surrogate is interpreted as  single code point that is 
the surrogate value
Every other 16-bit code unit is interpreted as a single code point.

Allen






 
 https://people.mozilla.org/~jorendorff/es6-draft.html#sec-atom
 https://people.mozilla.org/~jorendorff/es6-draft.html#sec-runtime-semantics-charactersetmatcher-abstract-operation
 https://people.mozilla.org/~jorendorff/es6-draft.html#sec-runtime-semantics-canonicalize-abstract-operation
 
 E.g.,
 “Let A be the set of all *characters* except LineTerminator.”
 “Let ch be the *character* Input[e].”
 
 But is a lonely surrogate a character? According to the Unicode standard, 
 it’s not. If it's not, what will ch be if the input string contains a lonely 
 surrogate in the relevant position?
 
 Q1: Are lonely surrogates allowed in /u regexps?
 
 E.g., /foo\uD83D/u; (note lonely lead surrogate), should this be allowed? 
 Will it match a lead surrogate inside a surrogate pair?
 
 Suggestion: we shouldn't allow lonely surrogates in /u regexps.
 
 If users actually want to match lonely surrogates (e.g., to check for them or 
 remove them) then they can use non-/u regexps.
 
 The regexp syntax treats a lonely surrogate as a normal unicode escape, and 
 the rules say e.g., The production RegExpUnicodeEscapeSequence :: u 
 Hex4Digits evaluates as follows: Return the character whose code is the SV of 
 Hex4Digits. - it's also unclear what this means if no valid character has 
 this code.
 
 Q2: If the string contains a lonely surrogate, what should it match? Should 
 it match .? Should it match [^a] ? (Or is it undefined behavior?)
 
 Test cases:
 /foo.bar/u.test(foo\uD83Dbar) == ?
 /foo.bar/u.test(foo\uDC00bar) == ?
 /foo[^a]bar/u.test(foo\uD83Dbar) == ?
 /foo[^a]bar/u.test(foo\uDC00bar) == ?
 /foo/u.test(bar\uD83Dbarfoo) == ?
 /foo/u.test(bar\uDC00barfoo) == ?
 /foo(.*)bar\1/u.test(foo\uD834bar\uD834\uDC00) == ? // Should the 
 backreference be allowed to match the lead surrogate of a surrogate pair?
 /^(.+)\1$/u.test(\uDC00foobar\uD83D\uDC00foobar\uD83D) == ?? // Should we 
 allow splitting the surrogate pair like this?
 
 Suggestion: a lonely surrogate should not be a character and it should not 
 match . or [^a] etc. However, a lonely surrogate in the input string 
 shouldn't prevent some other part of the string from matching.
 
 If a lonely surrogate is treated as a character, the matching rule for . gets 
 complicated and difficult / slow to implement: . should not match individual 
 surrogates inside a surrogate pair, but if it has to match a lonely 
 surrogate, we'll end up needing lookahead and lookbehind logic to implement 
 that behavior.
 
 For example, the current version of Mathias’s ES6 Unicode regular expression 
 transpiler ( https://mothereff.in/regexpu ) converts /a.b/u into 
 

Re: Q: Lonely surrogates and unicode regexps

2015-01-28 Thread Allen Wirfs-Brock

On Jan 28, 2015, at 4:54 AM, Wes Garland w...@page.ca wrote:

 Some interesting questions here.

These aren't discussion points.  These are all things that must have answers 
that are directly derivable from the ES6 spec.  If, after developing an 
adequate understand of that part of the specification, you can’t find the 
answer to these questions then there is probably something that needs to be 
clarified in the spec.
 
 1 - What is a character? Is it a Unicode Code Point?
defined in: paragraph 2 
http://people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern-semantics 
 2 - Should we be able to match all possible JS Strings?
yes, there is nothing in the algorithms that restrict JS String values
 3 - Should we be able to match all possible Unicode Strings?
yes, subject to what you mean but “Unicode Strings” as within JS Strings 
supplemental code points must be UTF-16 encoded.
 4 - What do we do if there is a character in a String we cannot match?
RegExp.exec returns null if a string cannot be matched by a pattern
 5 - Do unmatchable characters match . ?
there is no concept in the specification of an “unmatchable” character
 6 - Are subsections of unmatchable strings matchable if they contain only 
 matchable characters?
there is no concept in the specification of an “unmatchable” character
 
 It is important to remember in these discussions that the Unicode 
 specification allows strings which contain unmatched surrogates.
and ES6 //u patterns can match them
 Do we want regular expressions that can't match some Unicode strings?
No, the ES6 specificaiton can match all possible strings
 Do we extend the regexp syntax to have a symbol which matches an unmatched 
 surrogate?
we already have it: \u{D83D}
   How about reserved code points?  What happens when they become assigned?
Other than the initial decoding of valid surrogate pairs into 32-bit code 
points, the ES6 //u RegExp spec. applies no semantics to any code points in the 
string that is being matched.

Allen

___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Q: Lonely surrogates and unicode regexps

2015-01-28 Thread André Bargull

Cool, thanks for clarifications!

To make sure, as per the intended semantics, we never allow splitting a
valid surrogate pair (= matching only one of the surrogates but not the
other), and thus we'll differ from the Java implementation here:

/foo(.+)bar\1/u.test(foo\uD834bar\uD834\uDC00); we say false, Java says
true.


Correct, the captures List entry is [\uD834], so when performing 21.2.2.9 AtomEscape, \uD834 is 
matched against \uD834\uDC00 in step 8 which results in a failure state.





(In addition, /^(.+)\1$/u.test(\uDC00foobar\uD834\uDC00foobar\uD834) ==
false.)


Yes, this expression also returns false.
___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss