Re: Q: Lonely surrogates and unicode regexps

2015-01-28 Thread Marja Hölttä
For reference, here's how Java (tried w/ Oracle 1.8.0_31 and openjdk
1.7.0_65) Pattern.UNICODE_CHARACTER_CLASS works:

foo\uD834bar and foo\uDC00bar match ^foo[^a]bar$ and ^foo.bar$, so,
generally, lonely surrogates match /./.

Backreferences are allowed to consume the leading surrogate of a valid
surrogate pair:

Ex1: foo\uD834bar\uD834\uDC00 matches foo(.+)bar\1

But surprisingly:

Ex2: \uDC00foobar\uD834\uDC00foobar\uD834 doesn't match ^(.+)\1$

... So Ex2 works as if the input string was converted to UTF-32 before
matching, but Ex1 works as if it was def not. Idk what's the correct mental
model where both Ex1 and Ex2 would make sense.
___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Q: Lonely surrogates and unicode regexps

2015-01-28 Thread Marja Hölttä
Based on Ex1, looks like the input string is not read as a sequence of code
points when we try to find a match for \1. So it's mostly read as a
sequence of code points except when it's not. :/

On Wed, Jan 28, 2015 at 3:11 PM, André Bargull andre.barg...@udo.edu
wrote:

 On 1/28/2015 2:51 PM, André Bargull wrote:

 For a reference, here's how Java (tried w/ Oracle 1.8.0_31 and openjdk
 1.7.0_65) Pattern.UNICODE_CHARACTER_CLASS works:

 foo\uD834bar and foo\uDC00bar match ^foo[^a]bar$ and ^foo.bar$, so,
 generally, lonely surrogates match /./.

 Backreferences are allowed to consume the leading surrogate of a valid
 surrogate pair:

 Ex1: foo\uD834bar\uD834\uDC00 matches foo(.+)bar\1

 But surprisingly:

 Ex2: \uDC00foobar\uD834\uDC00foobar\uD834 doesn't match ^(.+)\1$

 ... So Ex2 works as if the input string was converted to UTF-32 before
 matching, but Ex1 works as if it was def not. Idk what's the correct mental
 model where both Ex1 and Ex2 would make sense.


 java.util.regex.Pattern matches back references by comparing (Java) chars
 [1], but reads patterns as a sequence of code points [2]. That should help
 to explain the differences between ex1 and ex2.

 [1] http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l4890
 [2] http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l1671


 Err, the part about how patterns are read is not important here. What I
 should have written is that the input string is (also) read as a sequence
 of code points [3]. So in ex2 `\uD834\uDC00` is read as a single code point
 (and not split into \uD834 and \uDC00 during backtracking).

 [3] http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l3773



Q: Lonely surrogates and unicode regexps

2015-01-28 Thread Marja Hölttä
Hello es-discuss,

TL;DR: /foo.bar/u.test("foo\uD83Dbar") == ?

The ES6 unicode regexp spec is not very clear regarding what should happen
if the regexp or the matched string contains lonely surrogates (a lead
surrogate without a trail, or a trail without a lead). For example, for the
. operator, the relevant parts of the spec speak about characters:

https://people.mozilla.org/~jorendorff/es6-draft.html#sec-atom
https://people.mozilla.org/~jorendorff/es6-draft.html#sec-runtime-semantics-charactersetmatcher-abstract-operation
https://people.mozilla.org/~jorendorff/es6-draft.html#sec-runtime-semantics-canonicalize-abstract-operation

E.g.,
“Let A be the set of all *characters* except LineTerminator.”
“Let ch be the *character* Input[e].”

But is a lonely surrogate a character? According to the Unicode standard,
it’s not. If it's not, what will ch be if the input string contains a
lonely surrogate in the relevant position?

Q1: Are lonely surrogates allowed in /u regexps?

E.g., /foo\uD83D/u; (note lonely lead surrogate), should this be allowed?
Will it match a lead surrogate inside a surrogate pair?

Suggestion: we shouldn't allow lonely surrogates in /u regexps.

If users actually want to match lonely surrogates (e.g., to check for them
or remove them) then they can use non-/u regexps.
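
A minimal sketch of the non-/u route, for illustration only. The helper name hasLoneSurrogate is hypothetical, and the pattern simply reuses the two lone-surrogate alternatives that appear in the regexpu output quoted later in this mail. It only detects lone surrogates; it does not remove them.

```javascript
// Hypothetical helper: detects lone surrogates with a plain (non-/u) regexp.
// Without the u flag the pattern is matched code unit by code unit.
const LONE_SURROGATE =
  /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]/;

function hasLoneSurrogate(str) {
  return LONE_SURROGATE.test(str);
}

console.log(hasLoneSurrogate("foo\uD83Dbar"));       // true  (lone lead)
console.log(hasLoneSurrogate("foo\uDC00bar"));       // true  (lone trail)
console.log(hasLoneSurrogate("foo\uD83D\uDC00bar")); // false (valid pair)
```

The first alternative needs a lookahead and the second has to consume the preceding code unit (or the start of the string), which is why this form is convenient for checking but awkward for replacing.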

The regexp syntax treats a lonely surrogate as a normal unicode escape, and
the rules say e.g., "The production RegExpUnicodeEscapeSequence :: u
Hex4Digits evaluates as follows: Return the character whose code is the SV
of Hex4Digits." It's also unclear what this means if no valid character
has this code.

Q2: If the string contains a lonely surrogate, what should it match? Should
it match .? Should it match [^a] ? (Or is it undefined behavior?)

Test cases:
/foo.bar/u.test("foo\uD83Dbar") == ?
/foo.bar/u.test("foo\uDC00bar") == ?
/foo[^a]bar/u.test("foo\uD83Dbar") == ?
/foo[^a]bar/u.test("foo\uDC00bar") == ?
/foo/u.test("bar\uD83Dbarfoo") == ?
/foo/u.test("bar\uDC00barfoo") == ?
/foo(.*)bar\1/u.test("foo\uD834bar\uD834\uDC00") == ? // Should the
backreference be allowed to match the lead surrogate of a surrogate pair?
/^(.+)\1$/u.test("\uDC00foobar\uD83D\uDC00foobar\uD83D") == ? // Should we
allow splitting the surrogate pair like this?

Suggestion: a lonely surrogate should not be a character and it should not
match . or [^a] etc. However, a lonely surrogate in the input string
shouldn't prevent some other part of the string from matching.

If a lonely surrogate is treated as a character, the matching rule for .
gets complicated and difficult / slow to implement: . should not match
individual surrogates inside a surrogate pair, but if it has to match a
lonely surrogate, we'll end up needing lookahead and lookbehind logic to
implement that behavior.
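
To make the lookahead / lookbehind point concrete, here is a rough sketch of what a code-unit-based matcher would have to check for a single . under the "lonely surrogate is a character" reading. The names isLead, isTrail and dotWidthAt are made up for this illustration, and line terminators are ignored to keep it short; it is not taken from any engine.

```javascript
// Hypothetical helpers; the string is inspected one UTF-16 code unit at a time.
const isLead = (cu) => cu >= 0xD800 && cu <= 0xDBFF;
const isTrail = (cu) => cu >= 0xDC00 && cu <= 0xDFFF;

// Returns how many code units "." would consume at index i, or 0 when
// nothing can match there (end of input, or the middle of a valid pair).
// Line terminators are deliberately not handled in this sketch.
function dotWidthAt(str, i) {
  if (i < 0 || i >= str.length) return 0;
  const cu = str.charCodeAt(i);
  if (isLead(cu)) {
    // Lookahead: a lead followed by a trail is one two-unit character.
    return i + 1 < str.length && isTrail(str.charCodeAt(i + 1)) ? 2 : 1;
  }
  if (isTrail(cu)) {
    // Lookbehind: a trail preceded by a lead belongs to that pair.
    return i > 0 && isLead(str.charCodeAt(i - 1)) ? 0 : 1;
  }
  return 1;
}

console.log(dotWidthAt("a\uD834\uDC00b", 1)); // 2 (whole pair)
console.log(dotWidthAt("a\uD834\uDC00b", 2)); // 0 (inside the pair)
console.log(dotWidthAt("a\uDC00b", 1));       // 1 (lone trail)
```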

For example, the current version of Mathias’s ES6 Unicode regular
expression transpiler ( https://mothereff.in/regexpu ) converts /a.b/u into
/a(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])b/
and afaics it’s not yet fully consistent wrt lonely surrogates, so, a
consistent implementation is going to be more complex than this.

If we convert the string into UTF-32 before matching, then the "lonely
surrogate is a character" behavior gets easier to implement, but we
wouldn't want to be forced to do that. The intention behind the ES6 spec
seems to be that strings can / should still be stored as UTF-16. Converting
strings to UTF-32 before matching with /u regexps would require an
additional pass over the string which we'd want to avoid, and converting
only when strictly needed for the "lonely surrogate is a character"
implementation adds complexity. E.g., with some regexps we don't need to
scan the whole input string to find a match, and also most input strings,
even for /u regexps, probably won't contain surrogates (to find that out
we'd also need to scan the whole string, or some logic to fall back to
UTF-32 matching when we see a surrogate).
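
For illustration, the extra pass could look roughly like the sketch below. The function names toCodePoints and containsSurrogate are hypothetical, and the sketch relies on string iteration walking the string code point by code point; it says nothing about how an engine would actually do this internally.

```javascript
// One extra pass over a UTF-16 string, producing an array of code points.
// Lone surrogates survive as single entries in the range 0xD800-0xDFFF.
function toCodePoints(str) {
  const out = [];
  for (const ch of str) {          // string iteration walks code points
    out.push(ch.codePointAt(0));
  }
  return out;
}

// Cheaper pre-check: is there any surrogate code unit in the string at all?
// This still has to scan the whole string when the answer is "no".
function containsSurrogate(str) {
  return /[\uD800-\uDFFF]/.test(str);
}

const input = "\uDC00foobar\uD83D\uDC00foobar\uD83D";
console.log(input.length);               // 16 code units
console.log(toCodePoints(input).length); // 15 code points
console.log(containsSurrogate("foobar")); // false
```

The odd code point count is one way to see why a pattern like ^(.+)\1$ cannot match this input if valid pairs are never split.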

BR,
Marja


Re: Q: Lonely surrogates and unicode regexps

2015-01-28 Thread Marja Hölttä
Cool, thanks for clarifications!

To make sure, as per the intended semantics, we never allow splitting a
valid surrogate pair (= matching only one of the surrogates but not the
other), and thus we'll differ from the Java implementation here:

/foo(.+)bar\1/u.test("foo\uD834bar\uD834\uDC00"); we say false, Java says
true.

(In addition, /^(.+)\1$/u.test("\uDC00foobar\uD834\uDC00foobar\uD834") ==
false.)
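
A small sketch for checking these two cases in an engine that implements the u flag with the semantics described above (the variable names ex1 and ex2 are just labels for the two examples from this thread). The non-/u runs are included only for contrast, since without the flag matching works on individual code units.

```javascript
const ex1 = "foo\uD834bar\uD834\uDC00";
const ex2 = "\uDC00foobar\uD834\uDC00foobar\uD834";

// With /u, a valid surrogate pair is never split.
console.log(/foo(.+)bar\1/u.test(ex1)); // false
console.log(/^(.+)\1$/u.test(ex2));     // false

// Without /u, each surrogate is matched as its own code unit.
console.log(/foo(.+)bar\1/.test(ex1));  // true
console.log(/^(.+)\1$/.test(ex2));      // true
```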