Updates:
        Status: Accepted
        Labels: Type-FeatureRequest

Comment #3 on issue 761 by [email protected]: Incorrect UTF-8 encoding/decoding for non-BMP characters in String related functions
http://code.google.com/p/v8/issues/detail?id=761

V8 currently accepts only characters in the BMP as input and uses UCS-2 as its internal representation (the same representation as JavaScript strings).

As such, the output is correct: the UTF-8 encoding of <U+D801,U+DC12> is six bytes (three per code unit), even if the code points have no meaning on their own. This is unlikely to change for the standard output functions, since JavaScript strings are inherently UCS-2. If someone uses V8 and knows that a string really contains UTF-16 encoded data, they need to add their own output function that parses the string data, converts it to whatever is needed, and can handle malformed UTF-16 data.
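
To make the difference concrete, here is a rough sketch in Python (not V8 code). The helper name code_units_to_utf8 is hypothetical, and replacing lone surrogates with U+FFFD is only one assumed policy for malformed UTF-16; an embedder could just as well raise an error.

```python
def code_units_to_utf8(units):
    """Encode 16-bit code units as UTF-8, combining surrogate pairs.

    Assumption: a lone (unpaired) surrogate is replaced with U+FFFD.
    """
    out = bytearray()
    i = 0
    while i < len(units):
        u = units[i]
        has_next = i + 1 < len(units)
        if 0xD800 <= u <= 0xDBFF and has_next and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Well-formed pair: combine into one supplementary code point.
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            # Malformed UTF-16: lone surrogate.
            cp = 0xFFFD
            i += 1
        else:
            cp = u
            i += 1
        out += chr(cp).encode("utf-8")
    return bytes(out)


# UCS-2 view: each code unit is encoded on its own, three bytes apiece.
per_unit = "\ud801\udc12".encode("utf-8", "surrogatepass")
# UTF-16 view: the pair is one code point, four bytes.
paired = code_units_to_utf8([0xD801, 0xDC12])

print(per_unit.hex(" "))  # ed a0 81 ed b0 92
print(paired.hex(" "))    # f0 90 90 92
```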


The input side is a different matter: the UTF-8 sequence is correctly parsed as U+10412, but the value is then silently truncated to 16 bits when building the string. That's not helpful behavior.
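
A small Python illustration of that truncation, assuming it amounts to keeping the low 16 bits of the decoded value (the exact mechanism inside V8 is not shown here):

```python
# UTF-8 bytes for the single code point U+10412.
utf8_input = b"\xf0\x90\x90\x92"

code_point = ord(utf8_input.decode("utf-8"))

# Assumption: truncation keeps only the low 16 bits.
truncated = code_point & 0xFFFF

print(f"U+{code_point:04X} -> U+{truncated:04X}")  # U+10412 -> U+0412
```

The result is a different, perfectly valid BMP character, so nothing downstream can tell that data was lost.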

There are two things we can do here:
1 - Make it an error to enter characters outside the BMP. That avoids the silent truncation.
2 - Convert the input to UTF-16, using surrogate pairs for non-BMP code points, knowing that it will be treated as UCS-2 internally.

The latter isn't as dangerous as it seems: all valid BMP-only UTF-8 text is unchanged, characters outside the BMP are carried through (without V8 actually understanding them), and input containing UTF-8 encodings of code points in the surrogate range is invalid anyway.
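
Both options can be sketched in a few lines of Python. This is an illustration of the idea, not the V8 decoder: the function name utf8_to_code_units and the reject_non_bmp flag are made up for the example, and it leans on Python's strict UTF-8 decoder, which already rejects encoded surrogates.

```python
def utf8_to_code_units(data, reject_non_bmp=False):
    """Decode UTF-8 bytes into a list of 16-bit code units.

    reject_non_bmp=True  -> option 1: non-BMP input is an error.
    reject_non_bmp=False -> option 2: non-BMP input becomes a surrogate pair.
    """
    units = []
    # Strict decoding raises UnicodeDecodeError on malformed input,
    # including UTF-8 encodings of the surrogate range.
    for ch in data.decode("utf-8"):
        cp = ord(ch)
        if cp <= 0xFFFF:
            units.append(cp)  # BMP text passes through unchanged
        elif reject_non_bmp:
            raise ValueError(f"non-BMP code point U+{cp:04X}")
        else:
            v = cp - 0x10000
            units.append(0xD800 + (v >> 10))    # high (lead) surrogate
            units.append(0xDC00 + (v & 0x3FF))  # low (trail) surrogate
    return units


print([hex(u) for u in utf8_to_code_units(b"\xf0\x90\x90\x92")])
# ['0xd801', '0xdc12']
```

With option 2 the string ends up holding the same two code units that a JavaScript program would see for this character, so nothing is lost on input.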

I'm leaning towards the second option.

--
v8-dev mailing list
[email protected]
http://groups.google.com/group/v8-dev
