Updates:
Status: Accepted
Labels: Type-FeatureRequest
Comment #3 on issue 761 by [email protected]: Incorrect UTF-8
encoding/decoding for non-BMP characters in String related functions
http://code.google.com/p/v8/issues/detail?id=761
V8 currently only accepts characters in the BMP as input, using UCS-2 as
internal representation (the same representation as JavaScript strings).
As such, the output is correct (the UTF-8 encoding of <U+D801,U+DC12> is
six bytes, even if the code points have no meaning).
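To make the byte count concrete, here is a minimal sketch of an encoder that treats every 16-bit unit as an independent code point, which is what a UCS-2 view of the string implies. The name EncodeUcs2AsUtf8 is hypothetical and this is not V8's actual output code.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: encodes each 16-bit unit on its own, with no
// knowledge of surrogate pairs (a UCS-2 view of the data).
std::string EncodeUcs2AsUtf8(const std::vector<uint16_t>& units) {
  std::string out;
  for (uint16_t u : units) {
    if (u < 0x80) {
      // One byte: 0xxxxxxx
      out.push_back(static_cast<char>(u));
    } else if (u < 0x800) {
      // Two bytes: 110xxxxx 10xxxxxx
      out.push_back(static_cast<char>(0xC0 | (u >> 6)));
      out.push_back(static_cast<char>(0x80 | (u & 0x3F)));
    } else {
      // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
      // Surrogate values (0xD800..0xDFFF) land here like any other unit.
      out.push_back(static_cast<char>(0xE0 | (u >> 12)));
      out.push_back(static_cast<char>(0x80 | ((u >> 6) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | (u & 0x3F)));
    }
  }
  // A single 16-bit unit never needs more than three UTF-8 bytes.
  assert(out.size() <= 3 * units.size());
  return out;
}
```

With the input <U+D801,U+DC12> this yields the six bytes ED A0 81 ED B0 92, three per unit.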
This is unlikely to change for the standard output functions, since
JavaScript strings are inherently UCS-2. If someone uses V8 and knows that
a string really contains UTF-16 encoded data, they need to add their own
output function that parses the string data and converts it to whatever is
needed, and which can handle malformed UTF-16 data.
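As an illustration of what such a user-supplied output function might look like, here is a sketch that reads the string data as UTF-16 and emits UTF-8. The names Utf16ToUtf8 and AppendUtf8 are hypothetical, and replacing unpaired surrogates with U+FFFD is only one possible policy for malformed data, chosen here as an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Appends the UTF-8 encoding of one code point (at most U+10FFFF).
void AppendUtf8(uint32_t cp, std::string& out) {
  assert(cp <= 0x10FFFF);
  if (cp < 0x80) {
    out.push_back(static_cast<char>(cp));
  } else if (cp < 0x800) {
    out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else if (cp < 0x10000) {
    out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else {
    out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  }
}

// Hypothetical output function: interprets the units as UTF-16.
// Assumed policy: an unpaired surrogate becomes U+FFFD.
std::string Utf16ToUtf8(const std::vector<uint16_t>& units) {
  std::string out;
  for (std::size_t i = 0; i < units.size(); ++i) {
    uint32_t cp = units[i];
    const bool is_lead = cp >= 0xD800 && cp <= 0xDBFF;
    const bool is_trail = cp >= 0xDC00 && cp <= 0xDFFF;
    if (is_lead && i + 1 < units.size() &&
        units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF) {
      // Well-formed pair: combine into one supplementary code point.
      cp = 0x10000 + ((cp - 0xD800) << 10) + (units[i + 1] - 0xDC00);
      ++i;
    } else if (is_lead || is_trail) {
      // Malformed UTF-16: lone lead or lone trail surrogate.
      cp = 0xFFFD;
    }
    AppendUtf8(cp, out);
  }
  return out;
}
```

Here <U+D801,U+DC12> comes out as the four bytes F0 90 90 92, while a lone U+D801 comes out as EF BF BD.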
However, the input is correctly parsed as U+00010412, but is then silently
truncated to 16 bits when building the string. That's not helpful behavior.
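The effect of the truncation can be sketched as follows. The function names are hypothetical, the decoder assumes a well-formed four-byte sequence, and this only models the behavior described above rather than reproducing V8's parser.

```cpp
#include <cassert>
#include <cstdint>

// Decodes one four-byte UTF-8 sequence (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx).
// Assumes the continuation bytes are well-formed.
uint32_t DecodeUtf8FourBytes(const unsigned char* b) {
  assert((b[0] & 0xF8) == 0xF0);
  return ((b[0] & 0x07u) << 18) | ((b[1] & 0x3Fu) << 12) |
         ((b[2] & 0x3Fu) << 6) | (b[3] & 0x3Fu);
}

// Models the silent truncation: only the low 16 bits survive.
uint16_t TruncateToUcs2(uint32_t code_point) {
  return static_cast<uint16_t>(code_point);
}
```

The bytes F0 90 90 92 decode to 0x10412, and truncating that gives 0x0412, which is an unrelated BMP character (CYRILLIC CAPITAL LETTER VE).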
There are two things we can do here:
1 - Make it an error to enter characters outside the BMP. That avoids the
silent truncation.
2 - Convert the input to UTF-16, using surrogate pairs for non-BMP code
points, knowing that it will be treated as UCS-2 internally.
The latter isn't as dangerous as it seems: all valid BMP-only UTF-8 texts
will be unchanged, characters outside of the BMP will be handled (without
V8 actually understanding them), and an input containing UTF-8 encodings
of the surrogate pair range is invalid anyway.
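For reference, the two options can be sketched like this. Both function names are hypothetical, and the code assumes the decoder has already produced a valid code point (at most U+10FFFF, not itself in the surrogate range).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Option 1 sketch: only BMP code points are accepted; the caller would
// report an error for anything else instead of truncating it.
bool IsAcceptedUnderOption1(uint32_t code_point) {
  return code_point < 0x10000;
}

// Option 2 sketch: append the code point as UTF-16 code units.
void AppendAsUtf16(uint32_t code_point, std::vector<uint16_t>& out) {
  assert(code_point <= 0x10FFFF);
  if (code_point < 0x10000) {
    // BMP code points map to a single unit, so BMP-only text is unchanged.
    out.push_back(static_cast<uint16_t>(code_point));
    return;
  }
  // Supplementary code points: split the 20-bit offset into two halves.
  const uint32_t offset = code_point - 0x10000;
  out.push_back(static_cast<uint16_t>(0xD800 + (offset >> 10)));    // lead
  out.push_back(static_cast<uint16_t>(0xDC00 + (offset & 0x3FF)));  // trail
}
```

With this, U+10412 is stored as the pair <U+D801,U+DC12> rather than being cut down to U+0412.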
I'm leaning towards the second option.
--
v8-dev mailing list
[email protected]
http://groups.google.com/group/v8-dev