[v8-dev] Re: Issue 761 in v8: Incorrect UTF-8 encoding/decoding for non-BMP characters in String related functions

codesite-noreply Wed, 30 Jun 2010 06:13:16 -0700

Updates:
        Status: Accepted
        Labels: Type-FeatureRequest

Comment #3 on issue 761 by [email protected]: Incorrect UTF-8 encoding/decoding for non-BMP characters in String related functions

http://code.google.com/p/v8/issues/detail?id=761

V8 currently only accepts characters in the BMP as input, using UCS-2 as its internal representation (the same representation as JavaScript strings).

As such, the output is correct (the UTF-8 encoding of <U+D801,U+DC12> is six bytes, even if the code points have no meaning). This is unlikely to change for the standard output functions, since JavaScript strings are inherently UCS-2. If someone uses V8 and knows that a string really contains UTF-16 encoded data, they need to add their own output function that parses the string data and converts it to whatever is needed, and which can handle malformed UTF-16 data.
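To make the six-byte result concrete, here is a minimal Python sketch (not V8 code) that encodes each 16-bit code unit on its own, the way a UCS-2 based writer would, and compares it with the four-byte UTF-8 form of the code point U+10412. The helper name encode_code_units_as_utf8 is made up for this illustration.

```python
def encode_code_units_as_utf8(units):
    """Encode each 16-bit code unit independently (UCS-2 view of the string).

    Hypothetical helper for illustration; surrogates get no special treatment.
    """
    out = bytearray()
    for u in units:
        if u < 0x80:
            out.append(u)                         # 1 byte: 0xxxxxxx
        elif u < 0x800:
            out.append(0xC0 | (u >> 6))           # 2 bytes: 110xxxxx 10xxxxxx
            out.append(0x80 | (u & 0x3F))
        else:
            out.append(0xE0 | (u >> 12))          # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out.append(0x80 | ((u >> 6) & 0x3F))
            out.append(0x80 | (u & 0x3F))
    return bytes(out)


ucs2_style = encode_code_units_as_utf8([0xD801, 0xDC12])
utf16_aware = chr(0x10412).encode("utf-8")

print(ucs2_style.hex(), len(ucs2_style))    # eda081edb092 6
print(utf16_aware.hex(), len(utf16_aware))  # f0909092 4
```

A UTF-16 aware output function of the kind described above would have to pair the two code units first and emit the four-byte form instead.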

However, the input is correctly parsed as U+00010412, but is then silently truncated to 16 bits when building the string. That's not helpful behavior.
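As a small illustration of why the truncation is harmful, the Python sketch below assumes that "truncated to 16 bits" means keeping only the low 16 bits of the code point. That mechanism is an assumption made for the example, and truncate_to_16_bits is a hypothetical name, not a V8 function.

```python
def truncate_to_16_bits(code_point):
    """Keep only the low 16 bits (assumed model of the silent truncation)."""
    return code_point & 0xFFFF


parsed = 0x10412                      # code point decoded from the UTF-8 input
stored = truncate_to_16_bits(parsed)  # what ends up in the 16-bit string

print(f"U+{parsed:04X} -> U+{stored:04X}")  # U+10412 -> U+0412
```

Under that assumption the string ends up holding a valid but unrelated BMP character, and nothing signals that information was lost.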


There are two things we can do here:

1 - Make it an error to enter characters outside the BMP. That avoids the silent truncation.
2 - Convert the input to UTF-16, using surrogate pairs for non-BMP code points, knowing that it will be treated as UCS-2 internally.

The latter isn't as dangerous as it seems, since all valid BMP-only UTF-8 texts will be unchanged, it will handle characters outside of the BMP (without actually understanding them), and an input containing UTF-8 encodings of the surrogate pair range is invalid anyway.
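A rough Python sketch of the second option might look like the following. It is only an illustration of the idea, not the V8 implementation: utf8_to_utf16_units is a hypothetical name, and Python's strict UTF-8 decoder stands in for the input parser, since it already rejects UTF-8 encodings of the surrogate range.

```python
def utf8_to_utf16_units(data):
    """Decode UTF-8 bytes into a list of 16-bit code units.

    Hypothetical helper: non-BMP code points become surrogate pairs,
    BMP code points pass through unchanged.
    """
    text = data.decode("utf-8")  # strict mode raises on encoded surrogates
    units = []
    for ch in text:
        cp = ord(ch)
        if cp <= 0xFFFF:
            units.append(cp)
        else:
            v = cp - 0x10000                    # 20-bit offset above the BMP
            units.append(0xD800 | (v >> 10))    # high (lead) surrogate
            units.append(0xDC00 | (v & 0x3FF))  # low (trail) surrogate
    return units


print([hex(u) for u in utf8_to_utf16_units(bytes.fromhex("f0909092"))])
# ['0xd801', '0xdc12']
```

BMP-only input produces exactly one code unit per character, so existing texts are unaffected. Input such as the bytes ED A0 81 (an encoded surrogate) raises UnicodeDecodeError in this sketch rather than being passed through.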


I'm leaning towards the second option.

--
v8-dev mailing list
[email protected]
http://groups.google.com/group/v8-dev
