Updates:
Status: Accepted
Labels: Priority-Low
Comment #14 on issue 761 by [email protected]: Incorrect UTF-8
encoding/decoding for non-BMP characters in String related functions
http://code.google.com/p/v8/issues/detail?id=761
I'm not sure what you are trying to send here. The "\u" suggests that it's
part of a string, but in that case it should be followed by ASCII hex digits.
You can't send the character U+1D356 to the V8 JavaScript engine, since it
simply doesn't recognize code points outside the BMP.
Since you are running in a browser, the above discussion doesn't apply -
that was about the V8 API. When running in the browser, UTF-8 decoding is
generally handled by WebKit.
If you want to send the two 16-bit words D834 and DF56 (the surrogate pair
for U+1D356), and the browser will be the one interpreting the data first,
send the UTF-8 encoding of the character as part of a normal HTML file or
JS file. The browser then expands it into the two surrogate code units
before passing it to V8. This only works for valid character encodings (my
U+10000 above should be encoded as F0 90 80 80; then it works too).
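For reference, the arithmetic behind those numbers can be sketched in plain
JavaScript. This is only an illustration of the standard UTF-16 and UTF-8
rules; the helper names toSurrogatePair and toUtf8Bytes are made up for this
example and are not part of V8 or WebKit.

```javascript
// Illustrative helpers (hypothetical names, not a V8 or WebKit API).

// Split a supplementary code point (U+10000..U+10FFFF) into the two
// UTF-16 surrogate code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not a supplementary code point");
  }
  const offset = codePoint - 0x10000;     // 20 significant bits remain
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

// Encode the same kind of code point as a four-byte UTF-8 sequence.
function toUtf8Bytes(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not a supplementary code point");
  }
  return [
    0xF0 | (codePoint >> 18),
    0x80 | ((codePoint >> 12) & 0x3F),
    0x80 | ((codePoint >> 6) & 0x3F),
    0x80 | (codePoint & 0x3F),
  ];
}

const hex = (n) => n.toString(16).toUpperCase();
console.log(toSurrogatePair(0x1D356).map(hex).join(" ")); // D834 DF56
console.log(toUtf8Bytes(0x10000).map(hex).join(" "));     // F0 90 80 80
```

So a page or script that contains the four UTF-8 bytes for U+1D356 would be
expected to reach V8 as the two code units D834 and DF56.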
I haven't checked whether Chrome does something else to characters coming
through a web-socket, but I would try the same thing there.
If you are embedding V8 directly and creating strings through the API, it's
a different matter, because you then use the V8 UTF-8 decoder, which turns
any non-BMP character into U+FFFD. That's the behavior we might consider
changing (if it can be done without breaking the parser/preparser
interaction), but it's not a high priority. I'll reopen this feature
request.
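To make the described behavior concrete, here is a rough sketch of a UTF-8
decoder that only produces BMP code units and substitutes U+FFFD for every
four-byte sequence. It is NOT V8's actual decoder; the function name
decodeUtf8BmpOnly is hypothetical, and the sketch assumes mostly well-formed
input (continuation bytes are not validated).

```javascript
// Hypothetical BMP-only decoder, written only to illustrate the behavior
// described above. This is not V8's implementation.
function decodeUtf8BmpOnly(bytes) {
  const units = [];
  let i = 0;
  while (i < bytes.length) {
    const b0 = bytes[i];
    if (b0 < 0x80) {
      // One byte: ASCII.
      units.push(b0);
      i += 1;
    } else if ((b0 & 0xE0) === 0xC0 && i + 1 < bytes.length) {
      // Two bytes: U+0080..U+07FF.
      units.push(((b0 & 0x1F) << 6) | (bytes[i + 1] & 0x3F));
      i += 2;
    } else if ((b0 & 0xF0) === 0xE0 && i + 2 < bytes.length) {
      // Three bytes: U+0800..U+FFFF.
      units.push(((b0 & 0x0F) << 12) |
                 ((bytes[i + 1] & 0x3F) << 6) |
                 (bytes[i + 2] & 0x3F));
      i += 3;
    } else if ((b0 & 0xF8) === 0xF0 && i + 3 < bytes.length) {
      // Four bytes: a non-BMP character, replaced by U+FFFD.
      units.push(0xFFFD);
      i += 4;
    } else {
      // Stray or truncated byte: also replaced (an assumption of this sketch).
      units.push(0xFFFD);
      i += 1;
    }
  }
  return String.fromCharCode(...units);
}

// "A", then the UTF-8 bytes for U+1D356, then "B".
const decoded = decodeUtf8BmpOnly([0x41, 0xF0, 0x9D, 0x8D, 0x96, 0x42]);
console.log(decoded === "A\uFFFDB"); // true
```

A decoder that preserved non-BMP characters would instead push the two
surrogate code units in the four-byte branch.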
--
v8-dev mailing list
[email protected]
http://groups.google.com/group/v8-dev