Comment #18 on issue 761 by [email protected]: Incorrect UTF-8 encoding/decoding for non-BMP characters in String related functions
http://code.google.com/p/v8/issues/detail?id=761

The \uXXXX sequence is recognized only in ECMAScript string and RegExp literals and in identifiers. It is always a six-character sequence: '\u' followed by four ASCII hex digits. The example above used four-byte sequences containing non-ASCII bytes rather than ASCII hex digits, and they were not obviously inside a string or RegExp literal.
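To illustrate the distinction: a code point outside the BMP (above U+FFFF) cannot be written as a single \uXXXX escape at all; in ECMAScript 3/5 it must be written as a surrogate pair of two such escapes, each with exactly four ASCII hex digits. A small sketch (using U+1D11E, the musical G clef, as an example character):

```javascript
// U+1D11E cannot be expressed as one \uXXXX escape; it is written as
// the UTF-16 surrogate pair \uD834\uDD1E inside a string literal.
var clef = "\uD834\uDD1E";

// String length counts UTF-16 code units, so one non-BMP character
// contributes two to .length.
clef.length;               // 2
clef.charCodeAt(0);        // 0xD834 (high surrogate)
clef.charCodeAt(1);        // 0xDD1E (low surrogate)
```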

In any case, I agree that we should have consistent behavior. If non-BMP code points encoded as UTF-8 are treated one way when entering the browser as part of a script, but differently when arriving over a WebSocket, that's a problem. I'd say it's the browser code's responsibility to normalize the data the same way before passing it on to JavaScript.

I'll see if I can reproduce it locally, and then I'll open a Chromium bug for it (or you can go ahead and do that, since you already have an example). Then we'll see whether it should be handled inside WebKit or non-V8 Chromium (like other incoming UTF-8 data), or delegated to V8 (in which case our UTF-8 decoder needs changing).
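For reference, the transformation a correct decoder would have to perform for a 4-byte UTF-8 sequence can be sketched as follows. This is not V8's actual decoder, just the standard arithmetic: the four-byte sequence F0 9D 84 9E encodes U+1D11E, which a JavaScript string must hold as a UTF-16 surrogate pair.

```javascript
// Sketch of decoding one 4-byte UTF-8 sequence into the UTF-16
// surrogate pair that JavaScript strings use (not V8's real code).
function decode4(b0, b1, b2, b3) {
  // Reassemble the 21-bit code point from the payload bits.
  var cp = ((b0 & 0x07) << 18) | ((b1 & 0x3F) << 12) |
           ((b2 & 0x3F) << 6)  |  (b3 & 0x3F);
  // Split into a surrogate pair (cp is above U+FFFF by construction).
  var offset = cp - 0x10000;
  var hi = 0xD800 + (offset >> 10);    // high (lead) surrogate
  var lo = 0xDC00 + (offset & 0x3FF);  // low (trail) surrogate
  return String.fromCharCode(hi, lo);
}

decode4(0xF0, 0x9D, 0x84, 0x9E); // "\uD834\uDD1E", i.e. U+1D11E
```

If the same bytes are instead decoded naively (or passed through twice), the result diverges from this, which would explain the inconsistency between entry paths.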

--
v8-dev mailing list
[email protected]
http://groups.google.com/group/v8-dev